



2016 2nd IEEE International Conference on Computer and Communications

A Hybrid Method for Service Identification of SSL/TLS Encrypted Traffic

Rusheng Ding, Wenmin Li
State Key Laboratory of Networking and Switching Technology
Beijing University of Posts and Telecommunications
Beijing, China
e-mail: rosending@163.com, liwenmin02@outlook.com
978-1-4673-9026-2/16/$31.00 ©2016 IEEE

Abstract-Analyzing encrypted traffic is important for addressing potential network security problems. As one of the most popular encryption protocols, SSL/TLS supports many network security functions, while network traffic has become complex and diverse. However, most research on SSL/TLS traffic works at the protocol level, whereas effective network traffic management requires identification and classification of services in real time. In this paper, we propose a hybrid method to identify and classify services in real time. We first filter the HTTPS traffic with the C4.5 decision tree machine learning algorithm, then classify the services with random forest, and preprocess the traffic data that we obtain. Experimental results show that the accuracy reaches more than 95%.

Keywords-classification, SSL/TLS, services, machine learning

I. INTRODUCTION

The Internet is developing at an incredible pace, and an increasing number of network protocols encrypt the payload to protect the privacy and integrity of our data. The Transport Layer Security (TLS) [1] protocol is one of the most popular encryption protocols. It is the successor of the Secure Socket Layer (SSL) protocol and uses public and private keys to encrypt data. We meet it in many cases, and HTTPS, which many websites use, is one of the most popular applications of SSL/TLS. These protocols effectively protect our information security in the network. However, an encryption protocol protects not only the security of network information but also some illegitimate behavior. Most research on SSL/TLS traffic works at the protocol level, whereas effective network traffic management requires identification of services in real time. We therefore need to identify the services of SSL/TLS encrypted traffic.

In this paper, we propose a hybrid method to identify the application services of SSL/TLS. First, we filter the HTTPS traffic data. Second, the HTTPS data is analyzed to determine the application services. Meanwhile, we preprocess the traffic data and take the characteristic values from the application data only, excluding the handshake.

The remainder of this paper is structured as follows. Section II outlines the related work, and Section III describes our approach to application service classification. Section IV presents the test data and experimental results. Section V concludes the paper.

II. RELATED WORK

Recent research on SSL/TLS traffic classification has mainly concentrated on two aspects. One is to detect whether the traffic is SSL/TLS or not. The other is to classify different applications, such as HTTPS, SMTPS, etc.

The work in [2] is designed to detect TLS network traffic within unknown traffic. The algorithms the authors used include AdaBoost, C4.5, RIPPER and Naive Bayes, and they extract packet length, interval time, duration and packet count as characteristic values for identification. In [3], the authors used statistics of packet interval time and packet size (such as the maximum, minimum, mean and standard deviation) as features. In addition, they introduce the first few bytes of a packet as one of the characteristics for semi-supervised analysis. To compare the packets received continuously with a known data set, they used Euclidean distance and Hamming distance, and in this way obtained an accuracy between 80% and 94%.

In [4, 5], the traffic data is preprocessed before classification. [4] first uses TLS pattern matching to filter out all the TLS traffic, then uses Naive Bayes to classify the TLS traffic, and obtains a classification precision for HTTPS and TOR traffic between 93% and 96%. [5] processes the traffic in more detail. The authors take HMAC and TCP segments into account to obtain more realistic features of the traffic, and their accuracy for HTTP, SMTP, IRC and POP3 classification is 10% to 40% higher than that of [4]. The authors of [6] took the application services of SSL/TLS into consideration. They use the public information fields of the certificate exchanged in the SSL/TLS handshake for classification, and the accuracy for the Google, Facebook and Kakaotalk application services reached 90%. The studies above work in an offline environment and obtain features from all the packets of a flow, so they cannot classify traffic in real time. In addition, a certificate may be used by more than one application, or an application may use more than one certificate, and this may affect the result.

In [7], the authors proposed a method that extracts statistical information from only the first few packets to identify SSL applications, and the experimental result reaches 85% accuracy. [8] proposes a hybrid method that analyzes only the first few packets of a flow, on the assumption that the features of the first few packets in a flow carry enough information for classification. With k-means and k-nearest
neighbor, the accuracy of HTTP, SMTP, POP3 and Skype classification reaches more than 94%. Although the studies above have achieved good results, what they classify is the applications on top of SSL/TLS, and they do not address the application services of SSL/TLS.

III. HYBRID METHOD

To identify the application services of SSL/TLS traffic and classify its content, we use a hybrid method that is divided into two steps. First, C4.5 [9] is used to filter out the HTTPS data. Then random forest [10] is used to classify the application services of HTTPS, and the traffic data is preprocessed before classification.

The basic units of Internet traffic are IP flows. A flow is a sequence of packets exchanged between a pair of endpoints. A flow is identified by a 5-tuple: source IP address, source port number, destination IP address, destination port number and transport layer protocol. According to the characteristics of network traffic, we choose packet length and packet interval time as features, as listed in Table I.

TABLE I. FEATURES USED OF FLOW

Feature name      Description
Pkt_len_min       The minimum length of packets
Pkt_len_max       The maximum length of packets
Pkt_len_mean      The average length of packets
Pkt_len_sd        The standard deviation of length
Pkt_inter_min     The minimum interval time
Pkt_inter_max     The maximum interval time
Pkt_inter_mean    The average interval time
Pkt_inter_sd      The standard deviation of interval time

We used the C4.5 decision tree algorithm to filter out the HTTPS traffic data. We chose C4.5 because of its stability and speed in network traffic classification.

Figure 1. Identification accuracy of first n packets

To classify in real time, we cannot wait for the whole flow before classification. Since the first few packets of a flow carry enough information about the flow, we extracted the first n (1, 2, 3, ...) packets of a flow as the sample. However, achieving real-time classification affects accuracy, so a balance must be found between accuracy and real-time performance. We extracted the first few packets from a flow and discarded the packets that belong to the TCP connection handshake. The experimental results obtained in this way are illustrated in Fig. 1.

As Fig. 1 shows, when we extract the first four packets as the sample, accuracy reaches its maximum value of 99.3%, which is a turning point. In other words, we can obtain considerable accuracy in real time when we extract the first four packets.

The second step of this method is the multi-class classification of application services. We use the random forest machine learning algorithm to perform this step. A decision tree can overfit in multi-class classification and needs additional processing to alleviate this, whereas random forest performs better in this respect.

Figure 2. The interval time distribution of packets

Figure 3. The result of HTTPS application services classification

The SSL/TLS protocol is divided into five sub-protocols: the Record Protocol, three handshaking protocols and the Application Data Protocol. Depending on the client and server configuration (e.g., usage and size of certificates), the number of packets exchanged during connection establishment varies. Additionally, the contents (except keying material) of the handshake messages of client and server are identical, even if a TLS connection is used by different applications [5]. We therefore obtain characteristic values only from application data messages. In a real environment, many factors can affect the classification results. Network jitter and delay cause the interval time of packets to change suddenly. We observed some traffic data and found that most interval times are relatively short and distributed in a certain range, as shown in Fig. 2. In the figure, the vertical coordinate represents the interval time in seconds, while the horizontal coordinate represents the sequence of interval times. However, a few abnormal values could affect our classification result. We preprocess the interval times by removing the maximum from the interval time list, which reduces the interference of outliers with minimal impact on the real features. The experimental result is shown in Fig. 3.
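The feature computation described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (flow_key, drop_max_interval, flow_features) are hypothetical, the input is assumed to be already restricted to application data packets of one flow (handshake packets removed), the default n=4 is only illustrative, and the population standard deviation is assumed because the paper does not say which estimator is used.

```python
from statistics import mean, pstdev


def flow_key(src_ip, src_port, dst_ip, dst_port, proto):
    """Build a direction-independent key from the 5-tuple.

    Assumption: both directions of a connection belong to the same flow,
    so the two endpoints are stored in sorted order.
    """
    a, b = (src_ip, src_port), (dst_ip, dst_port)
    return (min(a, b), max(a, b), proto)


def drop_max_interval(intervals):
    """Remove one occurrence of the largest interval (outlier suppression).

    Lists with fewer than two values are returned unchanged so that at
    least one interval remains.
    """
    cleaned = list(intervals)
    if len(cleaned) >= 2:
        cleaned.remove(max(cleaned))
    return cleaned


def _summary(values, prefix):
    """Min, max, mean and standard deviation under Table I style names."""
    return {
        f"{prefix}_min": min(values),
        f"{prefix}_max": max(values),
        f"{prefix}_mean": mean(values),
        f"{prefix}_sd": pstdev(values),  # population SD (assumption)
    }


def flow_features(packets, n=4):
    """Compute the eight Table I features from the first n packets.

    packets: list of (timestamp_seconds, length_bytes) tuples in arrival
    order, assumed to contain application data packets only.
    """
    sample = packets[:n]
    if not sample:
        raise ValueError("flow has no packets")
    lengths = [length for _, length in sample]
    times = [ts for ts, _ in sample]
    intervals = drop_max_interval([b - a for a, b in zip(times, times[1:])])
    if not intervals:
        intervals = [0.0]  # a single packet has no interval (assumption)
    features = _summary(lengths, "Pkt_len")
    features.update(_summary(intervals, "Pkt_inter"))
    return features


if __name__ == "__main__":
    # Hypothetical flow: one long pause (4.0 s) among short intervals.
    demo = [(0.0, 517), (0.25, 1460), (0.5, 1460), (4.5, 310), (4.75, 96)]
    print(flow_features(demo, n=5))
```

Removing only the single largest interval keeps the step cheap enough for real-time use. A percentile-based cut would be an alternative, but it is not what the text describes.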
In Fig. 3, accuracy 1 is the classification result with the traffic preprocessed, and accuracy 2 is the classification result with the traffic data left unprocessed. It shows that preprocessing the traffic data improves accuracy, and that the highest accuracy is obtained at the fifth packet.

IV. TEST DATA AND EXPERIMENTAL RESULTS

This section introduces the test data and discusses the final classification results of application services.

A. Test Data

We mainly used two data sets in our experiment. One consists of the traces [11] collected on the edge router of the campus network of the University of Brescia on three consecutive working days (09/30, 10/01 and 10/02) [12, 13]. These traces are mainly composed of TCP (99%) and UDP traffic and correspond to around 79000 flows in total. The traffic includes Web (HTTP and HTTPS), Mail (POP3, IMAP4, SMTP and their SSL variants), Skype, traffic generated by Peer-to-Peer applications such as BitTorrent and Edonkey, and other protocols (FTP, SSH and MSN) [14]. Details are reported in Table II. We call these traces data set 1. The other data set is one that we generated, which contains Webmail, Online payment, Media application and Others. Details are reported in Table III. We call these traces data set 2.

TABLE II. COMPOSITION OF THE UNIBS 2009-TRACE

Class of protocols    Flows    Bytes
Web                   61.2%    12.5%
Mail                  5.7%     0.2%
P2P (Bittorrent)      9.3%     15.9%
P2P (Edonkey)         18.4%    70.2%
Skype (TCP)           1.4%     1.0%
Skype (UDP)           3.8%     0.0%
Other                 0.2%     0.2%
Total                 78998    27G

TABLE III. COMPOSITION OF THE DATA SET 2

Application service    Flows
Webmail                31.9%
Online payment         37.8%
Media application      22.9%
Others                 7.4%
Total                  1311

B. Experimental Result

Figure 4. The classification result of data set 1

Figure 5. The classification result of data set 2

Fig. 4 shows the classification result of data set 1. The accuracy of all application services is more than 95%. However, the recall of Mail is the highest, while that of the others is between 86% and 91%. Through our observation of data set 1, we find that Mail traces account for the largest share of the HTTPS traces (about seventy percent), while the share of the other applications is relatively low.

Because the applications in data set 1 do not cover a wide range and their distribution is not balanced, we generated data set 2 for another classification. The classification result of data set 2 is shown in Fig. 5. The result shows that the accuracy and recall of the application services reach more than 95%, which meets our expectations.

V. CONCLUSION

This paper proposed a hybrid method to classify the application services of SSL/TLS traffic. First, we filter out HTTPS traffic with C4.5. Then we classify the application services with random forest. We experimented with public data sets and another data set that we generated, and got satisfying
results. It shows that our method can effectively improve the accuracy and recall of application service classification.

In the future, more application services over SSL/TLS should be identified and classified.

ACKNOWLEDGMENT

This work is supported by NSFC (Grant Nos. 61300181, 61502044) and the Fundamental Research Funds for the Central Universities (Grant No. 2015RC23).

REFERENCES

[1] http://www.rfc-base.org/rfc-5246.html
[2] C. McCarthy and A. Zincir-Heywood, "An investigation on identifying SSL traffic," in Computational Intelligence for Security and Defense Applications (CISDA), 2011 IEEE Symposium on, April 2011, pp. 115-122.
[3] H. Liu, Z. Wang, and Y. Wang, "Semi-supervised Encrypted Traffic Classification Using Composite Features Set," Journal of Networks, vol. 7, no. 8, 2012, pp. 1195-1200.
[4] G.-L. Sun, Y. Xue, Y. Dong, D. Wang, and C. Li, "An Novel Hybrid Method for Effectively Classifying Encrypted Traffic," in GLOBECOM, IEEE, Dec 2010, pp. 1-5.
[5] Chris Richter, Michael Finsterbusch, Klaus Hänßgen and Jean-Alexander Müller, "Classification of TLS Applications", ICIMP 2014: The Ninth International Conference on Internet Monitoring and Protection.
[6] Sung-Min Kim, Young-Hoon Goo, Myung-Sup Kim, Soo-Gil Choi, Mi-Jung Choi, "A Method for Service Identification of SSL/TLS Encrypted Traffic with the Relation of Session ID and Server IP", Network Operations and Management Symposium (APNOMS), 2015 17th Asia-Pacific, pp. 487-490.
[7] Bernaille L, Teixeira R. Early recognition of encrypted applications. In Proceedings of the 8th International Conference on Passive and Active Network Measurement, PAM'07. Springer-Verlag: Berlin, Heidelberg, 2007; 165-175.
[8] Bar-Yanai R, Langberg M, Peleg D, Roditty L. Realtime classification for encrypted traffic. In Experimental Algorithms, Lecture Notes in Computer Science, vol. 6049, Springer Berlin Heidelberg: Berlin, Germany, 2010. 373-385.
[9] Quinlan J R. C4.5: Programs for machine learning. San Mateo: Morgan Kaufmann Publishers Inc, 1993: 27-48.
[10] Breiman L. Random forests[J]. Machine Learning, 2001, 45(1): 5-32.
[11] http://netweb.ing.unibs.it/~ntw/tools/traces/download
[12] M. Dusi, F. Gringoli and L. Salgarelli, "Quantifying the accuracy of the ground truth associated with Internet traffic traces", Elsevier Computer Networks, Vol. 55, No. 5, pp. 1158-1167, April 2011.
[13] F. Gringoli, L. Salgarelli, M. Dusi, N. Cascarano, F. Risso and K.C. Claffy, "GT: picking up the truth from the ground for Internet traffic", ACM SIGCOMM Computer Communication Review, Vol. 39, No. 5, pp. 13-18, Oct. 2009.
[14] http://netweb.ing.unibs.it/~ntw/tools/traces