Active Learning for Network Traffic Classification: A Technical Study - arXiv

IEEE TRANSACTIONS ON COGNITIVE COMMUNICATIONS AND NETWORKING                                                                                   1

                                         Active Learning for Network Traffic Classification:
                                                        A Technical Study
                                                                  Amin Shahraki, Mahmoud Abbasi, Amir Taherkordi and Anca Delia Jurcut

                                          [Note: This work has been submitted to the IEEE Trans-                       networks and maintain their performance, such as Monitor-
                                        actions on Cognitive Communications and Networking jour-                       Analyze-Plan-Execute (MAPE), and Observe-Orient-Decide-
                                        nal for possible publication. Copyright may be transferred                     Act (OODA) [2].
                                        without notice, after which this version may no longer be                         In networking, the process of analyzing the network traffic
                                        accessible]                                                                    behavior is mainly known as Network Traffic Monitoring and
arXiv:2106.06933v2 [cs.NI] 5 Aug 2021

                                           Abstract—Network Traffic Classification (NTC) has become an                 Analysis (NTMA) [3]. NTMA has attracted much interest
                                        important feature in various network management operations,                    in recent years and become an important research topic in
                                        e.g., Quality of Service (QoS) provisioning and security services.
                                        Machine Learning (ML) algorithms as a popular approach for                     the field of communication systems and networks [4]. The
                                        NTC can promise reasonable accuracy in classification and deal                 importance of NTMA lies in the properties and challenges
                                        with encrypted traffic. However, ML-based NTC techniques                       of modern networking, e.g., heterogeneity, complexity, and
                                        suffer from the shortage of labeled traffic data which is the                  dynamicity, resulting in instability in data transmission [5].
                                        case in many real-world applications. This study investigates the              NTMA is an essential approach to measure the performance of
                                        applicability of an active form of ML, called Active Learning
                                        (AL), in NTC. AL reduces the need for a large number of                        applications and services, and to discover network inefficien-
                                        labeled examples by actively choosing the instances that should                cies. Indeed, NTMA allows us to shed light on the functioning
                                        be labeled. The study first provides an overview of NTC and                    of communication systems and to deal with unexpected events,
                                        its fundamental challenges along with surveying the literature                 especially in complex and large-scale networks, such as the
                                        on ML-based NTC methods. Then, it introduces the concepts of                   Internet.
                                        AL, discusses it in the context of NTC, and reviews the literature
                                        in this field. Further, challenges and open issues in AL-based                    NTMA applications are generally categorized into eight
                                        classification of network traffic are discussed. Moreover, as a                groups, including Network Traffic Classification (NTC), traffic
                                        technical survey, some experiments are conducted to show the                   prediction, fault management, network security, traffic routing,
                                        broad applicability of AL in NTC. The simulation results show                  congestion control, resource management, and Quality of
                                        that AL can achieve high accuracy with a small amount of data.                 Service (QoS) and Quality of Experience (QoE) management
                                                                                                                       [6]. In this study, we focus on NTC as an important and open
                                          Index Terms—Survey, Network Traffic Classification, Active                   issue in NTMA. NTC refers to techniques for categorizing
                                        Learning, Machine Learning, NTMA
                                                                                                                       network traffic into different classes based on their properties.
                                                                                                                       The classification of network traffic is highly beneficial in
                                                                  I. INTRODUCTION                                      various network services from QoS (e.g., traffic policing and
                                           During the last decades, emerging new networking                            shaping) and pricing to malware and intrusion detection [7].
                                        paradigms, such as Internet of Things (IoT), have introduced                   NTC provides detailed knowledge on network traffic, which
                                        various network management challenges. Given the prolif-                       is very useful for those who investigate the changes in traffic
                                        eration of IoT devices and the distinguishing characteristics                  characteristics and long-term requirements of networks [8],
                                        of IoT traffic, such as heterogeneity, spatio-temporal depen-                  e.g., Network Management and Orchestration (NMO) tools,
                                        dencies, dominating uplink traffic, and low duty-cycle traffic                 and performance management models.
                                        patterns, network management and monitoring has become                            NTC techniques can be broadly grouped into three cate-
                                        challenging. Gaining deep insight into such complex networks                   gories: port-based, payload-based, and flow-based methods
                                        for performance evaluation and network planning purposes is                    [9]. Port-based techniques associate a standard port number
                                        not a trivial task with respect to processing time, human effort,              to a service or application, while payload-based methods
                                        and computational overhead. Understanding network traffic                      carefully inspect the content of the captured packets to classify
                                        behavior plays a vital role in a wide variety of network man-                  them. Last but not least, flow-based techniques utilize the
                                        agement aspects, e.g., fault management, accounting, security,                 network traffic flow characteristics (e.g., round-trip time and
                                        and network performance management [1]. Some general                           inter-arrival times) to associate produced traffic to the related
                                        approaches have been introduced to analyze the behavior of                     sources. The two latter methods cannot be used in some
                                                                                                                       network types (e.g., Virtual Private Network (VPN)), or violate
                                          Amin Shahraki is with School of Computer Science, University College         the privacy of users by accessing their personal data. Flow-
                                        Dublin, Ireland. Corresponding author e-mail: (am.shahraki@ieee.org)
                                          Mahmoud Abbasi was with Department of Computer Sciences, Islamic             based techniques are the most common techniques for NTC
                                        Azad University, Mashhad, Iran, email: mahmoud.abbasi@ieee.org                 as instead of inspecting all packets passing through a given
                                          Amir Taherkordi is with the Department of Informatics (IFI), University of   link, they examine network traffic flows or an aggregated form
                                        Oslo, Norway. email: amirhost@ifi.uio.no
                                          Anca Delia Jurcut is with Department of Computer Sciences, University        of the network header packets information. As a result, the
                                        College Dublin, Dublin, Ireland, email: anca.jurcut@ucd.ie                     volume of data needed to be examined will be reduced, and

the encrypted traffic will no longer be a problem. Flow-based        CFS        Correlation based Feature Selection
techniques assume that each application’s traffic has almost         CNN        Convolutional Neural Network
unique statistical or time-series features that can be utilized      DDoS       Distributed Denial of Service
by classifiers to categorize both encrypted and regular traffics.    DL         Deep Learning
   In flow-based methods, the traffic classifier may leverage        DPI        Deep Packet Inspection
Machine Learning (ML) algorithms to automate the classifica-         EER        Expected error reduction
tion process, discover different traffic patterns produced by de-    GAN        Generative Adversarial Network
vices, and classify encrypted traffic. Although ML algorithms        i.i.d      Identically and independently distributed
are powerful techniques to classify network traffic flows [10],      IDSs       Intrusion Detection System
[11], the accuracy of learning-based approaches is limited by        IoT        Internet of Things
their need for a massive number of labeled instances. As the         LAL        Learning Active Learning
authors in [12] mentioned, most of the real-world application        LSTM       Long Short-Term Memory
data is semi-labeled or unlabeled data. Moreover, the data           M2M        Machine-to-Machine
labeling process for ML tasks can be challenging in terms            MAPE       Monitor-Analyze-Plan-Execute
of human effort and cost [13].                                       ML         Machine Learning
   Fortunately, Active Learning (AL), as a sub-field of ML, is       MLP        Multi-layer Perceptron
a promising approach to deal with the need for a huge amount         NMO        Network Management and Orchestration
of labeled instances. AL aims to reduce the need for labeled         NTC        Network Traffic Classification
examples by intelligently querying the labels during training.       NTMA       Network Traffic Monitoring and Analysis
The query goes for the examples that the AL algorithm                OODA       Observe-Orient-Decide-Act
believes will help build the best model [14]. Therefore, based       P2P        Peer-to-peer
on the aforementioned challenges, AL can be considered as            QBC        Query-By-Committee
an appropriate and efficient technique for flow-based NTC.           QoE        Quality of Experience
Providing a thorough study on the usefulness of AL in NTC            QoS        Quality of Service
and reviewing the state-of-the-art techniques in this field can      RAE        Relief Attribute Evaluation
significantly help the network research community in better          RAL        Reinforcement AL
adoption of AL for classification of network traffic in various      RL         Reinforcement Learning
domains. To the best of our knowledge, this is the first and only    SDAE       Stacked Denoising Autoencoder
study that technically reviews the efficiency and importance of      SDN        Software Defined Networking
AL for NTC along with surveying the literature in this field.        SFEM       Structural Feature Extraction Methodology
In this paper, we study the NTC techniques and discuss AL            SVDD       Support Vector Data Description
as a useful approach in this field. The main contributions of        SVM        Support Vector Machine
our work are summarized as follows:                                  TLS        Transport Layer Security
                                                                     UNC        Uncertainty sampling
   • Discussing NTC techniques and their correlations with           VAE        Variational Autoencoder
     ML techniques                                                   VPN        Virtual Private Network
   • Reviewing existing work in AL-based NTC                         WSNs       Wireless Sensor Networks
   • Empirical evaluation of the performance of AL for NTC
     purposes
   • Discussing the challenges, and future directions in using                     II. RELATED SURVEY ARTICLES
     AL for NTC                                                        There exist several literature studies reviewing the use
The rest of this paper is structured as follows: In Section         of ML techniques in communication systems and wireless
II, we review existing survey works on traffic classification       networks, e.g., [15], [16]. There are also some surveys that
techniques. In Section III, we provide an overview of the           focus on specific ML techniques, e.g., Deep Learning (DL)
NTC problem and the use of ML techniques. Then, we devote           [17] and Reinforcement Learning (RL) [18], or specific types
Section IV to discussing the fundamental elements of AL and         of networking, e.g., Software Defined Networking (SDN) [19]
query strategies. Next, in Section V, we discuss the advantages     and optical networks [20]. Moreover, some survey works
of using AL for NTC purposes and carry out a literature review      compare, evaluate or review different techniques, e.g., ML-based
on this topic. In Section VI, we evaluate the performance of        techniques, heuristic models and statistical-based techniques
AL in NTC. In Section VII, we discuss the challenges and            for NTC, e.g., [21]. Considering the volume of survey literature
future directions in using AL for NTC, and finally we conclude      in this field, in this section, we focus only on surveys that
the paper in Section VIII.                                          review NTC or the use of various ML techniques in NTC.

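The AL idea outlined in the Introduction (query labels only for the examples the learner believes will help most) can be illustrated with a minimal pool-based sketch. It is not the method evaluated in this study: it assumes least-confidence uncertainty sampling, a simple nearest-centroid classifier, and a synthetic two-class pool with two hypothetical scaled flow features (mean packet size and mean inter-arrival gap). All function names and values are illustrative.

```python
import math
import random

def centroids(X, y):
    """Mean feature vector per class, computed from the labeled examples."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}

def predict_proba(cents, x):
    """Class probabilities: softmax over negative distances to the centroids."""
    scores = {c: -math.dist(x, m) for c, m in cents.items()}
    top = max(scores.values())
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}

def least_confident(cents, pool):
    """Index of the pool example whose most likely class has the lowest probability."""
    return min(range(len(pool)),
               key=lambda i: max(predict_proba(cents, pool[i]).values()))

def active_learning(pool, oracle, seed_idx, budget):
    """Pool-based AL: start from a few labeled seeds, then query `budget` labels."""
    seeds = set(seed_idx)
    labeled_X = [pool[i] for i in seed_idx]
    labeled_y = [oracle(x) for x in labeled_X]
    unlabeled = [x for i, x in enumerate(pool) if i not in seeds]
    for _ in range(budget):
        cents = centroids(labeled_X, labeled_y)
        x = unlabeled.pop(least_confident(cents, unlabeled))
        labeled_X.append(x)
        labeled_y.append(oracle(x))  # the only point where a label is requested
    return centroids(labeled_X, labeled_y), len(labeled_y)

# Synthetic pool (assumption): 100 "bulk" and 100 "interactive" flows.
random.seed(0)
truth, pool = {}, []
for label, (size, gap) in {"bulk": (1.2, 0.1), "interactive": (0.2, 0.6)}.items():
    for _ in range(100):
        x = (random.gauss(size, 0.15), random.gauss(gap, 0.15))
        truth[x] = label
        pool.append(x)

# One labeled seed per class plus 8 queried labels, out of 200 flows.
model, n_labels = active_learning(pool, truth.get, seed_idx=[0, 100], budget=8)
predicted = [max(predict_proba(model, x).items(), key=lambda kv: kv[1])[0] for x in pool]
accuracy = sum(p == truth[x] for p, x in zip(predicted, pool)) / len(pool)
```

Here the oracle is a dictionary lookup standing in for a human annotator; in practice each oracle call corresponds to the labeling cost that AL tries to keep small.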
                  LIST OF ABBREVIATIONS                               •   General literature reviews on NTC: In [22], Dainotti
 AL         Active Learning                                               et al. reviewed the issues and future research directions
 ALBL       AL by learning                                                of NTC, especially in terms of applicability, reliability
 ASVM       AL Support Vector Machine                                     and privacy. They outlined the research and policy future
                                                                          directions of NTC, e.g., validating the NTC models, effect

     of network speed in NTC and NTC tools. In [23], Fin-                 Pacheco et al. comprehensively surveyed the use of ML
     sterbusch et al. reviewed the payload-based NTC based                techniques in NTC for different cases, e.g., encrypted
     on Deep Packet Inspection (DPI). They also practically               network traffic. By understanding the challenges of using
     analysed the most significant open-source DPI modules                ML techniques in NTC, they studied the reliable label
     to show their performance in terms of accuracy and                   assignment, dynamic feature selection, and integration of
     requirements. Additionally, they provided a guideline on             meta-learning processes. They considered these solutions
     how to design and implement DPI-based NTC modules.                   to solve several issues, including imbalanced network data,
     In [24], Velan et al. studied NTC models for encrypted               dynamicity of networks, and online strategies for re-
     network traffics to measure the traffic and improve the              training the ML models.
     security, e.g., detecting anomalies. They have reviewed            In Table I, a summary of the surveys above is provided
     different types of encrypted traffics and how payload-          based on their vision of NTC, the reviewed solutions, network
     based and feature-based NTC techniques can classify             type and practical evaluation of studied solutions. As indicated
     encrypted network traffics. Zhao et al. [7] reviewed the        in the table, our survey is for flow-based NTC for the use
     use of NTC in IoT and Machine-to-Machine (M2M)                  in Internet communications and specifically considers AL as
     networks. They reviewed the current NTC within the IoT          one of the most important ML-based solutions. To the best
     context based on the differences between IoT and non-IoT        of our knowledge, our study is one of the rare literature
     network traffics. By reviewing the literature, the authors      surveys that evaluates such specific ML solutions for NTC as
     showed that in IoT research area, most of NTC techniques        most of existing surveys consider general ML models, e.g.,
     are proposed to solve security challenges. The authors in       supervised learning solutions for NTC. Studying AL-based
     [25] reviewed the NTC techniques, i.e., statistics-based        solutions makes our work different from all existing survey
     classification, correlation-based classification, behavior-     works.
     based classification, payload-based classification, and
     port-based classification. They also quantified classifica-
                                                                                  III. OVERVIEW ON NTC AND ML
     tion granularity based on four levels, i.e., application type
     layer, protocol layer, application layer and service layer.        In NTC, one should clarify the goals of classification based
     Last but not least, they classified network traffic features    on the intended use, such as for accounting purposes, malware
     and the existing public datasets that are commonly used         detection, intrusion detection, providing QoS, and identifying
     in the proposed NTC techniques.                                 types of applications based on the network traffic (e.g., VPN
 •   Literature reviews on the use of ML in NTC: As one of           and nonVPN traffics or Tor and nonTor traffics). Indeed, there
     the earliest studies on the use of ML in NTC, Nguyen et         are different factors that one can use to categorize network
     al. [26] reviewed the literature from the years 2004 to         traffic, including applications (e.g., Facebook and Hangouts),
     2007. They studied how ML models can be employed                protocols (e.g., HTTP and BitTorrent), traffic types (e.g., Web
     for NTC in IP networks, e.g., clustering approaches,            Browsing and Chat), browsers (e.g., Firefox and Chrome),
     supervised learning approaches and hybrid approaches.           operating systems, and websites. Therefore, the purpose is to
     They also reviewed the literature that compares ML              determine the correct label of each network flow, e.g., browsing,
     techniques or non-ML techniques for NTC. They mentioned         interactive, and video stream. NTC can be further categorized
     that offline analysis models, e.g., AutoClass, Decision         into online and offline classification. In online NTC, the input
     Tree and Naive Bayes can achieve a high accuracy of             traffic needs to be classified in a real-time or near real-time
     about 99%. They also outlined some critical operational         manner (e.g., QoS provisioning). On the other hand, offline
     requirements for real-time NTC models compared to               classification is appropriate for applications such as anomaly
     offline models. In [21], Singh evaluated the unsupervised       detection and billing systems. Despite their importance, exist-
     ML techniques including K-means and Expectation Max-            ing NTC techniques suffer from general networking challenges
     imization algorithm for NTC. The results show that the          as listed below:
     accuracy of K-Means is better than Expectation                     • While the literature on traffic classification is mature
     Maximization algorithm. In [27], Perera et al. compared six           enough to handle old-fashioned networking paradigms, e.g.,
     ML algorithms including Naive Bayes, Bayes Net, Naive                 legacy cellular systems, the dramatic growth and evolution
     Bayes Tree, Random Forest, Decision Tree and Multi-layer              of online applications and services have made traffic
     Perceptron along with two feature extraction techniques,              classification a non-trivial task. Due to the traffic
     i.e., Correlation based Feature Selection (CFS)                       characteristics of modern networks, e.g., being large-scale,
     and Relief Attribute Evaluation (RAE). Their results show             heterogeneity, multimodal data, and big data, emerging
     that Decision Tree and Random Forest have better perfor-              NTC methods must meet strict requirements in terms
     mance compared to other techniques. In [28], Gomez et                 of system performance, accuracy, and robustness. For
     al. compared seven ensemble ML techniques including                   example, the vast amount of raw data generated by IoT
     OneVsRest, OneVsOne, Error-Correcting Output-code,                    and cellular devices poses severe challenges to ML-based
     Adaboost classifier, Bagging algorithm, Random Forest                 NTC methods as they need clean and pre-processed data
     and Extremely Randomized Trees which are all based                    for training purposes.
     on decision trees in NTC. They compared them in terms              • NTC is a multi-factor procedure in which an automated
     of model accuracy, latency and byte accuracy. In [29],                program categorizes the network traffic based on the

                                                 Table I: An overview of existing literature surveys on NTC and ML.

Study      Year   NTC vision                                                                              Reviewed Solution(s)                                Type of network          Practical Evaluation
[26]       2008   Analysing Statistical traffic Characteristics                                           ML Solutions                                        IP networks              No
[22]       2012   General NTC                                                                             Not Specified                                       TCP Networks             No
[23]       2014   Payload-Based NTC techniques                                                            DPI-based techniques                                Internet                 Yes
[21]       2015   Comparative Study                                                                       Comparing unsupervised ML techniques                Internet                 Yes
[24]       2015   Analysing encrypted network traffic by payload-based and feature-based NTC technique   ML techniques and hybrid techniques                 Not Specified            No
[27]       2017   Comparative Study                                                                       Comparing six ML Solutions                          Communication Networks   Yes
[28]       2017   Comparative Study                                                                       Comparing Decision-tree based ensemble techniques   Internet                 Yes
[29]       2018   ML-based NTC                                                                            Most existing ML solutions                          IP Networks              No
[7]        2020   NTC for M2M network traffic                                                             Generic solutions                                   IoT                      No
[25]       2021   Reviewing various types of NTC models                                                   Most existing ML solutions                          Internet                 No
Our Study  2021   Flow-based NTC                                                                          Active Learning                                     Internet                 Yes

     network traffic features, e.g., types of network protocols,                                    A. Data gathering
     applications, hosts, etc. As a challenge, NTC techniques                                          Since ML algorithms learn to classify the data based on
     need to select the best features to classify the network                                       sample datasets, representative data must be collected as the
     traffic with high accuracy, while each of them can be ef-                                      data gathering step. While a few publicly available network
     ficient or inefficient from one network to another network.                                    traffic datasets have been released, using these to train a
     In other words, feature engineering is a challenge when                                        traffic classification model can be difficult [33]. In addition,
     it comes to using classical ML for traffic classification.                                     since the behavior of the network traffic is different from one
  • The recent increase of encrypted network traffic and                                            network to another one, it is highly recommended to train
     protocol encapsulation methods limit the effectiveness                                         the ML algorithm for the target network [2]. Additionally,
     of many traffic classification techniques since the packet                                     the number of network traffic classes can be high, and it is
     inspection techniques are unable to extract network man-                                       rather impractical to consider all classes in one public dataset.
     agement information from network traffics. For example,                                        Furthermore, there are a variety of data gathering and labeling
     a significant portion of the Internet traffic is associated                                    techniques that lead to different feature sets. Hence, in real-
     with Peer-to-peer (P2P) applications. However, classifi-                                       world applications, the goal is to use datasets that are tailored
     cation of P2P traffic is a difficult task [7] as many P2P                                      to the intended use of NTC, mainly gathered from the target
     applications, such as online video and P2P downloading,                                        network.
     use encryption and obfuscation protocols to remove the
     limitations posed by Internet service providers.
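As a rough illustration of what gathering flow-level data from a target network can involve, the sketch below groups captured packet records into flows by 5-tuple and derives a few statistical features (packet count, mean packet size, mean inter-arrival time) without touching the payload. The record layout, key names, and the choice of features are assumptions made for this example only, not a format prescribed by any dataset or tool discussed here.

```python
from collections import defaultdict
from statistics import mean

def flow_features(packets):
    """Aggregate packet records into per-flow statistical features.

    Each packet is a dict with hypothetical keys: src, dst, sport, dport,
    proto, ts (seconds) and size (bytes). Only header-level fields are used.
    """
    flows = defaultdict(list)
    for p in packets:
        key = (p["src"], p["dst"], p["sport"], p["dport"], p["proto"])
        flows[key].append(p)

    features = {}
    for key, pkts in flows.items():
        pkts.sort(key=lambda p: p["ts"])  # order packets by capture time
        gaps = [b["ts"] - a["ts"] for a, b in zip(pkts, pkts[1:])]
        features[key] = {
            "n_packets": len(pkts),
            "mean_size": mean([p["size"] for p in pkts]),
            "mean_iat": mean(gaps) if gaps else 0.0,  # single-packet flows have no gap
        }
    return features

# Toy capture (made-up values): one three-packet flow and one single-packet flow.
web = dict(src="10.0.0.1", dst="10.0.0.2", sport=5000, dport=443, proto="tcp")
dns = dict(src="10.0.0.1", dst="10.0.0.3", sport=5001, dport=53, proto="udp")
example = [
    dict(web, ts=1.5, size=300),
    dict(web, ts=0.0, size=100),
    dict(dns, ts=0.2, size=60),
    dict(web, ts=0.5, size=200),
]
feats = flow_features(example)
web_key = ("10.0.0.1", "10.0.0.2", 5000, 443, "tcp")
dns_key = ("10.0.0.1", "10.0.0.3", 5001, 53, "udp")
```

Feature vectors of this kind would still need labels before a supervised classifier can be trained on them, which is the cost the labeling discussion in this paper is concerned with.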
                                                                                                    B. Data pre-processing
  To overcome the above challenges, various techniques have
been introduced, e.g., graphical techniques, statistical methods                                       After gathering, the data must be pre-processed such that
and ML-based methods [24]. In the scope of ML, various                                              it is represented in a form in which the target ML algorithms
solutions for port-based, payload-based, and flow-based NTC                                         can discover different patterns. In traffic classification, header
have been proposed as the most promising solutions [30]                                             data and payload are two major data structures. These structures
[31]. Multiple steps are needed for building an ML-based                                            often need to be pre-processed because they contain irrelevant
network traffic classifier as presented in [32]. Figure 1 shows                                     or redundant information, such as network management data,
a graphical description of all steps. In the rest of this section,                                  which is not needed for traffic classification, e.g., source and
we discuss each individual step.                                                                    destination IP addresses, and protocol information. Moreover,
                                                                                                    changes in the distribution of packet-level features can occur
                                                                                                    in real-world environments because of unexpected events like
[Figure 1 diagram. Steps towards building an ML-based                                               the re-transmission of packets. In short, performing some
network traffic classifier: Data gathering, Data                                                    pre-processing steps such as packet filtering, elimination of
pre-processing, Feature engineering, Model selection,                                               noisy samples, header removal, and data quality assessment
Model evaluation.]                                                                                  is needed to ease the learning process for the ML algorithms
                                                                                                    [34].

                                                                                                    C. Feature engineering
                                                                                                       Conventional classification solutions, e.g., ML- and
                                                                                                    statistical-based techniques, need to go through a feature
                                                                                                    engineering procedure, in which domain knowledge is used to
Figure 1: The main steps in building a network traffic classifier.                                  extract features or patterns from the raw data [35], [23].
                                                                                                    Feature engineering is a crucial step in ML-based NTC methods

because choosing appropriate features can ease                              features of network flows generated by different services
the difficulties of the modelling phase, and vice versa [36]. It is         or applications are almost unique. Nevertheless, a big
worth mentioning that, considering privacy, the risk associated             challenge with the methods that use statistical features
with feature engineering and representation procedures is also              is that they are not suitable for online classification. This
crucially important, especially in payload-based                            is mainly because a classifier needs to monitor
techniques. Indeed, there are some legal restrictions on using              the entire network flow, or a significant part of it, in order
payload-based methods in many environments or recognizing                   to extract statistical features.
all communication protocols. This is mainly due to users’
privacy policies, as such methods inspect the content of the
                                                                       D. Model selection
network packets [37].
 Generally, there are four major types of input features for              Another step towards building a traffic classifier is selecting
NTC:                                                                   the right ML model. In the context of ML, choosing a model
                                                                       can carry different meanings, such as the selection of hyper-
  •   Time series: Considering time series related features,           parameters and parameters, as well as algorithm selection.
      one can refer to maximum packet inter-arrival time,              In NTC, several factors can be involved in the selection of
      maximum number of bytes in packet, and inter-packet              the classification model (e.g., model performance, available
      timings. According to [38], the length of time series (or        resources, model complexity, and feature selection). One of
      the number of packets within a flow) has a visible effect        the most significant factors is feature selection. This is due
      on classification accuracy and computational overhead.           to the fact that there is a direct correlation between features
      Specifically, increasing the number of considered packets        and input dimensions of the model, and consequently the
      can improve the classification performance but at the            computational and memory complexities of the model, which
      cost of higher computational overhead. In [38], only             are crucial factors in NTC. This implies that the dimensions
      the first 20 traffic packets in a flow are used for the          and structure of input data for training purposes should be
      experiments. The authors in [39] use the time-series             optimized. Moreover, the selected features directly affect the
      features of packets, e.g., source and destination ports,         performance of the final learning task (e.g., classification and
      payload size, and TCP window size (bytes) as input for           regression) and the dimensions of the input data for training.
      a semi-supervised model to perform traffic classification        Hence, one should consider the right number of informative
      for five Google services, namely Hangout                         features. In the context of traffic classification, it may not be
      Chat, Hangout Voice Call, YouTube, File transfer, and            sufficient to consider the model performance as the only factor
      Google Play Music. The simulation results show excellent         for model selection. Thus, one can also consider other criteria,
      accuracy, despite using a limited number of labeled data         such as training time and model explainability.
      samples. This is mainly because they conducted a pre-
      training step on the entire unlabeled network flows in
                                                                       E. Model Evaluation
      order to learn statistical features, and then they re-trained
      the model using a small labeled dataset for fine-tuning.            The evaluation of the selected model is the final
  •   Header: The header of a network packet contains infor-           step in building a network traffic classifier. In this step, the
      mation related to different layers (e.g., the network layer).    performance of the ML model on unseen data is measured.
      Features such as port number and protocol number are             The ML model should be able to give accurate predictions
      widely used as informative features in traffic classification    to be useful for the given task. However, accuracy is not
      tasks. However, some modern NTC techniques, especially           the only evaluation metric for a classification task, and other
      DL-based ones, accept entire packets as the input feature. For   metrics such as the confusion matrix, F1 score, and recall should
      example, in [40] the authors used hexadecimal raw packet         also be considered. NTC is a classification task, and we use the
      header and convolutional networks to classify Tor/non-           same metrics to evaluate the performance of the proposed
      Tor traffic. To this end, they utilized TCP/IP headers,          model.
      especially the first 54 bytes of packets, because TCP is
      associated with around 90% of all the Internet traffic.          F. Existing Work
  •   Payload: NTC techniques can also use layer-related
      information above the transport layer to classify network           Recently, several ML techniques have been proposed for
      traffic. As a prime example, in [41] the authors utilize         network traffic classification. In this subsection, we categorize
      BitTorrent handshake packets on layer 4 to classify the          existing work in the literature based on the goals of network
      BitTorrent traffic. BitTorrent generates the highest amount of   traffic classification, including identifying applications (also
      P2P traffic. Moreover, some works use packets related to         called apps), cybersecurity purposes, fault detection, website
      the Transport Layer Security (TLS) handshake process to          fingerprinting, user activities identification, and operating
      identify HTTPS services [42].                                    systems identification. We discuss these goals in more detail
  •   Statistical features: The statistical features of network        below.
      flows, such as minimum inter-arrival time and size of               • Mobile apps identification: This goal refers to analyzing
      the IP packets can be used for NTC [43]. The main                      and finally identifying the network traffic related to a
      idea behind using statistical features is that the statistical         particular mobile app. Given the ever-increasing number

     of mobile apps, network administrators and telecommu-               al. utilize federated learning for malware detection in IoT
     nications companies are actively looking for rigorous               devices through one supervised model (based on Multi-
     methods to secure their infrastructure. Apps identification         layer Perceptron (MLP)) and one unsupervised model
     based on analyzing the network traffic of mobile apps can           (based on autoencoder). To evaluate the framework, they
     assist network administrators with resource management              use the N-BaIoT dataset, which models the traffic of IoT
     and planning, and app-specific policy establishment (e.g.,          systems impacted by malware. In [52], McLaughlin et
     security policy establishment and access management for             al. present a DL-based method for Android malware
     a specific app). Furthermore, the identification of apps can        detection using the raw opcode sequence as the in-
     help protect smartphone platforms (e.g., Android) against           put of a CNN model which can automatically learn
     emerging security threats and uncover sensitive apps.               the features of malware instances. The authors claimed
     Moreover, by app identification, it is possible to forbid the       that the proposed method has a more straightforward
     use of some particular apps (e.g., Google+ and Instagram)           training pipeline than the previously proposed works
     in an enterprise network [44]. Several papers have been             (e.g., n-gram-based malware detection). Huang et al. [53]
     published on app identification. Ajaeiya et al. in [44]             combine the unsupervised spatiotemporal encoder with
     present a framework for the classification of Android               LSTM to detect abnormal network traffic. The spatial
     apps. The proposed framework identifies apps traffic from           feature of network traffic data was extracted in the first
     a network viewpoint without adding any overhead on                  stage by the spatiotemporal model. Then, the obtained
     users’ mobile phones. Moreover, the authors provide a               features are used to train another LSTM layer for the
     pre-processing method for traffic flows to extract the              classification purpose. The NSL-KDD dataset was used for
     most informative features for ML-based techniques. The              the evaluation of the model. Based on the experimental
     work in [45] leverages a Variational Autoencoder (VAE) for          results, the proposed DL model achieves significantly
     the identification of mobile apps. The authors claimed              higher intrusion detection efficiency than the
     that their method is able to label a massive number of              traditional techniques.
     instances and extract the features in mobile apps traffic       •   Fault detection: Fault detection is part of a more ex-
     automatically. To this end, the authors first transform the         tensive network management process, called fault man-
     mobile apps traffic to meaningful images, and then use              agement. Fault management points to a set of processes
     VAE as a classifier. Similar work was carried out by Wang           to detect, isolate, and then correct unusual situations of
     et al. in [46], in which the authors design three DL-               a network. Failure occurs when a system (e.g., an IoT
     based models, including Stacked Denoising Autoencoder               network) cannot adequately provide a service, where a
     (SDAE), 1D Convolutional Neural Network (CNN), and                  fault is the root cause of a failure. Fault management,
     Long Short-Term Memory (LSTM) for mobile app                        especially fault detection, plays an essential role in
     identification. The authors in [47] provide a                       today’s network management (e.g., QoS provisioning).
     multi-classification scheme for the classification of mobile apps   Hence, many works have been conducted to improve
     traffic. More specifically, they combine several mobile             the fault management process. In [54], Huang et al.
     traffic classifiers’ decisions (knowledge) to classify their        survey fault detection techniques in IoT networks and
     traffic samples.                                                    introduce a fault-detection framework for Self-Driving
 •   Cybersecurity purposes: One of the main goals of                    Network (SelfDN)-enabled IoT. Moreover, the authors
     traffic classification is detecting security breaches in            propose an algorithm called Gaussian Bernoulli restricted
     communication systems, e.g., intrusion detection, malware           Boltzmann machines auto-encoder to turn fault
     detection, anomaly detection, and worm detection.                   detection into a classification task. The simulation results
     Cybersecurity tools/techniques (e.g., intrusion detection           demonstrate the superiority of the proposed method over
     systems) aim to defend communication systems from                   other adopted methods, such as linear discriminant
     internal/external threats. Traffic classification methods           analysis and SVM. In [55], the authors focus on the problem
     can be used to assess network traffic behavior through              of cell coverage degradation detection through a deep
     detecting malicious traffic flow/link, and then prevent             neural network. They propose a deep recurrent model
     attacks. A large body of work in the literature has                 for diagnosing cell radio performance deterioration and
     focused on ML-based malware and intrusion detection.                complete cell outages in a mobile phone network. In [56],
     The authors in [48] propose an intrusion detection ap-              Noshad et al. adopt the Random Forest classifier for fault
     proach based on deep neural networks and compare the                detection in Wireless Sensor Networks (WSNs). They use
     performance of DL with classical ML classifiers, demon-             a dataset with six types of faults at the sensor levels for
     strating the superiority of DL models. Similarly, in [49],          performance evaluation, such as data loss, offset, and out-
     Shone et al. propose a non-symmetric deep auto-encoder-             of-bounds. Moreover, they compare the performance of
     based learning solution for intrusion detection. The auto-          the proposed method with other well-known techniques,
     encoder network has been used for learning features in              e.g., MLP, CNN, and probabilistic neural networks.
     an unsupervised manner. Then, they employ a stacked             •   Website fingerprinting: It refers to methods for identify-
     non-symmetric auto-encoder as a traffic classifier. In [50],        ing and collecting data about websites visited by a mobile
     Nguyen et al. propose a federated self-learning method to           device, which is essential for the advertising industry,
     detect anomalies in IoT systems. Similarly, in [51], Rey et         identifying the characteristics of attacks (e.g., botnets

     and sniffing) and protecting users’ privacy. Website                 obtain more stable traffic features. Hou et al. in [63]
     fingerprinting can help recognize fraudsters and other               categorize user activities of the WeChat application by
     unusual activities. Moreover, website fingerprinting can             performing a detailed analysis on the encryption protocol
     be considered as a type of traffic analysis attack that              of this application, called MMTLS, to find the typical user
     allows eavesdroppers to get information on the victim’s              activities of the application (e.g., advertisement click and
     activities. Given the importance of website fingerprinting,          browsing moments). Then, they adopt different learning
     there is a large body of literature on this topic. In                algorithms, such as Naive Bayes, Random forest, and
     [57], Rahman et al. leverage the idea of adversarial ML              Logistic Regression, to classify these activities.
     to defend users against website fingerprinting attackers.        •   Operating systems identification: This refers to identi-
     The authors propose a method to generate adversarial                 fying the operating system installed on a mobile device
     examples to reduce the accuracy of the attacks that use              by analyzing its generated traffic. Adversaries can use
     learning-based techniques for robust traffic classification.         operating systems identification to launch more serious
     The simulation results show that the proposed method                 attacks against a specific mobile operating system.
     can reduce the accuracy of the state-of-the-art attack               Moreover, it is desirable to use this analysis to investigate the
     by half. The work in [58] focuses on the concept drift               popularity of the mobile operating systems (e.g., Android
     problem in static website fingerprinting attacks for the             and iOS) among users. Hagos et al. in [64] introduce a
     Tor network. The authors note that, for static attacks, it is        learning-based technique for passive operating systems
     costly to update the dataset and retrain                             fingerprinting. They use classical ML (i.e., Support Vector
     the model. Hence, they introduce AdaWFPA, an adaptive                Machine (SVM), Random Forest, k-nearest neighbors,
     online website fingerprinting attack that leverages                  and Naive Bayes) and DL algorithms (i.e., MLP and
     adaptive stream mining techniques. Luo et al. in [59]                LSTM) for classification purposes. Moreover, the authors
     propose Random Bidirectional Padding (RBP), a website                propose to use the underlying TCP variant as a practical
     fingerprinting obfuscation technique against intelligent             feature for improving classification accuracy. The authors
     fingerprinting attacks. It uses time sampling and random             in [65] compare the performance of the ML-based tech-
     bidirectional packets padding to change the inter-arrival            niques, such as k-nearest neighbors and Decision Tree,
     time characteristics in the traffic flow, and consequently,          with the traditional commercial rule-based strategy for
     to identify more complex patterns in network packets.                operating systems fingerprinting. The simulation results
 •   User activities identification: Such traffic analysis can            demonstrate the superiority of the learning-based
     be used to obtain interesting pieces of information about a          techniques over the traditional method. Lastovicka et al. in [66]
     specific action that a mobile subscriber carries out on              investigate the performance of three well-known operating
     his/her device (e.g., posting a video on Twitter). The               system fingerprinting techniques, including user-agent,
     identification of the user activities may also be made               TCP/IP parameters fingerprint, and specific domains com-
     to get information about a specific activity, such as the            munication. Performance measures reveal that the method
     length of a message sent by a user within a particular chat          based on user-agents provides better performance than its
     application. User activity identification can be utilized            counterparts.
     by adversaries/researchers to reveal the identity behind
     an unknown user, e.g., on social media, who prefers
                                                                                         IV. OVERVIEW ON AL
     to remain anonymous. This can be done by behavioral
     profiling for the users of a network, which is helpful for         A supervised ML model learns to discriminate the different
     identifying reconnaissance within the network. Moreover,       traffic classes by being trained on labeled training data. While
     such traffic analysis offers a possibility to character-       capturing large quantities of network data is relatively easy,
     ize the users’ habits in a network, e.g., chatting with        analysing the data by ML techniques can be a very time-
     friends in the morning and watching the video stream           consuming, expensive, or human-labor intensive process. This
     in the evening. The user’s behavior information can be         is mainly because of the complexity of ML techniques or the
     employed next time to detect the user presence in the          shortage of labeled data resulting in inefficient training. In
     network. In [60], Conti et al. use ML techniques (i.e., Dynamic order to reduce the number of labeled examples needed and,
     Time Warping (DTW), hierarchical clustering, and                consequently, reduce the effect of the ground truth challenge, AL
     Random Forest) for analyzing Android encrypted network         can be used to facilitate labeling.
     traffic, and consequently, to identify user actions (e.g.,         AL systems can participate in the gathering and selection
     email actions, including sending email, replying, and          of training instances, such that only the most informative
     Facebook actions). The authors in [61] leverage transfer       examples are required to be labeled. Using AL, a learner
     learning to analyze encrypted mobile traffic to deal with      follows an iterative strategy in which it interacts with an oracle
     the problem of diversity of app releases, mobile operating     to choose the most useful data instances to be labeled, thereby
     systems, and device models, and identify user actions.         reducing the cost of data labeling by using only a few labeled
     The work in [62] focuses on the identification of the          examples to deliver satisfactory performance in a reasonable
     Instagram user behavior. Unlike previous works that used       time. The AL paradigm is illustrated in Fig. 2, in which the
     the statistical features of encrypted traffic, this work       three core components are: query strategy, annotator, and
     provides a new technique based on maximum entropy to           ML model. The query strategy is responsible for choosing

unlabeled data according to a pre-defined policy. A label                         the performance of the query strategies in Section VI. Note
is then provided for the selected data by a human/machine                         that in ML terminology, hypothesis space refers to the set of all
annotator, and the data is added to the set of training instances.                possible legal hypotheses, where a hypothesis is a particular
Afterwards, the model is updated, and the process is repeated                     computational model that best explains the target data in
as long as new data is available, or a stopping criterion is                      supervised ML. In active learning settings, a query strategy can
satisfied. Different stopping criteria can be defined to end                      search the hypothesis space through testing unlabeled samples
this iterative process, such as reaching the desired accuracy,                    to reduce the number of legal hypotheses under consideration.
running time, or a maximum number of queries, which can
directly affect the performance of using AL.                                        •   Uncertainty sampling (UNC): In UNC, a learner prefers
   There are mainly two AL scenarios to consider, namely,                               to label the instances where the model is most uncertain
stream-based selective sampling and pool-based sampling                                 about the class of the example. The idea behind the strategy
(presented in Fig. 3). In the former, the distribution of unlabeled                     is that those examples on which the model exhibits
instances is known, and the instances are considered                                    the highest degree of uncertainty are most likely to improve
one at a time. The learner then observes each instance in                               the performance of the model over time. Different criteria,
sequence and decides whether the instance should be labeled                             also called uncertainty strategies, for measuring uncer-
or discarded. AL is a promising technique to alleviate the                              tainty, have been proposed including posterior probability,
challenge of streaming-based learning scenarios [67], [68].                             smallest margin, and entropy [70]. Entropy is one of the
AL algorithms designed for streaming scenarios can control                              most popular uncertainty strategies in many AL problems.
the labeling process and gradually perform this process over                            In an n-class classification problem, assume the estimated
time [69]. Using this strategy, it is expected that the labeling                        probabilities of the n classes are $p_1, \ldots, p_n$, respectively.
process will be in balance and the algorithms will detect                               Given the currently labeled data instances, the entropy
the changes. In the case of pool-based sampling, a pool of                              is defined as $E(X) = -\sum_{i=1}^{n} p_i \log(p_i)$. Given this
unlabeled data is provided, and the aim of the learner is to                            expression, a larger value of the entropy means a higher
select the most informative instances from the pool to be                               level of uncertainty. Accordingly, this objective function
labeled by the annotator. Pool-based sampling is attractive for                         can be considered as a maximization problem.
many real-world learning scenarios as it is possible to collect                     •   Query-By-Committee (QBC): In QBC, an AL system
a large body of unlabeled data at once. Pool-based sampling                             consists of a committee of different learners trained on
presumes that a limited amount of labeled data and a big pool                           the current labeled data. These learners are then used
of unlabeled data are available.                                                        to make a prediction on the labels of unlabeled data.
                                                                                        The instances for which the committee members disagree
                                                                                        the most on the correct label are selected for labeling.
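The iterative pool-based procedure described above (query strategy, annotator, model update, stopping criterion) can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the implementation evaluated in this study: `CentroidModel` is a hypothetical stand-in classifier, `pool_based_al` and `oracle` are made-up names, the single normalized flow feature is invented for the toy data, and the query strategy is the entropy-based uncertainty measure $E(X) = -\sum_{i=1}^{n} p_i \log(p_i)$ with a fixed query budget as the stopping criterion.

```python
import math


def entropy(probs):
    """Shannon entropy -sum_i p_i * log(p_i); terms with p_i = 0 contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


class CentroidModel:
    """Hypothetical stand-in classifier: softmax over negative squared
    distances to the class centroids (not a model used in the paper)."""

    def fit(self, X, y):
        self.labels = sorted(set(y))
        self.centroids = []
        for label in self.labels:
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids.append([sum(col) / len(rows) for col in zip(*rows)])
        return self

    def predict_proba(self, x):
        scores = [-sum((a - b) ** 2 for a, b in zip(x, c)) for c in self.centroids]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        return [e / total for e in exps]

    def predict(self, x):
        probs = self.predict_proba(x)
        return self.labels[probs.index(max(probs))]


def pool_based_al(X_lab, y_lab, pool, oracle, max_queries):
    """Pool-based active learning loop with entropy-based uncertainty sampling."""
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(pool)
    model = CentroidModel().fit(X_lab, y_lab)
    n_queries = 0
    # Stopping criteria: query budget exhausted or no unlabeled data left.
    while pool and n_queries < max_queries:
        # Query strategy: pick the pool instance with the largest predictive entropy.
        idx = max(range(len(pool)),
                  key=lambda i: entropy(model.predict_proba(pool[i])))
        x = pool.pop(idx)
        X_lab.append(x)
        y_lab.append(oracle(x))   # the annotator provides the label
        model.fit(X_lab, y_lab)   # the model is updated with the new instance
        n_queries += 1
    return model, n_queries


# Toy usage: one made-up, normalized flow feature; the oracle is a threshold rule.
oracle = lambda x: 0 if x[0] < 0.5 else 1
pool = [[i / 20] for i in range(21)]
model, n_queries = pool_based_al([[0.1], [0.9]], [0, 1], pool, oracle, max_queries=10)
```

In this sketch the learner starts from two labeled instances and spends its budget on the pool instances whose predicted class distribution has the highest entropy. Other stopping criteria mentioned above, such as a target accuracy or a running-time limit, would replace or extend the `while` condition.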
A. Active learning query strategies                                                     Then, the committee of learners will use the new labeled
   The fundamental question in AL is: what is the most                                  data examples for training purposes. The QBC strategy
effective strategy for querying data instances? In NTC applica-                         creates wider diversity than UNC because it considers
tions, different query strategies can be used based on various                          the differences in the predictions of several different
network circumstances, e.g., new unknown flows, changes in                              learners, instead of measuring the level of uncertainty
the behavior of network traffic, and discovering unclassified                           of labeling using only a single learner. However, the
network traffic. We first introduce the most well-known query                           technique for measuring the disagreement is often similar
strategies of AL widely used in the literature and then evaluate                        for both query strategies [71]. In the QBC strategy,
                                                                                        the vote entropy and KL-divergence metrics are usually
                                                                                        applied to measure the disagreement. In the literature, to
[Figure diagram labels: Train a model; Requiring new data;                              construct a committee of learners, two major approaches
Machine learning model; Unlabeled data; Labeled data;                                   have been proposed. In the first, one can change the
Adding to the training data; Query strategy;                                            parameters/hyperparameters of a particular model (e.g.,
Annotator (human or machine)]                                                           by sampling) in order to generate different models and,
                                                                                        consequently, the committee models (or learners). In
                                                                                        the second, by contrast, the committee is built from a bag
                                                                                        of different learners (i.e., an ensemble of learners).
                                                                                    •   Learning Active Learning (LAL): The main idea behind
                                                                                        this strategy is to train a regressor that forecasts the
                                                                                        Expected error reduction (EER) for an instance in a specific
                                                                                        learning state. Indeed, this technique formulates the query
                                                                                        strategy of unlabeled data as a regression problem. Then,
                                                                                        regarding a trained classifier and its output for a specific
                                                                                        unlabeled instance, LAL
                                                                                        forecasts the decrease in generalization error that can be
                                                                                        reached by labeling that instance. The interested readers
      Figure 2: Graphical description of active learning.                               are referred to read [72] for details.
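The two disagreement measures named for the QBC strategy, vote entropy and KL divergence, can be illustrated with a minimal sketch. The function names, the use of the natural logarithm, and the encoding of class labels as integers 0..n_classes-1 are assumptions made for this illustration and are not taken from the study.

```python
import numpy as np

def vote_entropy(votes, n_classes):
    """Vote entropy per sample.

    votes: integer array of shape (n_members, n_samples) holding the class
    label predicted by each committee member (illustrative layout).
    """
    n_members = votes.shape[0]
    # counts[c, i] = number of committee members voting class c for sample i
    counts = np.stack([(votes == c).sum(axis=0) for c in range(n_classes)])
    frac = counts / n_members
    # Replace zero fractions by 1 inside the log so that 0 * log(0) counts as 0
    safe = np.where(frac > 0, frac, 1.0)
    return -(frac * np.log(safe)).sum(axis=0)

def mean_kl_divergence(probs, eps=1e-12):
    """Average KL divergence of each member from the committee consensus.

    probs: array of shape (n_members, n_samples, n_classes) holding the class
    probabilities predicted by each member (illustrative layout).
    """
    probs = np.clip(probs, eps, 1.0)
    consensus = probs.mean(axis=0, keepdims=True)
    kl = (probs * np.log(probs / consensus)).sum(axis=2)  # (n_members, n_samples)
    return kl.mean(axis=0)

# Three committee members vote on two samples: full agreement on the first,
# a 1-vs-2 split on the second.
votes = np.array([[0, 0],
                  [0, 1],
                  [0, 1]])
scores = vote_entropy(votes, n_classes=2)
query_index = int(np.argmax(scores))  # the sample the committee disagrees on most
```

In both measures a larger value indicates stronger disagreement, so the instance with the highest score would be the one sent to the annotator.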

Figure 3: (a) Stream-based selective sampling, and (b) Pool-based sampling. [Diagram labels, panel (a): Train a model; Machine learning model; Labeled data; Input source; Observe an instance; Make decision; Discard; Query strategy; Annotator (human or machine); Label and add to the training data. Panel (b): Train a model; Machine learning model; Labeled data; Select the most informative instance; Query strategy; Annotator (human or machine); Label and add to the training data.]
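The difference between the two scenarios in Figure 3 can be illustrated with a minimal sketch. It assumes a least-confidence uncertainty score as the query strategy and a fixed query threshold for the stream-based case; the function names, the threshold value, and the example probabilities are hypothetical and are not taken from the study.

```python
import numpy as np

def least_confidence(probs):
    """Uncertainty score per row: 1 minus the highest class probability."""
    return 1.0 - probs.max(axis=1)

def stream_based_decisions(probs_stream, threshold=0.3):
    """Stream-based sampling: instances are observed one at a time and each
    is either queried (sent to the annotator) or discarded."""
    decisions = []
    for probs in probs_stream:
        score = least_confidence(probs[None, :])[0]
        decisions.append("query" if score >= threshold else "discard")
    return decisions

def pool_based_selection(probs_pool, batch_size=1):
    """Pool-based sampling: the whole unlabeled pool is scored and the
    indices of the most uncertain instances are returned."""
    scores = least_confidence(probs_pool)
    return np.argsort(-scores, kind="stable")[:batch_size]

# Class probabilities that a current model might output for three instances
# (made-up values for illustration only).
probs = np.array([[0.95, 0.05],
                  [0.55, 0.45],
                  [0.80, 0.20]])

decisions = stream_based_decisions(probs, threshold=0.3)
selected = pool_based_selection(probs, batch_size=2)
```

In the stream-based loop each decision is made without seeing later instances, whereas the pool-based selection ranks all candidates before any label is requested.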

  •   Random: It refers to the conventional supervised learning scheme in which instances are randomly selected to be labeled. Since data labeling is an expensive procedure, random sampling may not lead to the best learner, especially when the query of each sample is costly, and consequently, few labels will finally be available [71].
  •   Information Density (Density): Uncertainty sampling, QBC, and LAL query strategies are all prone to choosing outliers or unrepresentative instances and, consequently, this can lead to sub-optimal queries. A solution is to use the representativeness of an instance to ensure the selected instances resemble the overall distribution. When considering whether to query an instance, a combination of its representativeness and informativeness is typically used [73]. In the density query, to measure the representativeness of a data instance, the closeness of the data instance to all other data instances is often considered.

       V. ACTIVE LEARNING FOR NETWORK TRAFFIC CLASSIFICATION

   As explained in Section III, NTC has attracted much interest in recent years and different ML methods have been proposed to solve the NTC problem. However, most of these methods suffer from various challenges such as requiring a large amount of fully labeled data, the existence of a considerable amount of semi-labeled or unlabeled data in real-world network scenarios, and a complex, costly, and time-consuming methodology for data labeling. Providing labels to data instances is especially challenging for NTC techniques, because one must consider several requirements in terms of traffic data granularity in order to satisfy the desired traffic classification objectives. One can, for example, refer to classes at the application level (e.g., Skype or Facebook), the protocol level (e.g., TCP or HTTP), or the service group level (e.g., browsing or streaming) as typical examples of data granularity [24]. Moreover, updating a traffic classification method is time-consuming. However, updating the models may be needed to increase the accuracy of the method or recognize new applications, protocols, or protocol versions. The update is essentially performed using new labeled data.
   AL is a promising research field in this context as it greatly reduces the cost of training and dramatically speeds up the learning phase [74]. This helps ML-based traffic classification methods to better satisfy the aforementioned requirements, namely the data requirements and the need for updating to identify new types of traffic, by attaching labels only to the most informative instances.

A. Advantages of using AL for NTC purposes

   AL is potentially a good candidate to perform NTC. Below, we summarize the advantages of using AL techniques in the field of NTC:
  •   Less data needed for labeling: As mentioned before, most conventional networks generate unlabeled and semi-labeled data. Meanwhile, one of the key challenges in using learning-based techniques for NTMA is the lack or limited accessibility of labeled instances. Moreover, data labeling is often not a straightforward procedure and can raise the cost in terms of time, human effort, and computational overhead. Other than that, if data labeling is performed manually or by online tools, it can reduce the data quality, since not all data instances are informative. AL can tackle this concern by labeling only the most informative instances. To this end, a comprehensive set of querying strategies in AL has been proposed to determine the quality of instances for labeling [14].
  •   Concept Drift: Due to the high dynamicity of computer networks, ML techniques must be re-trained frequently for various reasons, e.g., new network behavior and new classes of network traffic [75]. In most ML techniques, such as DL, retraining a model from scratch is a resource-intensive task in terms of time and power

      computation in addition to their need for a huge amount of new data samples. Most well-known ML techniques become useless in NTC as the network cannot be left unattended for a long time for retraining purposes. AL is able to (re-)train the models very fast with high accuracy by continuous provisioning of new labeled instances. This is demonstrated in Section VI where AL performance is evaluated with regard to the training time.
  •   Dealing with the shortage of labeled data samples: In the case of retraining, the number of labeled samples available to train the model is very limited due to the cost of the labeling process, e.g., time, complexity, need for domain knowledge, etc. Most ML models, e.g., DL, need a considerable amount of data to train. As shown in Section VI, AL can train the model with high accuracy using a limited number of data samples.
  •   Incremental Learning: Although AL is not essentially considered an online learning technique, using stream-based sampling can turn it into an incremental learning technique that adapts to the nature of highly dynamic networks. As most conventional and emerging networking paradigms are highly dynamic in different aspects, AL can be used to learn the behavior of network traffic online. In addition, pool-based sampling can help reduce the time complexity of learning from scratch, as the number of training samples becomes limited. Although labeling is a time-consuming task, using different query strategies based on the network traffic circumstances can reduce the time complexity of learning.
  •   Monitoring incoming stream traffic: Using passive learning methods for NTC tasks, such as security and intrusion detection, is no longer reasonable, as these methods cannot handle changes in the statistical characteristics of the target data (i.e., concept drift). To address this issue, one can investigate the abilities of stream-based AL [76]. Several AL-based strategies have been proposed to detect concept drift and instantly adapt to evolving characteristics of data [77] [78] [79].
  •   Addressing Theory of Network: In Internet Engineering Task Force 97 (IETF97)1, the challenge was introduced that networks suffer from the lack of a unified theory that can be applied to all networks. It means that the behaviors of different networks vary based on their topology, equipment, scale, applications, etc. The absence of such a theory causes an important problem: ML techniques should be trained for each network separately. AL can be considered a suitable online learning choice in such cases thanks to its ability to learn from a limited number of data samples. This is beneficial for highly dynamic networks with a huge volume of starting and stopping network traffic flows. AL also allows frequent retraining, which eliminates the necessity of using representative datasets.

  1 https://www.ietf.org/blog/reflections-ietf-97/

B. Literature Review on using AL in NTC

   In this section, we review existing work on the application of AL in NTC.
   Torres et al. [80] proposed a botnet detection technique based on AL. The authors provided a novel AL strategy to label network traffic that contains normal and botnet traffic. The AL strategy is used to create a random forest model that benefits from the user's previously labeled instances. The primary objective of the proposed technique is to help the user in the labeling process. Similarly, the work in [81] employed AL for a security purpose, i.e., malware classification. In this work, SVMs and Active Learning by Learning have been combined to tackle the lack of labeled instances in malware detection. The simulation results reveal that using AL can enhance the performance of classification in terms of accuracy and the quality of labeled instances. In addition, the authors claimed that by using different training algorithms, e.g., Generative Adversarial Networks (GANs), one can solve issues such as the diversity of security-related datasets.
   The work in [82] is another attempt to develop an accurate malware detection system. The system is based on AL, where a new Structural Feature Extraction Methodology (SFEM) is introduced to extract features from docx files. The proposed system is able to identify new unknown malicious docx files. To have an updatable detection model and identify new malicious files, the system benefits from AL to update and complement the signature database with new unknown malware.
   Common cybersecurity attack vectors, such as viruses, botnets, and malware, are known to Intrusion Detection Systems (IDSs). Nevertheless, malicious users continuously create new attacks that can bypass the IDSs. Analyzing anomalous behaviors calls for a considerable amount of time and effort. Preparing a significant amount of labeled data for the training process is both increasingly costly and inefficient, because of the continuous design of new attacks. In this case, one can use AL to reduce the number of required labeled instances, while increasing the accuracy of anomaly detection. In [83], a semi-supervised IDS has been designed that works effectively with a small number of labeled instances. The proposed learning algorithm for the IDS benefits from two ML techniques, namely an Active Learning Support Vector Machine and Fuzzy C-Means clustering. Furthermore, [83] reported that the proposed learning algorithm enables the IDS to add new training instances with minimum computational overhead. Because domain knowledge is required for the annotation of unlabeled instances, adopting new cost-effective labeling techniques is desired. To this end, the work in [84] by Beaugnon et al. developed an interactive labeling strategy, namely ILAB, to assist the experts in the labeling process of large intrusion detection datasets. ILAB adopts a divide-and-conquer approach to lower the computation cost. Deka et al. [85] investigated the important role of AL in the selection of more informative instances. Then, they used these instances to train a binary IDS for Distributed Denial of Service (DDoS) attack classification. In addition, since there are massive amounts of traffic in modern networks, a parallel computation method has been employed. The authors referred to the fact that using AL