Article
Uplink vs. Downlink: Machine Learning-Based Quality
Prediction for HTTP Adaptive Video Streaming
Frank Loh *, Fabian Poignée, Florian Wamser, Ferdinand Leidinger and Tobias Hoßfeld

Institute of Computer Science, University of Würzburg, 97074 Würzburg, Germany;
fabian.poignee@informatik.uni-wuerzburg.de (F.P.); florian.wamser@informatik.uni-wuerzburg.de (F.W.);
ferdinand.leidinger@informatik.uni-wuerzburg.de (F.L.); tobias.hossfeld@informatik.uni-wuerzburg.de (T.H.)
* Correspondence: frank.loh@informatik.uni-wuerzburg.de

Sensors 2021, 21, 4172. https://doi.org/10.3390/s21124172

Abstract: Streaming video is responsible for the bulk of Internet traffic these days. For this reason, Internet providers and network operators try to make predictions and assessments about the streaming quality for an end user. Current monitoring solutions are based on a variety of different machine learning approaches. The challenge for providers and operators nowadays is that existing approaches require large amounts of data. In this work, the most relevant quality of experience metrics, i.e., the initial playback delay, the video streaming quality, video quality changes, and video rebuffering events, are examined using a voluminous data set of more than 13,000 YouTube video streaming runs that were collected with the native YouTube mobile app. Three Machine Learning models are developed and compared to estimate playback behavior based on uplink request information. The main focus has been on developing a lightweight approach using as few features and as little data as possible, while maintaining state-of-the-art performance.

Keywords: HTTP adaptive video streaming; quality of experience prediction; machine learning
Academic Editor: Nikolaos Thomos
Received: 15 April 2021; Accepted: 9 June 2021; Published: 17 June 2021

1. Introduction

The ongoing trend in social life toward virtual environments has been accelerated by the COVID-19 pandemic. Throughout the last year in particular, work, social, and leisure behaviors have changed rapidly towards the digital world. This development is reflected in the May 2020 Sandvine report, which revealed that global Internet traffic was dominated by video, gaming, and social usage in particular, together accounting for more than 80% of the total traffic [1], with YouTube alone hosting over 15% of these volumes.

For video streaming, the Quality of Experience (QoE) is the most significant metric for capturing the perceived quality for the end user. The initial playback delay, streaming quality, quality changes, and video rebuffering events are the most important influencing factors [2–4]. Due to the increasing demand, streaming platforms like YouTube and Netflix have had to throttle the streaming quality in Europe in order to enable adequate quality for everybody on the Internet [5]. This affects the overall streaming QoE for all end users, and ultimately the streaming provider's revenue through long-term user churn.

From the perspective of an Internet Service Provider (ISP), responsible for network monitoring, the goal is to satisfy customers and operate economically. Intelligent and predictive service and network management is becoming more important to guarantee good streaming quality and meet user demands. However, since most data traffic is encrypted these days, in-depth monitoring with deep packet inspection is no longer possible for an ISP to determine crucial streaming-related quality parameters. It is therefore necessary to predict quality with other flow monitoring and prediction techniques. Furthermore, the increasing load and different volumes of flows, and consequently the processing power required to monitor each flow, make detailed prediction even more complex, especially for a centralized monitoring entity.

In this scenario, given the enormous number of different streaming sessions and the associated massive global data exchange, analyzing every video stream at the packet level is not feasible. To enable decentralized monitoring with the limited resources available in the last mile, a lightweight approach without unmanageable overhead is important to improve current services. This calls for scalable in-network QoE monitoring at decentralized entities based on a suitable QoE prediction model.
                               This work reviews several different Machine Learning (ML) approaches to estimate the
                         initial playback delay, video playback quality, video quality changes, and video rebuffering
                         events using data originating from the native YouTube client on a mobile device. The goal
                         of this work is to investigate the following research questions:
                         1.   Is estimation of the most important QoE parameters possible using only low-volume
                              uplink data?
2.   What data are required, and how much prediction accuracy degradation is acceptable when using only uplink data, compared to state-of-the-art downlink-based ML prediction approaches?
                              From our investigation, we show that the Random Forest (RF)-based approach in
                         particular shows promising results for all metrics. Using only uplink information, this
                         approach is lightweight compared to downlink-based predictions. Based on a dataset that
                         was measured over 28 months between November 2017 and April 2020, a total of over
                         13,000 video runs and 65 days of playback are evaluated. Studying the performance of
                         the models provides valuable insights, helping to draw conclusions on the applicability of
                         specific models and parameter options. The contribution of this work is threefold:
                         1.   Three different general ML models are defined, compared, and evaluated on the entire
                              dataset with YouTube streaming data from the mobile app in order to predict the most
                              common QoE-relevant metrics. It is shown that even a simple RF-based approach
                              with selected features shows F1 scores of close to 0.90 to help decide whether video
                              rebuffering occurs during the download of a video chunk request.
2.   Improvements to the ML models are discussed. This results in a lightweight approach that uses only uplink traffic information, with only a marginal decrease in prediction accuracy for a target metric. The learning exploits the fact that, during a streaming session, uplink requests are sent in series to the content server to fill the client's buffer. After playback starts and the buffer is full, the behavior changes to an on–off traffic pattern, which is characteristic of streaming today. This is utilized in the learning algorithm, since the complete playback behavior is reflected in the inter-arrival times of the uplink requests.
3.   The third contribution is to verify that the approach is practically feasible and requires fewer monitoring resources, by estimating the processing and monitoring effort for randomly generated artificial video content and by defining a queuing model to study the load of a monitoring system.
                               Each model is validated using videos other than those utilized for testing, which
                         leads to only slightly worse results. As a result, our work is a valuable input for network
                         management to improve streaming quality for the end user. The remainder of this work
is structured as follows: Section 2 provides background information on video streaming and on the streaming behavior that the methodology in this work uses to predict streaming quality. The streaming quality metrics are also introduced there. Section 3 summarizes related work, first with a literature overview and afterwards with an effort analysis that differentiates uplink-only monitoring from the complete packet trace monitoring used in most related works. Section 4 summarizes the measurement
                         process involving the testbed and the studied scenarios, and Section 5 presents the dataset.
                         A general overview of relevant information in the dataset is provided, with focus on the
                         relevant streaming quality metrics. The dataset is used to train the models presented
                         in Section 6, where additional information about the features and feature sets is given.
                         The results are discussed in Section 7 for each QoE relevant streaming metric. Section 8
                         concludes the paper.

                         2. Background
                              This section contains important definitions and essential background information
                         on streaming, especially streaming-related terminology. The basic streaming process is
                         described, as are the individual phases that a streaming process undergoes. Different factors
influencing the overall streaming quality are introduced briefly at the end of this section.

                         2.1. Terminology
                              Terms and concepts used in this work are defined in this subsection and differentiated
                         from each other. Accordingly, a short description of streaming is followed by explanation
                         of important terms from the network and application perspectives.

                         2.1.1. Video Streaming
                              Video streaming is the process of simultaneously playing back and downloading
                         audio and video content. Streaming is defined by the efficient downloading of the content
                         that is currently being played back by the user, which offers two advantages. From the
                         user’s point of view, playback can begin as soon as the content part matching the current
                         play time is loaded. From the provider’s perspective, only the part that is consumed
                         has to be downloaded, and therefore no resources are wasted. For streaming in general,
the available end-to-end throughput between content server and user must at least match the video encoding rate. Any variation in the video encoding rate and network throughput
                         is balanced by a buffer. Playback continues as long as enough video and audio data are
                         available at the end-user’s player, otherwise the stream stops. These video interruptions
                         are called stalling. Thus, video streaming is a continuous request and download process.
                         Appendix A has more details about this process, and streaming in general.

                         2.1.2. Traffic Patterns
                              Traffic patterns in video streaming are the mapping of the application video behavior
                         to network traffic. A pattern is a sequence of network packets corresponding to the re-
                         quested network chunks. Different sequences of network chunks are requested in different
                         playback situations, depending on what the buffer state is and how much data is required
                         at the application level. From this, the application state can be estimated. The understand-
                         ing of the traffic patterns forms the basis for detecting the current buffer level, and thus
                         the overall player state leading to the playout quality. The time of content requests is
calculated and compared based on the traffic patterns. If the inter-request time of video chunks is larger than the playback duration of the video content requested with each chunk, the buffer depletes. If the inter-request time is shorter, the buffer fills up. Consequently, it is possible to draw conclusions about the buffer change from the uplink request traffic pattern. Different streaming phases are defined in the following paragraphs. Based on the traffic patterns in these phases, the streaming behavior is estimated in this work.
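To make the relationship between inter-request times and buffer evolution concrete, the following minimal Python sketch estimates the buffer trend from a sequence of uplink request timestamps. The chunk playback duration of 5 s and the function and variable names are illustrative assumptions, not part of the measurement pipeline described in this paper.

```python
# Minimal sketch (assumption: each chunk request fetches ~5 s of playback).
# Estimates whether the client buffer fills or depletes between uplink requests.

CHUNK_PLAYBACK_S = 5.0  # assumed seconds of video delivered per chunk request

def buffer_trend(request_times_s):
    """Return a list of per-interval buffer changes in seconds of playback.

    request_times_s: sorted timestamps (in seconds) of uplink chunk requests.
    A positive value means the buffer filled during that interval,
    a negative value means it depleted.
    """
    changes = []
    for prev, curr in zip(request_times_s, request_times_s[1:]):
        inter_request_time = curr - prev
        # Playback consumes `inter_request_time` seconds, while the request adds
        # roughly CHUNK_PLAYBACK_S seconds of new content to the buffer.
        changes.append(CHUNK_PLAYBACK_S - inter_request_time)
    return changes

if __name__ == "__main__":
    # Dense requests (filling phase) followed by sparse on-off requests (steady state).
    times = [0.0, 1.0, 2.0, 3.0, 8.0, 13.0, 18.5]
    print(buffer_trend(times))  # positive early values, ~0 or negative later
```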

                         2.2. Streaming Phase Definition
                               Streaming phases are states of the internal streaming process. At the user’s end, only
the effects and consequences of these phases on the streaming state are noticeable. To the user, they manifest as smooth streaming, a quality change, or video rebuffering. From an application point of view, two different phases in video streaming can be defined according to [6]: an initial buffering phase and a periodic buffer refill phase. In the initial buffering
                         phase, video data is requested and delivered in a best effort manner until the target buffer
                         threshold is reached. Then, the player switches to the periodic buffer refill phase. This
                         phase corresponds to the normal streaming behavior, where as much data is downloaded
                         as is extracted from the buffer for playback. The goal here is to keep the buffer slightly
                         above the target buffer threshold. Thus, the next data is only requested and downloaded
                         when the buffer level falls below the target buffer. As a result, there is no video data
                         downloaded as long as there is sufficient buffer. In contrast, the authors of [7] define three
phases. There, the initial buffering phase is called buffer filling, and the periodic buffer refill phase is called steady state.
                         Additionally, a buffer depletion phase is defined where the buffer level decreases, and thus
                         the amount of pre-buffered seconds also decreases.
                              For a detailed stalling analysis, the chunk request traffic patterns are analyzed in this
                         work for each of the streaming phases defined in [7], together with the pattern when the
playback is interrupted and the video stalls. This is essential to gain insight into which phases have a higher or lower influence on quality.

                         2.3. Quality Influencing Factors
                              Since this work deals with video playback quality estimation, this section introduces
                         factors that influence playback quality for the end user. According to [3], the most important
                         quality degradation factors for video streaming, influencing the QoE for the end user, are
                         the initial playback delay, the played out quality and playback quality changes, and stalling.

                         2.3.1. Initial Delay
                               Initial delay describes the waiting time between video request and video playback
                         start. According to [8], the initial delay can be divided into two parts, the initial page load
time and the delay between the initial video request and the playback start. The initial page load time is influenced not only by the currently available network conditions and the state and quality of the video content delivery network (CDN), but also, among other factors, by the smartphone type, quality, and memory usage. This work therefore investigates the delay from the initial video request to playback start as the initial delay.

                         2.3.2. Streaming Quality and Quality Changes
                              Next, the currently played out quality, especially the amount and frequency of quality
changes, influences the QoE for a streaming end user. Since, in HTTP Adaptive Streaming (HAS), the goal is to adapt the playback bitrate to the currently available bandwidth, frequent quality changes are possible, especially in scenarios with highly variable bandwidth. However, according to [3], a constant but lower quality is preferable to frequent quality changes with a slightly higher average quality. The goal is therefore to keep the number of quality changes to a minimum while increasing the overall average quality. For that reason, most streaming players and adaptation algorithms implement metrics such as a quality change threshold, which define specific rules to trigger a quality change.

                         2.3.3. Stalling
Stalling, the interruption of video playback, is the last QoE-influencing metric discussed in this section. Since stalling has the highest negative influence on the overall QoE [3], detailed QoE detection or estimation relies on the ability to detect or estimate stalling events. Furthermore, since, according to the literature, the QoE degradation is worse with many short stalling events than with fewer but longer events [9], it is also important to detect short stalling instances and short playback phases between stalling events.

                         3. Related Work
                               There are different methodologies in the literature to quantify the video playback
                         quality, detect and estimate the metrics influencing the Quality of Service (QoS), and at
                         the end predict the perceived QoE for the end user. For that reason, this section covers
                         the literature discussing methods to detect or estimate quality degradation metrics, QoS
                         or QoE measurement or prediction approaches, and different ideas to quantify playback
                         quality. Furthermore, request and feature extraction approaches relevant for streaming
                         based works are presented. An overview of selected, especially relevant, related work
is summarized in Table 1. Six main factors, reflected in the table, differentiate the approaches: real-time applicability, the target platform, the approach itself, the prediction goal, the data focus, and the target granularity, ranging from a time-window-based approach to a complete session.

Table 1. Overview of selected related work.

Reference | Real Time | Target Platform | Approach | Prediction Goal | Data Focus | Granularity
Mazhar'18 [10] | ✓ | YouTube desktop | tree-based | QoE | packet | time-window
Wassermann'20 [11] | ✓ | YouTube mobile, desktop | RF | QoE | packet | time-window
Orsolic'20 [12] | ✓ | YouTube desktop | tree-based | QoE/KPI | packet | session
Bronzino'19 [13] | ✗ | YouTube, Netflix, Twitch desktop | RF | initial delay, resolution | packet | time-window
Dimopoulos'16 [14] | ✗ | YouTube desktop | RF | stalling, quality, quality changes | packet | session
Shen'20 [15] | ✓ | YouTube, Bilibili desktop | CNN | initial delay, resolution, stalling | packet | time-window
Lopez'18 [16] | ✓ | - | CNN, RNN | QoE | packet | time-window
Mangla'18 [17] | ✗ | - | session modeling | QoE | packet, request | session
Gutterman'20 [18] | ✓ | YouTube mobile, desktop | RF | buffer warning, video state, video quality | request | time-window
Our work'21 | ✓ | YouTube mobile | RF, NN, LSTM | QoE | request | time-window

                                          There are many works in the literature describing, predicting, or analyzing the QoE.
                                    A comprehensive survey of QoE in HAS is given by Seufert in [3]. Besides knowing the
                                    influencing factors, another important goal is to measure all relevant parameters for good
streaming behavior. One of the first works on streaming measurements, calculating the application QoS and estimating the QoE on a Mean Opinion Score (MOS) scale, was published in 2011 by Mok [19]. However, with the widespread adoption of encryption in the Internet, and also in video streaming, the approaches became more complex, and simple Deep Packet Inspection (DPI) to obtain the current buffer level was no longer possible.
                                          Nowadays, several challenges must be resolved for in-depth quality estimation in
                                    video streaming: First, data must be measured on a large scale. This is already being done
                                    for several platforms, including, among others, YouTube [20], Netflix and Hulu [21], and
the Amazon Echo Show [22]. Afterwards, flow separation and detection are essential to obtain only the correct video flow for further prediction. Here, several approaches exist
                                    for streaming, e.g., [7,23], as ISP middle-box [24], or with different ML techniques [25].
After determination of the correct video flow, the prediction approach is defined. In modern streaming, two approach types are well established: session reconstruction and ML models.
                                    The session reconstruction-based approaches show that it is possible to predict all relevant
                                    QoE metrics. Whereas in [26], only stalling is predicted with session reconstruction, Mangla
                                    shows its utility in [17] for modeling other QoE metrics and comparing them to ML-based
                                    approaches. Similar works have been published by Schatz [27] and Dimopoulos [28] to
                                    estimate video stalls by packet trace inspection of HTTP logs.
                                          Especially in recent years, several ML based approaches have been published for
                                    video QoE estimation. For YouTube in particular, according to Table 1, an overall QoE
                                    prediction with real time approaches and focus on inspecting all network packets, is
                                    available from Mazhar [10], Wassermann [11], and Orsolic [12]. For all works, an in-depth
                                    analysis of voluminous network data is done, and more than 100 features each are extracted.
                                    While Mazhar only provides information about prediction results for the YouTube desktop
                                    version, the mobile version is also considered by Wassermann. Furthermore, Wassermann
                                    shows that a prediction of up to 1 s granularity is possible with a time-window based
approach, with accuracies of more than 95%. The reported recall for stalling, the most important QoE metric, is 65% for their bagging approach and 55% for the RF approach. The precision is 87% and 88%, respectively. In this work, we achieve macro average recall values of more than 80% and a comparable precision. Furthermore, Orsolic provides a framework
                                    that can also be utilized for other streaming platforms.
                                          Different approaches are adopted by, among others, Bronzino [13] and Dimopou-
                                    los [14]. Although the two works are not real time capable, Dimopoulos studies the QoE
                                    of YouTube with up to 70 features, while Bronzino also trains the presented model with
                         Amazon and Twitch streaming data by calculating information like throughput, packet
                         count, or byte count.
Gutterman [18] follows a different, video chunk-based approach instead of a packet-based one, with four states: increasing buffer, decreasing buffer, stalling, and steady buffer. The approach is similar to this work, although the data resolution on the application layer is higher in the approach from Gutterman. In Gutterman's dataset, the phases with decreasing buffer are even more underrepresented, accounting for only 5.9% to nearly 7% of the data, compared to 9.9% in the present work. Furthermore, Gutterman uses sliding windows up to a size of 20, including historic information of up to 200 s. Overall, similar results are achieved. When we compare the video phase prediction to the best results of Gutterman's study, obtained for the Requet App-LTE scenario, Gutterman achieves slightly better precision and recall scores for the stalling phase. The results for the steady phase and the recall for the buffer increase are similar. In the buffer
                         decay phase (depletion in this work), we outperform Gutterman by more than 6% in the
                         precision score and 50% in the recall score. Nevertheless, with the difference in dataset and
                         the amount of included historic information, a complete comparison is difficult.
                               While different ML algorithms are investigated, most of the related approaches men-
                         tioned use at least one type of tree-based ML model. This popularity is a result of multiple
                         factors, including ease of development and lower consumption of computational resources.
Moreover, when multiple algorithms are compared within one work, tree-based algorithms such as RF perform at a similar level to or better than others [11,13,18].
Gutterman [18] investigates the performance of an NN, with results similar to those of a tree-based approach. Shen [15] uses a Convolutional Neural Network (CNN) to infer initial delay, resolution, and stalling. The real-time approach uses video data from YouTube and Bilibili and predicts QoE metrics for windows of 10 s length. Among the tested ML
                         algorithms, the CNN achieves the best results. The potential of deep learning to predict
                         QoE is emphasized further in the work of Lopez [16] by combining CNN and Recurrent
                         Neural Network (RNN).

                         3.1. Preliminary Considerations on Monitoring and Processing Effort
                               The approaches in the literature use different time windows and parameters of uplink
                         and downlink in different granularity. Given the varying degrees of complexity, there is a
                         trade-off between accuracy and computational effort.
                               This is especially important in terms of predicting streaming video parameters in
                         communication networks on a large, distributed scale. An efficient approach is usually
                         preferred here, which can be adopted on the last mile close to customers where little
                         computing power is available. Hence, the question arises, with regard to the related
                         work above: How much monitoring and computational effort is required to achieve high
                         accuracy with machine learning techniques for video streaming?
                               In preparation for the following study, we examine how much data might be generated
                         during video streaming of arbitrary videos and how much data needs to be monitored and
                         processed with regard to the various methods available in the literature. While there are a
                         large number of proposed approaches that detect quality, stalling, and the playback phases,
                         the approach in this work focuses exclusively on partial information for monitoring and
                         processing, e.g., the uplink information and the features generated from it.
In the following steps, 9518 randomly generated artificial videos are created according to models from the literature [29,30], and the effort required to obtain the specified partial information while streaming them is quantified (streaming simulation). Note that
                         this is meant to be a pre-study with focus on arbitrary video content. For the dataset used
                         for the actual study later, we refer to Sections 4.2 and 5 for more information on the dataset
                         and measurement scenarios. The generated data, scripts, and additional evaluation are
                         available in [31]. The aim is to show how much data and effort is required when machine
                         learning is used with partial data in comparison to a full monitoring. Later on, in the next
                         sections, machine learning is carried out that quantifies the accuracy for different feature
                         sets on partial data.

                         3.1.1. Methodology
The basis for this evaluation is the modeling of different videos with data-rate models that generate video streaming content. This content is then streamed with the help of a simulation of adaptive bitrate (ABR) adaptation logic, and the result is mapped to requests at the network layer in order to determine the amount of data exchanged while streaming. All traffic models for H264/MPEG-like encoded variable bitrate video in the literature can be broadly categorized into (1) data-rate models (DRMs) and (2) frame-size models (FSMs) [29]. FSMs focus on generating single frames of video content in order to be able to create arbitrary yet realistic videos [30,32]. The approach works as follows: First, we create the video content in different resolutions, resulting in variable-bitrate video content for each resolution. Afterwards, ABR adaptation is applied to the video content; in this case, an adaptation with three quality switches is applied, each with a duration of one third of the total video length. Finally, this can be used to calculate how many bytes must be processed and monitored if (1) the entire video has to be considered or (2) only the last 5, 3, or 1 HTTP requests in up- and downlink have to be considered.
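As an illustration of this accounting step, the following Python sketch contrasts the bytes that must be processed for full uplink-and-downlink monitoring with an uplink-only view and with a sliding window over the last k requests. The request data layout and the helper names are illustrative assumptions; the actual simulation and scripts are available in [31].

```python
# Sketch of the monitoring-effort accounting (assumed data layout, not the
# original simulation code). Each request carries its uplink and downlink bytes.

from dataclasses import dataclass

@dataclass
class ChunkRequest:
    uplink_bytes: int     # size of the HTTP request in uplink direction
    downlink_bytes: int   # size of the video data returned for this request

def monitoring_effort(requests, last_k=None, uplink_only=False):
    """Bytes a monitoring entity has to process.

    last_k:      if set, only the last k requests are considered (time window).
    uplink_only: if True, downlink payload is ignored entirely.
    """
    window = requests if last_k is None else requests[-last_k:]
    total = 0
    for req in window:
        total += req.uplink_bytes
        if not uplink_only:
            total += req.downlink_bytes
    return total

if __name__ == "__main__":
    # Toy example: 10 chunk requests of ~1 kB uplink and ~2 MB downlink each.
    reqs = [ChunkRequest(1_000, 2_000_000) for _ in range(10)]
    print(monitoring_effort(reqs))                    # full packet trace
    print(monitoring_effort(reqs, uplink_only=True))  # uplink only
    print(monitoring_effort(reqs, last_k=5))          # last 5 requests
```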

                         3.1.2. Results
Table 2 shows the generated video content according to [30,32]. We generated a total of 9518 videos, each available in every common resolution from 144p to 1080p (9518 samples per resolution). Figure 1a shows the empirical cumulative distribution function (CDF) of the bytes required for streaming a randomly generated video. The individual lines in the plot reflect the effort required to consider the downlink traffic, the uplink traffic, and the entire corresponding traffic. Overall, the 50% quantile shows a traffic reduction of around 86%
                         if only the uplink has to be considered. This shows the advantage of approaches that only
                         require uplink data for monitoring and prediction. As suggested above, these approaches
                         are particularly interesting in practice when there are no large computational resources
                         available; for example, when making predictions close to the user within the last mile.
                               Figure 1b shows the data effort if only a time window of the last x HTTP requests
                         (1, 3, or 5 requests) is used. This is of particular interest for the prediction itself, since only
                         a small amount of data has to be processed here.

                         Table 2. Feasibility study: artificially generated video content; video duration: 5–15 min.

Resolution | # Samples | Bitrate [kbps] min/∅/max | Bitrate [kbps] Q10/Q90 | Size [MB] min/∅/max
144p | 9518 | 32/114/637 | 70.2/167.6 | 2/8/20
240p | 9518 | 175/256/749 | 212.3/310.1 | 7/19/33
360p | 9518 | 489/570/1092 | 526.5/624.2 | 19/41/69
480p | 9518 | 974/1057/1546 | 1013.5/1111.1 | 38/79/124
720p | 9518 | 2091/2172/2661 | 2128.9/2225.2 | 79/160/249
1080p | 9518 | 4575/4657/5183 | 4613.3/4710.2 | 173/338/529

                         3.2. Queuing Model of QoE Monitoring Entity
This finding is further evaluated with a simple queuing model that studies the load of a monitoring entity performing QoE prediction based only on uplink traffic, compared to a full packet trace consisting of uplink and downlink traffic. A visualization
                         of the model is given in Figure 2.
Figure 1. Streaming simulation results. (a) Uplink and downlink streaming data to be processed for monitoring or prediction. (b) Data effort to be monitored or processed with the time window approach.

                               3.2.1. General Modeling Approach
                                    During traffic flow monitoring, all packets of a video stream arrive at the monitoring
                               instance, wait until the monitoring entity is free, and are subsequently processed. This is
                               described by a Markov model with a packet arrival process A, an unlimited queue, and
                               a processing unit B. The arrival process can be described as a Poisson process with rate
λ when a sufficient number of flows are monitored in parallel [33]. This holds true for this approach, since studying only a small number of flows would not require this effort. The
                               processing unit B processes the packets in the system with rate µ. To analyze the model, a
general independent processing time distribution with expected average processing time E[B] = 1/µ can be assumed. This holds true, for example, for a deterministic processing time distribution. Then, the system can be described as an M/GI/1−∞ system. For a stable system, λ < µ must hold, which means the arrival rate must stay below the processing rate,
                               in order to not overload the system. Furthermore, for real-time monitoring or to satisfy
                               specific network management service level agreements (SLA), a sojourn time or system
response time t < x can be defined, where x describes the target delay in the system. For reasons
                               of simplicity and since the model is only used as an illustration, the following model is
                               analyzed as an M/M/1 − ∞ system with an exponential processing time distribution.

Figure 2. Visualization of the queuing model used to quantify the monitoring load for the presented prediction approach: packets to be analyzed arrive at the queue (arrival process A) and are processed by the analysis unit (processing unit B).
                               3.2.2. Modeling Results
                                      For the usability study of the approach in a real environment, two different questions
                               are answered: (1) What is the average sojourn time or average response time of the system
                               E[ T ] dependent on the processing time of the processing unit within the monitoring entity;
                               and (2) How many video streams, or streaming flows, can be monitored in parallel without
                               exceeding a target delay with a probability q? Both questions are answered, based on the
arrival rates for uplink traffic (λ_ul,sim = 0.32/s) and the complete packet trace containing uplink and downlink traffic (λ_all,sim = 108.22/s) obtained from the streaming simulation above. Furthermore, the uplink traffic rate (λ_ul = 0.22/s) and the complete uplink and downlink traffic rate (λ_all = 79.65/s) of the complete dataset used for this work are used. A detailed overview of the dataset follows in Section 5.
To answer the first question, Little's law [34] is used to obtain the average response time via E[X] = λ · E[T], with E[X] = ρ/(1−ρ) as the long-term average number of customers in a stationary system and ρ = λ/µ as the utilization of the processing unit. With Little's law [34], E[T] is obtained as E[T] = E[B]/(1−ρ), with E[B] = 1/µ as the average processing time of the processing
                            unit within the monitoring entity. As processing time, a value range between 1 µs and
                            1 s is studied in Figure 3a at the x-axis. The y-axis shows the resulting response time of
                            the system. The yellow line shows the result when monitoring uplink and downlink data
                            considering the streaming simulation, whereas the orange line is the result considering the
                            full packet trace received by the dataset. The results are shown in brown when monitoring
                            only uplink traffic for the streaming simulation, and in black for the dataset. It is evident
                            that the average response time is similar up to a processing time of 10 ms per packet. For
                            slower systems, the processing entity of the monitoring system is not able to analyze all
                            packets of the complete stream anymore. Thus, the response time increases drastically.
                            When only the uplink data is to be monitored, the processing entity can process all data up
                            to a processing time of more than 1 s per packet. Thus, slower and cheaper hardware could
                            be used.
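The following short Python sketch reproduces this response time calculation for the M/M/1−∞ model, using the arrival rates given above; the processing time values are example inputs, and the helper name is an assumption for illustration.

```python
# M/M/1 mean response time E[T] = E[B] / (1 - rho), with rho = lambda / mu.
# Arrival rates taken from the text; processing times are example values.

def mean_response_time(arrival_rate, processing_time):
    """E[T] in seconds for an M/M/1 queue; None if the system is unstable."""
    mu = 1.0 / processing_time          # service rate in packets/s
    rho = arrival_rate / mu             # utilization of the processing unit
    if rho >= 1.0:
        return None                     # unstable: the queue grows without bound
    return processing_time / (1.0 - rho)

if __name__ == "__main__":
    rates = {
        "uplink only (dataset)": 0.22,        # packets/s
        "uplink only (simulation)": 0.32,
        "full trace (dataset)": 79.65,
        "full trace (simulation)": 108.22,
    }
    for label, lam in rates.items():
        for b in (1e-6, 1e-3, 1e-2, 1e-1, 1.0):   # processing time in seconds
            print(label, f"E[B]={b:g}s", mean_response_time(lam, b))
```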

Figure 3. Modeling results overview (lines: uplink dataset, uplink streaming simulation, total dataset, total streaming simulation; target probabilities q = 0.9 and q = 0.999). (a) System response time [s] comparison dependent on the processing time [ms] of the monitoring entity for uplink and all traffic. (b) Possible number of parallel supported flows dependent on a target delay of the monitoring entity for uplink and all traffic.

                                  To answer the second question of how many streaming flows can be supported in
                             parallel, E[ T ] < x or P( T < x ) = q with x as target processing time or target delay in
                             seconds is introduced. Furthermore, q is the target probability describing the percentage
                             of packets that must be analyzed faster than x to guarantee specific SLAs. For positive
t-values, the distribution function of the response time is given by F(t) = 1 − e^(−(µ−λ)t). Solving this, the supported arrival rate λ depends on the target probability q according to λ ≤ (ln(1−q) + µt)/t. Figure 3b shows the number of flows supported in parallel
                             by a 1 G switch for monitoring only uplink traffic, compared to all uplink and downlink
                             packets based on a target delay x in milliseconds. The assumption is that the switch can
                             process 1 Gbps and the packets always fill the maximal transmission unit (MTU) in the
                             Internet, of 1500 B. The target probability q is set to 90% and 99.9%, respectively. It is
                             obvious that focusing on only uplink-based monitoring allows 100 to 1000 times more
                             flows compared to full packet trace monitoring. Furthermore, since there is an active
                             trend towards increasing data rates with higher resolutions and qualities, uplink-based
                             monitoring is a valuable methodology to deal with current and future video streams while
                             saving hardware costs or distributing monitoring entities.
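A corresponding sketch for the second question is shown below: it derives the maximum arrival rate from λ ≤ (ln(1−q) + µt)/t and converts it into a number of parallel flows, under the stated assumptions of a 1 Gbps switch, 1500 B packets, and the per-flow packet rates from above; the function name is ours.

```python
# Maximum number of parallel flows for an M/M/1 monitoring entity such that
# P(T < x) >= q, using lambda_max = (ln(1 - q) + mu * x) / x.
# Assumptions from the text: 1 Gbps processing capacity, 1500 B (MTU) packets.

import math

MU = 1e9 / (1500 * 8)          # packets/s the 1 Gbps switch can process

def max_parallel_flows(per_flow_rate, target_delay_s, q):
    """Number of flows whose aggregate arrival rate still meets the delay target."""
    lambda_max = (math.log(1.0 - q) + MU * target_delay_s) / target_delay_s
    if lambda_max <= 0:
        return 0
    return int(lambda_max // per_flow_rate)

if __name__ == "__main__":
    for q in (0.9, 0.999):
        for target_delay in (1e-3, 1e-2, 0.1):  # target delay x in seconds
            uplink_only = max_parallel_flows(0.22, target_delay, q)   # dataset, uplink
            full_trace = max_parallel_flows(79.65, target_delay, q)   # dataset, all packets
            print(q, target_delay, uplink_only, full_trace, uplink_only / full_trace)
```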

                         4. Measurement Process
                              In-depth QoE monitoring of YouTube adaptive streaming is difficult, especially for
                         a mobile device. For that reason, this section summarizes the measurement approach by
                         introducing the testbed and the measured scenarios.

                         4.1. Testbed
                              The testbed used for data collection in this work is installed at the University of
                         Würzburg. It consists of a Linux control PC running an Appium server, several Android
                         smartphones running an Appium client, and a Wi-Fi dongle. On the smartphones, videos
are streamed via the native YouTube application for Android. Besides the challenge of encrypted network traffic used for YouTube streaming [35], the major challenge is to access the internal client data of the mobile app. In contrast to custom-made apps like YoMoApp [36], the approach presented in this work directly measures the native YouTube app behavior with the Wrapper App [37]. This is done by accessing and controlling an Android smartphone via USB and the Android Debug Bridge (adb). In order to manipulate the effective bandwidth for a measurement run, the Linux tools tc and NetEm are used to shape the traffic on the Internet link of the PC, which acts as a proxy for the mobile devices. The bandwidth for the smartphones is controlled accordingly. Note that
                         previous example measurements show a maximum incoming bandwidth of 400 Mbit/s and
                         always sufficient memory and CPU. For that reason, the control PC is not the bottleneck in
the streaming process. A dummy video is streamed between runs to ensure a full reset of
                         the bandwidth conditions. The quality setting in the YouTube App is set to automatic mode
                         to investigate the adaptive video streaming mechanisms. There are two important sources
                         of data, the network and transport layer traces, which are captured with the tcpdump
                         tool and contain all the network data for a measurement run with a smartphone, and the
                         application layer data, which are obtained by repeatedly polling YouTube’s Stats for nerds.
                         More information about all available application parameters and the detailed Wrapper
                         App behavior are available in [37].
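To illustrate how such bandwidth control can be automated, the sketch below throttles a network interface with tc's token bucket filter from a Python script. It is a simplified stand-in, assuming the interface name eth0, root privileges, and example rates and step durations; it is not the measurement tooling used in the paper, which relies on tc together with NetEm.

```python
# Hedged sketch: stepping through a constant-bandwidth scenario by reshaping an
# interface with tc's token bucket filter (tbf). Requires root privileges.
# Interface name, rates, and step duration are illustrative assumptions.

import subprocess
import time

IFACE = "eth0"  # assumed interface towards the smartphones

def set_rate(rate_kbit: int) -> None:
    """(Re)apply a tbf qdisc limiting the egress rate of IFACE."""
    subprocess.run(
        ["tc", "qdisc", "replace", "dev", IFACE, "root", "tbf",
         "rate", f"{rate_kbit}kbit", "burst", "32kbit", "latency", "400ms"],
        check=True,
    )

def clear_shaping() -> None:
    """Remove the shaping qdisc again after a measurement run."""
    subprocess.run(["tc", "qdisc", "del", "dev", IFACE, "root"], check=False)

if __name__ == "__main__":
    # Example: 200 kbit/s to 2 Mbit/s in 100 kbit/s steps, 60 s per step.
    try:
        for rate in range(200, 2001, 100):
            set_rate(rate)
            time.sleep(60)
    finally:
        clear_shaping()
```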

                         4.2. Measurement Scenarios
                              More than 75 different bandwidth scenarios are applied to study the streaming be-
                         havior of the native YouTube app. First, 38 constant bandwidth scenarios are measured to
                         analyze the general streaming and requesting behavior with the app. There, the bandwidth
                         ranges from 200 kbit/s to 2 Mbit/s in 100 kbit/s steps and from 2 Mbit/s to 10 Mbit/s in
                         500 kbit/s steps. Additionally, constant bandwidth settings of 20 Mbit/s, 25 Mbit/s, and
                         100 Mbit/s are studied to get insights into best case streaming situations. Second, specific
                         quality change and stalling scenarios are defined, where the bandwidth is lowered during
                         video playback. To force the player to request different qualities, several smaller and
                         larger variations in the bandwidth setting are applied, while scenarios with very abrupt
                         bandwidth degradation to 200 kbit/s or lower are used to trigger stalling in case of a
                         buffer underrun. Third, real bandwidth traces from mobile video streaming scenarios
according to [38,39] are used. The measurements show that the 4G traces of [38] are most likely sufficient to stream the video in consistently high quality. Furthermore, no stalling events could be detected when applying these scenarios. In contrast, the traces of [39] contain lower-bandwidth scenarios that are likely to trigger quality changes. In most measurements, the quality is automatically chosen by the YouTube app so as not to influence the common behavior
                         of the data requesting process. In all, 243 YouTube videos are measured for evaluation.
                         In addition to the 18 videos listed in [26], 225 additional random YouTube videos are
                         measured to cover a broad range of typical content. All data are aggregated to a large
                         dataset that serves as the basis for the machine learning process. The dataset is presented
                         in detail in the following paragraphs.

                         5. Dataset
                              For an in-depth analysis of the streaming behavior, a dataset is required that covers
specific and regular behavior for all QoE-relevant metrics, which are the initial delay, the streaming quality (for YouTube, the video resolution), quality changes, and stalling. For that reason,
                         a dataset is created in this work using the above-mentioned testbed and measurement
                         scenarios, which covers 13,759 valid video runs and more than 65 days of total playback
                         time. Parts of the dataset are already published for the research community [8,20]. In the
                         following subsection, first the dataset postprocessing is explained, giving details on net-
                         work and application data. Afterwards, a dataset overview is presented, with the main
                         focus being on streaming relevant metrics.

                         5.1. Postprocessing
                               The postprocessing of the application data and network traces in this work is twofold.
                         First, the network traces are analyzed and all video related flows are extracted. Afterwards,
the application data are postprocessed, and the combined information from the application and the network is logged for further usage.

                         5.1.1. Network Data
The network data are captured with tcpdump in the pcap format. By following the IP addresses and ports resolved from the googlevideo.com DNS responses, all traffic associated with the YouTube video
                         can be separated from cross traffic like advertising, additional video recommendations,
                         or pictures. For each packet of this traffic, the packet arrival time, the packet source and
                         destination IP address and port, and the payload size in bytes are extracted.
                              Consecutive download of video data is required in video streaming to guarantee
                         smooth playback. Therefore, requests are detected in the uplink, asking for video data
                         from the YouTube server. Thus, all packets containing payload to the YouTube server are
extracted from the captured packets and labeled as uplink chunk requests. Here, we assume
that the video is downloaded in a consecutive manner according to Figure 4. This means that chunk x is always requested before chunk x + 1 and downloaded completely. Thus,
                         all downlink traffic following a request in the same flow is marked as associated to that
                         specific request, before the next one is sent to the server. In this way, information such as the
                         amount of uplink and downlink traffic in bytes, amount of uplink and downlink packets,
and the protocol associated with one request can be extracted. Note that the last downlink packet of request x, i.e., the end of downlink duration x in Figure 4, does not necessarily coincide with the timestamp of the next uplink request.

(Figure 4 sketches a time axis: uplink request x is followed by chunk downlink x, whose span is the downlink duration x; the inter request time (x, x+1) reaches from request x to the next uplink request x+1, which is followed by chunk downlink x+1.)
Figure 4. Timeline of YouTube chunk requests and associated data download.
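As an illustration of this extraction step, the following Python sketch separates uplink chunk requests from their associated downlink traffic in a capture using scapy. The set of video server IP addresses, the field names, and the simplification that every uplink payload packet opens a new request are assumptions for illustration and do not reproduce the exact tooling used in this work.

```python
# Minimal sketch (not the exact tooling of this work): detect uplink chunk requests
# and attribute the following downlink traffic to the most recent request.
# "video_server_ips" is assumed to be resolved beforehand from googlevideo.com DNS responses.
from scapy.all import rdpcap, IP, TCP, UDP

def extract_requests(pcap_file, video_server_ips):
    requests = []
    for pkt in rdpcap(pcap_file):
        if IP not in pkt:
            continue
        l4 = pkt[TCP] if TCP in pkt else (pkt[UDP] if UDP in pkt else None)
        if l4 is None or len(l4.payload) == 0:
            continue  # ignore packets without payload (e.g., pure TCP ACKs)
        if pkt[IP].dst in video_server_ips:
            # uplink payload towards the video server -> treated as a new chunk request
            # (in practice, small QUIC acknowledgement packets would have to be filtered out)
            requests.append({
                "t_request": float(pkt.time),
                "request_bytes": len(l4.payload),
                "protocol": "TCP" if TCP in pkt else "UDP",
                "port": l4.dport,
                "downlink_bytes": 0,
                "downlink_packets": 0,
                "t_last_downlink": None,
            })
        elif pkt[IP].src in video_server_ips and requests:
            # downlink payload is associated with the most recent open request (cf. Figure 4)
            requests[-1]["downlink_bytes"] += len(l4.payload)
            requests[-1]["downlink_packets"] += 1
            requests[-1]["t_last_downlink"] = float(pkt.time)
    return requests
```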

                         5.1.2. Application Data
                               The application information Stats for nerds is available as a JSON-based file. There,
                         several application specific parameters are logged together with a timestamp in 1 s granu-
                         larity. Table 3 gives an overview of all parameters relevant for this work. The postprocessed
                         application data are added to the network data for further usage. The full dataset with all
                         information for each request used in the learning process in this work is available in [31].
                         Further details about data extraction are presented in [8,20].

                         Table 3. Application data overview.

                                   Value                            Explanation
                                   timestamp                        timestamp of measurement log
                                   fmt                              video format ID (itag)
                                   fps                              frames per second
                                   bh                               buffer health
                                   df                               dropped frames and played out frames
                                   videoID                          ID of played video
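For illustration, a minimal Python sketch of how such a per-second log could be parsed is given below. The concrete JSON layout, the exact field names, and the assumed "dropped/played" encoding of the df value are assumptions and may differ from the published dataset format.

```python
# Sketch: parse the per-second "Stats for nerds" log into one record per sample.
# Field names follow Table 3; the JSON layout and the "dropped/played" encoding
# of the df field are assumptions for illustration.
import json

def load_app_log(path):
    with open(path) as f:
        samples = json.load(f)  # assumed: a list of per-second dictionaries
    parsed = []
    for s in samples:
        dropped, played = (int(v) for v in str(s["df"]).split("/"))
        parsed.append({
            "timestamp": float(s["timestamp"]),
            "itag": int(s["fmt"]),            # video format ID -> playback resolution
            "fps": int(s["fps"]),
            "buffer_health_s": float(s["bh"]),
            "dropped_frames": dropped,
            "played_frames": played,
            "video_id": s["videoID"],
        })
    return parsed
```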

                         5.2. Data Overview
                               After data postprocessing, a more detailed overview of the dataset is visible, as shown
                         in Table 4. The table shows that 7015 of all videos, or 51.0%, have at least one change of the
fmt-value in Stats for nerds. This change is a playback resolution change triggered by the YouTube HAS algorithm and is referred to as a quality change. Thus, in the following paragraphs, the terms resolution change and quality change are used interchangeably. In total, the dataset contains 22,015 quality changes, detected as differing video format ID values (fmt-values) within a single video run. Furthermore, 2961 videos, or 21.5%, have at least
                         one stalling event, accounting for a total of 5934 stalling instances and more than 32 h of
                         stalling time. The stalling is detected by analyzing the buffer health and the played out
                         frames. If the timestamp of the log is increasing faster than the frames are played out and
                         the buffer health is close to zero, the player is assumed to stall.

                         Table 4. General dataset overview.

                                       Parameter                                                Value
                                       all runs                                                 13,759
                                       total runs quality change                                7015
                                       amount quality changes                                   22,015
                                       total runs stalling                                      2961
                                       total stalling events                                    5934
                                       average video length                                     6 min 45 s
                                       total number requests                                    1,142,635
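A minimal sketch of the stalling heuristic described before Table 4, applied to the parsed application log, could look as follows; the concrete thresholds for an almost empty buffer and for the playback progress tolerance are assumptions.

```python
# Sketch of the stalling heuristic described above: the player is assumed to stall
# whenever the log time advances faster than the played-out frames while the
# buffer health is close to zero. Thresholds are illustrative assumptions.
def detect_stalling(samples, buffer_thresh_s=0.5, progress_tol_s=0.2):
    stalls = []
    for prev, cur in zip(samples, samples[1:]):
        dt_log = cur["timestamp"] - prev["timestamp"]
        dt_play = (cur["played_frames"] - prev["played_frames"]) / max(cur["fps"], 1)
        if cur["buffer_health_s"] <= buffer_thresh_s and dt_play < dt_log - progress_tol_s:
            stalls.append((prev["timestamp"], cur["timestamp"]))
    return stalls
```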

The stalling duration varies from less than one second to 117 s across all videos in the dataset. Short stalling of less than 5 s occurs in 20% of all stalling cases. The median stalling duration is 10.5 s and the mean is 19.87 s. To avoid conflicts with the initial playback delay, stalling within the first 5 s of video playback time is not
                         considered. Furthermore, for a learning approach with historic information as investigated
                         in this work, stalling prediction at the beginning of the video is not possible.
                               The dataset contains videos streamed with TCP and UDP, visible in the network
                         data. Among all runs, 22.9% are streamed with TCP, and the remainder with UDP.
A more detailed investigation of the main QoE degradation factors, i.e., initial delay, playback quality, quality changes, and stalling, follows in the respective sections of the methodology.
To acquire more information about the parameters influencing specific streaming relevant metrics, a dataset study is conducted first. Since no video specific information like video ID or resolution is available from encrypted network traffic, such information is not studied in this work. Furthermore, the available downlink bandwidth for streaming the videos is not analyzed, since it is not directly detectable from the traffic trace. Note that there is no explicit analysis of the total amount of downloaded and uploaded bytes for specific QoE relevant metrics, since this is not the main focus of this work and is highly dependent on the video resolution. The
                         focus in the following paragraphs is on the most relevant influencing parameters with an
                         impact on the prediction result. Additional data set analysis is presented in Appendix B.

                          5.3. Initial Delay
The first non-zero played out frames value indicates when playback start is logged in the Stats for nerds data. Since the logging granularity there is about 1 s, the exact playback start cannot be read directly from the log. Using the number of played out frames and the video fps, i.e., how many frames are played out within one second, the initial playback start can be calculated. Furthermore, the initial video chunk request timestamp is logged in the network data file. It is subtracted from the playback start timestamp to obtain the ground truth initial delay. The analysis shows that the mean ground truth initial delay is 2.64 s.
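A compact sketch of this ground-truth computation, based on the parsed application log and the timestamp of the first chunk request from the network data, is given below; the field names are the assumed ones from the parsing sketch above.

```python
# Sketch: refine the 1 s log granularity with the number of frames already played
# out, then subtract the timestamp of the initial chunk request (from the network data).
def initial_delay(samples, first_request_ts):
    first = next(s for s in samples if s["played_frames"] > 0)
    # playback started played_frames / fps seconds before this log entry was written
    playback_start = first["timestamp"] - first["played_frames"] / first["fps"]
    return playback_start - first_request_ts
```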

                          Request Study
Since the main objective of this work is estimating QoE relevant metrics based on uplink requests, the number of uplink requests required to start the video playback is studied for initial delay prediction. Therefore, all detected chunk requests before the
                         playback starts are summed up. Figure 5a shows in black the number of requests sent to
                         the YouTube server before the playback starts for all videos as CDF. The brown line shows
                         the number of requests completely downloaded according to the network data analysis
                         presented above. The figure shows that the playback can start when only a single request
is sent, and for a very small percentage of all videos even before the first request is downloaded
                         completely. Furthermore, for more than 60% of all measured runs, five requests must be
                         sent to start the video and in total up to ten requests can be required until the initial video
                         playback starts. Since according to Figure 5a, the number of chunk request downloads
                         started before playback start can be higher than the number of chunk request downloads
                         ended, video playback can start in the middle of an open chunk request download. This
happened in 69.72% of all runs. However, no additional insight is gained by further analyzing this behavior.
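The counting underlying Figure 5a can be sketched as follows, using the per-request records from the network data extraction above; this is an illustrative reconstruction, not the original evaluation script.

```python
# Sketch: number of chunk requests sent, and completely downloaded,
# before the estimated playback start (cf. Figure 5a).
def requests_before_playback(requests, playback_start):
    sent = sum(1 for r in requests if r["t_request"] < playback_start)
    downloaded = sum(1 for r in requests
                     if r["t_last_downlink"] is not None
                     and r["t_last_downlink"] < playback_start)
    return sent, downloaded
```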

(Figure 5 shows two CDF plots. (a) Requests before video starts: number of chunk requests sent ("request start") and completely downloaded ("download ended") before playback. (b) Inter request time by resolution: inter-request time in seconds for 144p to 1080p.)
Figure 5. Dataset study: initial delay and video resolution.

                          5.4. Video Quality
                              For an in-depth analysis of the streaming behavior, next to the initial delay, the video
                         quality is an important metric. The video quality is described by the overall playback
quality and the extent of quality changes. Since this work analyzes the playback quality based on network data without studying the effect of packet loss, frame errors like blur or fragments in single frames are not considered. Thus, only the YouTube format ID is taken into consideration as the quality metric in this work.
                              In total, more than 1.1 million requests are measured, as summarized in Table 5.
The table shows that, except for the 1440p resolution, more than 80,000 requests are measured for each resolution. The largest share is 720p, which is the target resolution for many videos on the smartphone if higher quality is not triggered manually. Since only 841 requests are measured for the 1440p video resolution and YouTube did not automatically trigger 1440p quality, in the following sections only qualities up to 1080p are studied. Furthermore,
except for the newest generation of smartphones, which also start to support 4K video, most smartphones do not support qualities higher than 1080p. A screen resolution of 1440p
or higher was used by less than 15% of all phones at the end of 2019 [40]. For that reason, 1440p
                         video resolutions are omitted from further evaluation.

                         Table 5. Video quality overview.

                                  Resolution                    Total Requests                     Percentage
                                  144p                          156,031                            13.93
                                  240p                          85,221                             7.61
                                  360p                          109,223                            9.75
                                  480p                          184,963                            16.51
                                  720p                          487,717                            43.53
                                  1080p                         96,516                             8.61
                                  1440p                         841                                0.08

                              To gain greater insight into the requesting behavior of the different quality represen-
                         tations within the dataset, the inter-request time of specific resolutions is studied in the
                         paragraphs that follow.

                         Inter-Request Time Study
                               Inter-request time is the main metric to describe the streaming process in this work.
                         An overview of the inter-request time for all available video resolutions is given in Figure 5b
                         as CDF. The x-axis shows the inter-request time in seconds, whereas the colors of the lines
                         represent the different resolutions, starting at 144p and ending with 1080p, with lighter
                         colors for higher resolutions.
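The empirical CDFs of Figure 5b can be reproduced with a few lines of Python; the sketch below assumes a list of (resolution, inter-request time) pairs as input and is not the original plotting code.

```python
# Sketch: empirical CDF of the inter-request time per resolution (cf. Figure 5b).
import numpy as np

def inter_request_cdfs(samples):
    """samples: iterable of (resolution_label, inter_request_time_s) pairs."""
    groups = {}
    for res, irt in samples:
        groups.setdefault(res, []).append(irt)
    cdfs = {}
    for res, values in groups.items():
        x = np.sort(values)
        y = np.arange(1, len(x) + 1) / len(x)  # empirical CDF
        cdfs[res] = (x, y)
    return cdfs
```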
                               The figure shows that the inter-request time for 10% to 20% of all requests, regardless
of resolution, is very small. The measurements show that this behavior occurs mostly at the beginning of the video, where many requests are sent. For lower resolutions,
                         in the best effort phase at the beginning of the video, the data for a single request can be
                         downloaded very fast when enough bandwidth is available, since it is rather small. For that
                         reason, the next request is sent with only a short delay, leading to a short inter-request time.
                         Another explanation for this behavior is the parallel requesting of audio and video data.
                         A more detailed explanation of this is given in [26], although this plays no further part in
                         this work.
                               Figure 5b shows, for all resolutions, that more than 70% of all inter-request times are
between 0.1 s and 12 s. In particular, a rather linear distribution is visible for resolutions between 240p and 720p in that time interval, with 240p having the lowest gradient. This means that the inter-request times for all quality levels between 240p and 720p, discussed first in the
                         following paragraphs, show a similar distribution, with 240p having slightly fewer smaller
                         inter-request times compared to the others. For 240p, nearly 60% of all requests show an
                         inter-request time of less than 6 s, while for the others it is 65% to 70%.
                               The 144p and 1080p resolutions show a different behavior. While 144p shows more
                         instances of larger inter-request times, with only 38% smaller than 6 s, more instances of
                         smaller times are visible for 1080p, with 88% being below 6 s. This observation shows
                         the tendency of having larger inter-request times for poorer quality. One reason for this
                         behavior is the different amounts of data that must be downloaded for specific resolutions,
                         with less data for smaller ones. To minimize the overhead, more seconds of video playback
                         can be requested with a single request for lower resolution. Another reason for this
                         behavior is that lower resolution is used during phases with bandwidth issues. There,
                         the next request can be delayed because the download of the previous one is not finished.
                         Additionally, in the course of this work, the downlink duration and the downlink size
per request are studied. The results are similar to those of the inter-request time study and are thus presented in Appendix C.

                         5.5. Quality Change
                              The other video quality based metric influencing the received streaming QoE of an
                         end user is quality change. According to the literature, the duration and the volume of
                         quality change are of interest [3]. Furthermore, the difference in quality before and after
                         the quality change is important; that is, whether the next higher or lower resolution is
                         chosen after a quality change or if available resolutions are skipped [41]. A more detailed
overview of the quality changes detected in the dataset is provided in the following paragraphs.
                              In total, 22,123 instances of quality change are detected. By mapping these quality
                         changes to requests, 1.94% of all requests are marked as requests with changing quality.
A quality change is mapped to request x if it occurs between request x and request x + 1 according to Figure 4. An in-depth analysis of the quality before and after a change is provided
                         in Appendix D for interested readers.
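The mapping of quality changes to requests can be sketched as follows; the timestamps of fmt-value changes are assumed to be extracted from the application log, and the per-request records follow the extraction sketch above.

```python
# Sketch: mark request x as a quality-change request if a change of the fmt value
# occurs between the timestamps of request x and request x+1 (cf. Figure 4).
def label_quality_changes(requests, change_timestamps):
    labels = [0] * len(requests)
    for t_change in change_timestamps:
        for i in range(len(requests) - 1):
            if requests[i]["t_request"] <= t_change < requests[i + 1]["t_request"]:
                labels[i] = 1
                break
    return labels
```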

                         5.6. Stalling
                               An overview of the stalling behavior is provided next, as the most important metric
                         influencing streaming quality degradation. Out of the 13,759 valid video runs in the
                         dataset, stalling is detected in 3328 runs, totaling 6450 stalls. In about 50% of the stalling
                         videos, one stalling event is detected, and a second event is detected in another 25% of the
                         runs. Only in less than 3% of all stalling videos are more than 5 stalling events measured.
                         Furthermore, in 40% of all videos where stalling is detected, two or more different qualities
                         were played out when stalling occurred. This means that, although the quality is adapted
                         to the currently available bandwidth, a stalling event occurred. With further regard to the
network parameters investigated in cases of stalling, the data show that stalling occurred while the download of between one and 30 requests was started. This means that stalling continues although further video chunk requests are sent. For that reason, it might not be sufficient to monitor only high inter-request times or delayed requests for stalling detection.
                         More details about stalling position, duration, and distribution across the qualities in the
                         dataset are available in Appendix E.

                         6. Methodology
                              This section presents the general methodology of this work. For each streaming quality
relevant metric discussed in this work, the prediction approach is presented. Initially, an overview of all used features, the general ML approach used for all target metrics, and the general data postprocessing is given. Afterwards, the initial delay and the quality prediction are discussed first, since their estimation is done without additional information about the streaming behavior. Then, the approach for quality change and stalling prediction is introduced by presenting the idea of video phase detection and prediction and its correlation to specific uplink request patterns. Prediction specific information for these
                         approaches is given at the end of this section.

                         6.1. Feature Overview
To obtain satisfactory and correct results in the ML process, a thorough feature
                         selection is required. The goal of this work is to keep the volume of network layer data and
                         thus processing time required for the analysis as low as possible without losing prediction
                         accuracy. Thus, only uplink traffic and aggregated downlink traffic are used, instead
                         of analyzing each downlink packet. Based on this, an overview of all features used for
                         prediction, with an explanation, is given in Table 6.
                              The request start timestamp f1 describes the moment in time when a request is sent to
                         the server relative to the initial request. Thus, the initial request is always set to zero. The
request size feature f4 is the size of the single uplink packet requesting a new chunk. Note that a description of the other non-trivial features is given in Figure 4. For correct ML behavior, all categorical features are one-hot-encoded, especially to avoid unwanted behavior with TCP and QUIC. The ML models for the important streaming quality metrics are presented based on these features in the paragraphs that follow.
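A minimal sketch of how the per-request feature table of Table 6 could be assembled with pandas, including the one-hot encoding of the protocol feature, is shown below; it assumes per-request records as produced by the network data extraction sketch and is not the original feature pipeline.

```python
# Sketch: build the per-request features f1-f3 and f10 of Table 6 with pandas;
# the remaining per-request counters (request size, downlink/uplink bytes and
# packets, port) are assumed to be present in the extracted request records.
import pandas as pd

def build_feature_table(requests):
    df = pd.DataFrame(requests)
    df["request_start"] = df["t_request"] - df["t_request"].iloc[0]    # f1
    df["inter_request_time"] = df["t_request"].diff().fillna(0.0)      # f2
    df["downlink_duration"] = df["t_last_downlink"] - df["t_request"]  # f3
    return pd.get_dummies(df, columns=["protocol"])                    # f10, one-hot encoded
```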

                         6.2. Feature Set Creation
                               The prediction quality is highly influenced by, among other things, a thorough feature
selection process. For that reason, three feature sets are created for each prediction goal: the full feature set with all available features, the selected feature set using the features with the highest importance scores after an initial importance ranking, and the uplink feature set with only uplink information. An overview
                         of all feature sets for all predicted metrics and the associated features is given in Table 7.

                         6.2.1. Full Feature Set
The full feature set contains all relevant features for each prediction case, and thus all available features for the initial delay prediction. To predict the other metrics, the request start and the port are excluded. For these tasks, the durations are more important than the actual timestamps of the request starts. While the port information could be helpful for detecting server changes, its feature importance is investigated but achieves only a low score. Therefore, the port information is discarded, as the port is usually randomly selected and the ML model may overfit on this feature.

                         6.2.2. Selected Feature Set
To create the selected feature set, SelectKBest [42] is used to obtain feature importance scores with the full feature set as input. The result of the SelectKBest algorithm for all
                         predicted metrics is presented in Table 8. The mutual_info_regression score function is
                         used for the regression based initial delay prediction, whereas for all classification based
                         predictions, the mutual_info_classif score function is used.
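A sketch of this selection step with scikit-learn is shown below; the number of selected features k is a free parameter and the variable names are illustrative.

```python
# Sketch: rank features with SelectKBest, using mutual_info_regression for the
# regression-based initial delay target and mutual_info_classif otherwise.
from sklearn.feature_selection import SelectKBest, mutual_info_classif, mutual_info_regression

def select_features(X, y, regression, k=3):
    score_func = mutual_info_regression if regression else mutual_info_classif
    selector = SelectKBest(score_func=score_func, k=k).fit(X, y)
    return selector.scores_, selector.get_support(indices=True)
```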

                         Table 6. Feature overview.

                               Index               Feature                       Explanation

                               f1                  request start                 relative request timestamp
                               f2                  inter-request time            time between two requests
                               f3                  downlink duration             request to last downlink packet
                               f4                  request size [byte]           size of request packet
                               f5                  downlink [byte]               cum. downlink for request
                               f6                  downlink packets              packets downloaded for request
                               f7                  uplink [byte]                 cumulated uplink for request
                               f8                  uplink packets                uplink packets for request
                               f9                  port                          server port of video flow
                               f10                 protocol                      used network layer protocol

                         Table 7. Overview of all feature sets.

                              Metric                          Full             Selected            Uplink

                              initial delay                   f1–f10           f1–f8               f1, f2, f4, f7
                              quality                         f2–f8, f10       f2–f8               f2–f4, f7, f8, f10
                              streaming phase                 f2–f8, f10       f2, f3, f5          f2, f4, f7, f8, f10
                              quality change                  f2–f8, f10       f2, f3, f5          f2, f4, f7, f8, f10
                              stalling                        f2–f8, f10       f2, f3, f5          f2, f4, f7, f8, f10