Uplink vs. Downlink: Machine Learning-Based Quality Prediction for HTTP Adaptive Video Streaming - MDPI
sensors
Article
Uplink vs. Downlink: Machine Learning-Based Quality
Prediction for HTTP Adaptive Video Streaming
Frank Loh *, Fabian Poignée, Florian Wamser, Ferdinand Leidinger and Tobias Hoßfeld
Institute of Computer Science, University of Würzburg, 97074 Würzburg, Germany;
fabian.poignee@informatik.uni-wuerzburg.de (F.P.); florian.wamser@informatik.uni-wuerzburg.de (F.W.);
ferdinand.leidinger@informatik.uni-wuerzburg.de (F.L.); tobias.hossfeld@informatik.uni-wuerzburg.de (T.H.)
* Correspondence: frank.loh@informatik.uni-wuerzburg.de
Abstract: Streaming video is responsible for the bulk of Internet traffic these days. For this reason,
Internet providers and network operators try to make predictions and assessments about the stream-
ing quality for an end user. Current monitoring solutions are based on a variety of different machine
learning approaches. The challenge for providers and operators nowadays is that existing approaches
require large amounts of data. In this work, the most relevant quality of experience metrics, i.e., the
initial playback delay, the video streaming quality, video quality changes, and video rebuffering
events, are examined using a voluminous data set of more than 13,000 YouTube video streaming
runs that were collected with the native YouTube mobile app. Three Machine Learning models are
developed and compared to estimate playback behavior based on uplink request information. The
main focus is on developing a lightweight approach that uses as few features and as little data as
possible, while maintaining state-of-the-art performance.
Keywords: HTTP adaptive video streaming; quality of experience prediction; machine learning
Citation: Loh, F.; Poignée, F.; Wamser, F.; Leidinger, F.; Hoßfeld, T. Uplink vs. Downlink: Machine Learning-Based Quality Prediction for HTTP Adaptive Video Streaming. Sensors 2021, 21, 4172. https://doi.org/10.3390/s21124172

Academic Editor: Nikolaos Thomos

Received: 15 April 2021; Accepted: 9 June 2021; Published: 17 June 2021

Publisher's Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Copyright: © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction

The ongoing trend in social life to often use a virtual environment is accelerated by the COVID-19 pandemic. Throughout the last year in particular, work, social, and leisure behaviors have changed rapidly towards the digital world. This development finds resonance in the May 2020 Sandvine report, which revealed that global Internet traffic was dominated by video, gaming, and social usage in particular, with these accounting for more than 80% of the total traffic [1], with YouTube hosting over 15% of these volumes.

For video streaming, the Quality of Experience (QoE) is the most significant metric for capturing the perceived quality for the end user. The initial playback delay, streaming quality, quality changes, and video rebuffering events are the most important influencing factors [2–4]. Due to the increasing demand, streaming platforms like YouTube and Netflix have had to throttle the streaming quality in Europe in order to enable adequate quality for everybody on the Internet [5]. This affects the overall streaming QoE for all end users, and ultimately the streaming provider's revenue through long-term user churn.

From the perspective of an Internet Service Provider (ISP), responsible for network monitoring, the goal is to satisfy their customers and operate economically. Intelligent and predictive service and network management is becoming more important to guarantee good streaming quality and meet user demands. However, since most of the data traffic is encrypted these days, in-depth monitoring with deep packet inspection is no longer possible for an ISP to determine crucial streaming-related quality parameters. It is therefore necessary to predict quality by other flow monitoring and prediction techniques. Furthermore, the increasing load and different volumes of flows, and consequently the processing power required to monitor each flow, make detailed prediction even more complex, especially for a centralized monitoring entity.
In this scenario, given the enormous number of different streaming sessions and
the associated humongous global data exchange, the analysis of every video stream at
packet level cannot be carried out. To enable monitoring in a decentralized way with
limited resources available in the last mile, a lightweight approach without unmanageable
overhead is important to improve current services. This leads to a scalable in-network QoE
monitoring at decentralized entities through a proper QoE prediction model.
This work reviews several different Machine Learning (ML) approaches to estimate the
initial playback delay, video playback quality, video quality changes, and video rebuffering
events using data originating from the native YouTube client on a mobile device. The goal
of this work is to investigate the following research questions:
1. Is estimation of the most important QoE parameters possible using only low-volume
uplink data?
2. What data is required and how much prediction accuracy impairment is acceptable with
only uplink data, compared to state-of-the-art downlink-based ML prediction approaches?
From our investigation, we show that the Random Forest (RF)-based approach in
particular shows promising results for all metrics. Using only uplink information, this
approach is lightweight compared to downlink-based predictions. Based on a dataset that
was measured over 28 months between November 2017 and April 2020, a total of over
13,000 video runs and 65 days of playback are evaluated. Studying the performance of
the models provides valuable insights, helping to draw conclusions on the applicability of
specific models and parameter options. The contribution of this work is threefold:
1. Three different general ML models are defined, compared, and evaluated on the entire
dataset with YouTube streaming data from the mobile app in order to predict the most
common QoE-relevant metrics. It is shown that even a simple RF-based approach
with selected features shows F1 scores of close to 0.90 to help decide whether video
rebuffering occurs during the download of a video chunk request.
2. Details of ML improvements are discussed. This defines a lightweight approach that
only uses uplink traffic information, but with only a marginal decrease in prediction
accuracy for a target metric. The learning is based on the fact that during a streaming
session, the uplink requests are sent in series to the content server to fill the client’s
buffer. After playback starts and the buffer is full, the behavior changes to an on–off
traffic pattern, which is characteristic for streaming today. This is utilized in the
learning algorithm, since the complete playback behavior is reflected in a change in
the inter-arrival times of the uplink requests.
3. The third contribution is to verify that the approach is practically feasible and requires fewer monitoring resources by estimating the processing and monitoring effort
for random artificially-generated video content and defining a queuing model for
studying the load of a monitoring system.
Each model is validated using videos other than those utilized for training, which
leads to only slightly worse results. As a result, our work is a valuable input for network
management to improve streaming quality for the end user. The remainder of this work
is structured as follows: in Section 2, background information is defined with a focus
on video streaming and streaming behavior used as information in the methodology in
this work to predict streaming quality. The streaming quality metrics are also introduced
here. Section 3 summarizes related work, first with a literature overview and afterwards with an effort analysis differentiating uplink-only monitoring from the complete packet trace monitoring done in most related works. Section 4 summarizes the measurement
process involving the testbed and the studied scenarios, and Section 5 presents the dataset.
A general overview of relevant information in the dataset is provided, with focus on the
relevant streaming quality metrics. The dataset is used to train the models presented
in Section 6, where additional information about the features and feature sets is given.
The results are discussed in Section 7 for each QoE relevant streaming metric. Section 8
concludes the paper.
2. Background
This section contains important definitions and essential background information
on streaming, especially streaming-related terminology. The basic streaming process is
described, as are the individual phases that a streaming process undergoes. Different factors
influencing the overall streaming quality are introduced briefly, at the end of this section.
2.1. Terminology
Terms and concepts used in this work are defined in this subsection and differentiated
from each other. Accordingly, a short description of streaming is followed by explanation
of important terms from the network and application perspectives.
2.1.1. Video Streaming
Video streaming is the process of simultaneously playing back and downloading
audio and video content. Streaming is defined by the efficient downloading of the content
that is currently being played back by the user, which offers two advantages. From the
user’s point of view, playback can begin as soon as the content part matching the current
play time is loaded. From the provider’s perspective, only the part that is consumed
has to be downloaded, and therefore no resources are wasted. For streaming in general,
the available end-to-end throughput between content server and user must match the
video encoding rate. Any variation in the video encoding rate and network throughput
is balanced by a buffer. Playback continues as long as enough video and audio data are
available at the end-user’s player, otherwise the stream stops. These video interruptions
are called stalling. Thus, video streaming is a continuous request and download process.
Appendix A has more details about this process, and streaming in general.
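The interplay of downloading and playback described above can be illustrated with a minimal buffer model (an illustrative sketch with made-up chunk timings, not the actual player logic): arriving chunks fill the buffer, playback drains it in real time, and an empty buffer means stalling.

```python
def simulate_playback(chunk_arrivals, chunk_seconds, duration, step=0.1):
    """Toy buffer model: each arriving chunk adds `chunk_seconds` of media,
    playback drains one second of media per second of wall time, and an
    empty buffer means the video stalls. Returns total stalling time."""
    buffer = 0.0        # seconds of media currently buffered
    stall_time = 0.0
    arrivals = sorted(chunk_arrivals)
    i = 0
    t = 0.0
    while t < duration:
        while i < len(arrivals) and arrivals[i] <= t:
            buffer += chunk_seconds   # a downloaded chunk fills the buffer
            i += 1
        if buffer >= step:
            buffer -= step            # smooth playback drains the buffer
        else:
            stall_time += step        # buffer empty: playback stalls
        t += step
    return stall_time
```

For example, a single 5 s chunk at the start of a 20 s session leaves the player stalled for roughly 15 s, while one chunk covering the whole session never stalls.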
2.1.2. Traffic Patterns
Traffic patterns in video streaming are the mapping of the application video behavior
to network traffic. A pattern is a sequence of network packets corresponding to the re-
quested network chunks. Different sequences of network chunks are requested in different
playback situations, depending on what the buffer state is and how much data is required
at the application level. From this, the application state can be estimated. The understand-
ing of the traffic patterns forms the basis for detecting the current buffer level, and thus
the overall player state leading to the playout quality. The time of content requests is
calculated and compared based on the traffic patterns. If the inter-request time of video chunks is higher than the playback duration of the video requested with each chunk, the buffer depletes. If the inter-request time is shorter, the buffer fills up. Consequently, it is possible to draw conclusions about the buffer change from the uplink request traffic pattern. Different
streaming phases are defined in the following paragraphs. Based on the traffic patterns in
these phases, the streaming behavior estimation process is done in this work.
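This inference rule can be sketched directly (a simplified illustration that assumes a constant amount of media per request, which in reality varies per chunk):

```python
def buffer_trend(request_times, media_per_chunk):
    """Classify each inter-request gap: 'filling' if requests arrive faster
    than the media they carry is played out, 'depleting' if slower.
    `media_per_chunk` is the seconds of playback each request fetches
    (an assumed constant here for illustration)."""
    trends = []
    for prev, cur in zip(request_times, request_times[1:]):
        gap = cur - prev
        if gap < media_per_chunk:
            trends.append("filling")
        elif gap > media_per_chunk:
            trends.append("depleting")
        else:
            trends.append("steady")
    return trends
```

With 5 s of media per request, requests at 0 s, 2 s, and 12 s yield one filling gap and one depleting gap, mirroring the buffer change described above.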
2.2. Streaming Phase Definition
Streaming phases are states of the internal streaming process. At the user’s end, only
the effects and consequences of these phases on the streaming state are noticeable. To the
user that manifests as smooth streaming, a quality change, or a video rebuffering. From an
application point of view, two different phases in video streaming can be defined according
to [6], an initial buffering phase and a periodic buffer refill phase. In the initial buffering
phase, video data is requested and delivered in a best effort manner until the target buffer
threshold is reached. Then, the player switches to the periodic buffer refill phase. This
phase corresponds to the normal streaming behavior, where as much data is downloaded
as is extracted from the buffer for playback. The goal here is to keep the buffer slightly
above the target buffer threshold. Thus, the next data is only requested and downloaded
when the buffer level falls below the target buffer. As a result, there is no video data
downloaded as long as there is sufficient buffer. In contrast, the authors of [7] define three
phases. There, the initial buffering phase is called buffer filling and the periodic buffer refill phase steady-state.
Additionally, a buffer depletion phase is defined where the buffer level decreases, and thus
the amount of pre-buffered seconds also decreases.
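The three phases of [7] can be sketched as labels over a trace of buffered playback seconds (the target value and tolerance below are illustrative assumptions, not parameters from the paper):

```python
def label_phases(buffer_levels, target, eps=1.0):
    """Label each step of a buffered-seconds trace with the three phases
    from [7]: 'filling' while the buffer grows below the target,
    'steady' around the target, and 'depletion' while it shrinks."""
    labels = []
    for prev, cur in zip(buffer_levels, buffer_levels[1:]):
        if cur >= target - eps:
            labels.append("steady")
        elif cur > prev:
            labels.append("filling")
        else:
            labels.append("depletion")
    return labels
```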
For a detailed stalling analysis, the chunk request traffic patterns are analyzed in this
work for each of the streaming phases defined in [7], together with the pattern when the
playback is interrupted and the video stalls. This is essential for an insight as to the phases
with higher and lower influence on quality.
2.3. Quality Influencing Factors
Since this work deals with video playback quality estimation, this section introduces
factors that influence playback quality for the end user. According to [3], the most important
quality degradation factors for video streaming, influencing the QoE for the end user, are
the initial playback delay, the played out quality and playback quality changes, and stalling.
2.3.1. Initial Delay
Initial delay describes the waiting time between video request and video playback
start. According to [8], the initial delay can be divided into two parts, the initial page load
time and the time delay between the initial video request and the playback start. While the
initial page load time is not only influenced by the currently available network conditions
and the video content delivery network (CDN) state and quality, but also, among others,
by the smartphone type, quality, and memory usage, this work investigates the delay from
initial video request to playback start as initial delay.
2.3.2. Streaming Quality and Quality Changes
Next, the currently played out quality, especially the amount and frequency of quality
changes, influences the QoE for a streaming end user. Since, in HTTP Adaptive Streaming
(HAS), the goal is to adapt the playback bitrate to the currently available bandwidth,
especially in high variability bandwidth scenarios, frequent quality changes are possible.
However, since, according to [3], a constant though lower quality is preferable over frequent quality changes with a slightly higher average quality, the goal is to reduce quality changes to a minimum while, on the other hand, increasing the overall average quality. For that reason, in most streaming players and adaptation algorithms,
different metrics such as a quality change threshold are implemented, which define specific
rules to trigger a quality change.
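Such a rule can be sketched as a simple rate-based adaptation with a switch threshold (the bitrate ladder and the safety margin are illustrative assumptions, not the logic of any real player):

```python
def select_quality(bandwidth_kbps, bitrates_kbps, current, margin=1.2):
    """Toy rate-based adaptation with a switch threshold: only move to a
    higher rendition if the measured bandwidth exceeds its bitrate by a
    safety margin, and only drop when the current one is unsustainable."""
    sustainable = [i for i, b in enumerate(bitrates_kbps)
                   if b * margin <= bandwidth_kbps]
    best = max(sustainable) if sustainable else 0
    if best > current:
        return best                      # upgrade only past the threshold
    if bitrates_kbps[current] > bandwidth_kbps:
        return best                      # downgrade when unsustainable
    return current                       # otherwise avoid a switch
```

The hysteresis shows in the middle case: when the bandwidth covers the current bitrate but does not clear the margin for an upgrade, the player stays put instead of oscillating.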
2.3.3. Stalling
Stalling, the interruption of video playback, is the last metric influencing the video
QoE, which is discussed in this section. Since stalling has the highest negative influence on
the overall QoE [3], a detailed QoE detection or estimation is based on the possibility of
detecting or estimating stalling events. Furthermore, since, according to the literature, the
QoE degradation is worse with many small stalling events compared to fewer but longer
events [9], it is also important to detect small stalling instances, and short playback phases
between stalling events.
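Because many short stalls degrade QoE more than one long stall, the number of events matters as much as the total stalling time. A sketch extracting individual stalling events from a sampled player-state trace (the state names are hypothetical):

```python
def stalling_events(player_states, timestamps):
    """Extract (start, duration) of each stalling event from a sampled
    player-state trace, so that counts and lengths can feed QoE models
    that penalize many small stalls more than fewer long ones [9]."""
    events = []
    start = None
    for t, state in zip(timestamps, player_states):
        if state == "stalling" and start is None:
            start = t                     # stalling event begins
        elif state != "stalling" and start is not None:
            events.append((start, t - start))
            start = None                  # playback resumed
    if start is not None:                 # trace ends while stalling
        events.append((start, timestamps[-1] - start))
    return events
```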
3. Related Work
There are different methodologies in the literature to quantify the video playback
quality, detect and estimate the metrics influencing the Quality of Service (QoS), and at
the end predict the perceived QoE for the end user. For that reason, this section covers
the literature discussing methods to detect or estimate quality degradation metrics, QoS
or QoE measurement or prediction approaches, and different ideas to quantify playback
quality. Furthermore, request and feature extraction approaches relevant for streaming
based works are presented. An overview of selected, especially relevant, related work
is summarized in Table 1. Six main factors, reflected in the table, differentiate the approaches: real-time applicability, the target platform, the approach itself, the prediction goal, the data focus, and the target granularity, ranging from a time-window-based approach to a complete session.
Table 1. Overview of selected related work.

| Reference | Real Time | Target Platform | Approach | Prediction Goal | Data Focus | Granularity |
|---|---|---|---|---|---|---|
| Mazhar'18 [10] | yes | YouTube desktop | tree-based | QoE | packet | time-window |
| Wassermann'20 [11] | yes | YouTube mobile, desktop | RF | QoE | packet | time-window |
| Orsolic'20 [12] | yes | YouTube desktop | tree-based | QoE/KPI | packet | session |
| Bronzino'19 [13] | no | YouTube, Netflix, Twitch desktop | RF | initial delay, resolution | packet | time-window |
| Dimopoulos'16 [14] | no | YouTube desktop | RF | stalling, quality, quality changes | packet | session |
| Shen'20 [15] | yes | YouTube, Bilibili desktop | CNN | initial delay, resolution, stalling | packet | time-window |
| Lopez'18 [16] | yes | - | CNN, RNN | QoE | packet | time-window |
| Mangla'18 [17] | no | - | session modeling | QoE | packet, request | session |
| Gutterman'20 [18] | yes | YouTube mobile, desktop | RF | buffer warning, video state, video quality | request | time-window |
| Our work'21 | yes | YouTube mobile | RF, NN, LSTM | QoE | request | time-window |
There are many works in the literature describing, predicting, or analyzing the QoE.
A comprehensive survey of QoE in HAS is given by Seufert in [3]. Besides knowing the
influencing factors, another important goal is to measure all relevant parameters for good
streaming behavior. One of the first works tackling streaming measurements calculating the
application QoS and estimating the QoE based on a Mean Opinion Score (MOS) scale was
published in 2011 by Mok [19]. However, with the widespread adoption of encryption in
the Internet, and also in video streaming, the approaches became more complex, and simple
Deep Packet Inspection (DPI) to receive the current buffer level was no longer possible.
Nowadays, several challenges must be resolved for in-depth quality estimation in
video streaming: First, data must be measured on a large scale. This is already being done
for several platforms, including, among others, YouTube [20], Netflix and Hulu [21], and
the Amazon Echo Show [22]. Afterwards, a flow separation and detection is essential to
receive only the correct video flow for further prediction. Here, several approaches exist
for streaming, e.g., [7,23], as ISP middle-box [24], or with different ML techniques [25].
After determination of the correct video flow, the approach is defined. In modern
streaming, two approach types are well adapted: session reconstruction and ML models.
The session reconstruction-based approaches show that it is possible to predict all relevant
QoE metrics. Whereas in [26], only stalling is predicted with session reconstruction, Mangla
shows its utility in [17] for modeling other QoE metrics and comparing them to ML-based
approaches. Similar works have been published by Schatz [27] and Dimopoulos [28] to
estimate video stalls by packet trace inspection of HTTP logs.
Especially in recent years, several ML based approaches have been published for
video QoE estimation. For YouTube in particular, according to Table 1, an overall QoE
prediction with real time approaches and focus on inspecting all network packets, is
available from Mazhar [10], Wassermann [11], and Orsolic [12]. For all works, an in-depth
analysis of voluminous network data is done, and more than 100 features each are extracted.
While Mazhar only provides information about prediction results for the YouTube desktop
version, the mobile version is also considered by Wassermann. Furthermore, Wassermann
shows that a prediction of up to 1 s granularity is possible with a time-window based
approach, with accuracies of more than 95%. The reported recall for stalling, the most important QoE metric, is 65% for their bagging approach and 55% for the RF approach. The precision is 87% and 88%, respectively. In this work, we achieve macro-average recall values
of more than 80% and a comparable precision. Furthermore, Orsolic provides a framework
that can also be utilized for other streaming platforms.
Different approaches are adopted by, among others, Bronzino [13] and Dimopou-
los [14]. Although the two works are not real time capable, Dimopoulos studies the QoE
of YouTube with up to 70 features, while Bronzino also trains the presented model with
Amazon and Twitch streaming data by calculating information like throughput, packet
count, or byte count.
Gutterman [18] has a different approach, in which, compared to the packet based
approaches, a video chunk-based approach is chosen, with four states: increasing buffer,
decreasing buffer, stalling, and steady buffer. The approach is similar to this work, while the
data resolution on the application layer is higher in the approach from Gutterman. In Gutterman's dataset, the phases with a dropping buffer are even more underrepresented, making up only 5.9% to nearly 7% of the data, compared to 9.9% in the present work. Furthermore,
Gutterman uses sliding windows up to a size of 20, including historic information up to
200 s. Overall, similar results are achieved. When we compare the video phase prediction to the best results of Gutterman's study, obtained for the Requet App-LTE scenario, Gutterman achieves slightly better precision and recall scores for the stalling phase.
The results for the steady phase and the recall for the buffer increase are similar. In the buffer
decay phase (depletion in this work), we outperform Gutterman by more than 6% in the
precision score and 50% in the recall score. Nevertheless, with the difference in dataset and
the amount of included historic information, a complete comparison is difficult.
While different ML algorithms are investigated, most of the related approaches men-
tioned use at least one type of tree-based ML model. This popularity is a result of multiple
factors, including ease of development and lower consumption of computational resources.
Moreover, the tree-based algorithms, e.g., RF, perform on a similar level or better than
others [11,13,18], when comparing multiple algorithms within one work.
Gutterman [18] investigates the performance of a Neural Network (NN), with similar results compared
to a tree-based approach. Shen [15] uses a Convolutional Neural Network (CNN) to infer
initial delay, resolution, and stalling. The real-time approach uses video data from YouTube
and Bilibili and predicts QoE-metrics for windows of 10 s length. Among the tested ML
algorithms, the CNN achieves the best results. The potential of deep learning to predict
QoE is emphasized further in the work of Lopez [16] by combining CNN and Recurrent
Neural Network (RNN).
3.1. Preliminary Considerations on Monitoring and Processing Effort
The approaches in the literature use different time windows and parameters of uplink
and downlink in different granularity. Given the varying degrees of complexity, there is a
trade-off between accuracy and computational effort.
This is especially important in terms of predicting streaming video parameters in
communication networks on a large, distributed scale. An efficient approach is usually
preferred here, which can be adopted on the last mile close to customers where little
computing power is available. Hence, the question arises, with regard to the related
work above: How much monitoring and computational effort is required to achieve high
accuracy with machine learning techniques for video streaming?
In preparation for the following study, we examine how much data might be generated
during video streaming of arbitrary videos and how much data needs to be monitored and
processed with regard to the various methods available in the literature. While there are a
large number of proposed approaches that detect quality, stalling, and the playback phases,
the approach in this work focuses exclusively on partial information for monitoring and
processing, e.g., the uplink information and the features generated from it.
In the following steps, 9518 random artificially generated videos are created according
to models from the literature [29,30], and the effort is quantified that is required to obtain
the specified partial information while streaming them (streaming simulation). Note that
this is meant to be a pre-study with focus on arbitrary video content. For the dataset used
for the actual study later, we refer to Sections 4.2 and 5 for more information on the dataset
and measurement scenarios. The generated data, scripts, and additional evaluation are
available in [31]. The aim is to show how much data and effort is required when machine
learning is used with partial data in comparison to a full monitoring. Later on, in the next
sections, machine learning is carried out that quantifies the accuracy for different feature
sets on partial data.
3.1.1. Methodology
The basis for this evaluation is the modeling of different videos with data-rate models that generate video streaming content. This content is then streamed with the help of a simulation of adaptive bitrate adaptation logics (ABRs) and mapped to requests at the network layer in order to determine the amount of data exchanged while streaming. All traffic
models for H264/MPEG-like encoded variable bitrate video in the literature can be broadly
categorized into (1) data-rate models (DRMs) and (2) frame-size models (FSMs) [29]. FSMs
focus on generating single frames for video content in order to gain the ability to create
arbitrary yet realistic videos [30,32]. The approach works as follows: First, we create the
video content in different resolutions. This results in a variable bitrate per second video
content for each resolution. Afterwards, adaptation is applied to the video content with
ABRs. In this case, an adaptation with three quality levels is applied, each lasting one third of the total video length. Finally, this can be used to calculate how many bytes must
be processed and monitored if (1) the entire video has to be considered or (2) if only 5, 3, or
1 HTTP requests in up- and downlink have to be considered.
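The byte-counting step can be sketched as follows (a simplified illustration: the 5 s chunk length is an assumption, and the constant-bitrate trace only loosely mirrors the average bitrates in Table 2):

```python
def monitored_bytes(per_second_bytes, chunk_seconds=5, last_n=None):
    """Bytes a monitor must process for one streamed video. The trace is
    split into fixed-length chunk requests; with `last_n`, only the most
    recent n requests are considered, as in a time-window approach."""
    chunks = [sum(per_second_bytes[i:i + chunk_seconds])
              for i in range(0, len(per_second_bytes), chunk_seconds)]
    if last_n is not None:
        chunks = chunks[-last_n:]
    return sum(chunks)

# three quality levels, each one third of a 300 s video, using the average
# bitrates of 360p, 480p, and 720p from Table 2 (kbps converted to B/s)
trace = [570 * 125] * 100 + [1057 * 125] * 100 + [2172 * 125] * 100
full = monitored_bytes(trace)              # bytes for the entire video
window = monitored_bytes(trace, last_n=5)  # bytes for the last 5 requests
```

The window variant touches only a small fraction of the full trace, which is exactly the saving quantified in the next section.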
3.1.2. Results
Table 2 shows the generated video content according to [30,32]. We have generated a total of 9518 videos, with 9518 samples for each common resolution from 144p to 1080p. Figure 1a shows the empirical cumulative distribution function (CDF) of the bytes
required for streaming a randomly generated video. The individual lines in the plot
reflect the effort required to consider the downlink traffic, the uplink traffic, and the entire
corresponding traffic. Overall, the 50% quantile results in a traffic reduction of around 86%
if only the uplink has to be considered. This shows the advantage of approaches that only
require uplink data for monitoring and prediction. As suggested above, these approaches
are particularly interesting in practice when there are no large computational resources
available; for example, when making predictions close to the user within the last mile.
Figure 1b shows the data effort if only a time window of the last x HTTP requests
(1, 3, or 5 requests) is used. This is of particular interest for the prediction itself, since only
a small amount of data has to be processed here.
Table 2. Feasibility study: artificially generated video content; video duration: 5–15 min.

| Resolution | # Samples | Bitrate [kbps] min/∅/max | Bitrate [kbps] Q10/Q90 | Size [MB] min/∅/max |
|---|---|---|---|---|
| 144p | 9518 | 32/114/637 | 70.2/167.6 | 2/8/20 |
| 240p | 9518 | 175/256/749 | 212.3/310.1 | 7/19/33 |
| 360p | 9518 | 489/570/1092 | 526.5/624.2 | 19/41/69 |
| 480p | 9518 | 974/1057/1546 | 1013.5/1111.1 | 38/79/124 |
| 720p | 9518 | 2091/2172/2661 | 2128.9/2225.2 | 79/160/249 |
| 1080p | 9518 | 4575/4657/5183 | 4613.3/4710.2 | 173/338/529 |
3.2. Queuing Model of QoE Monitoring Entity
This statement is further evaluated by a simple queuing model studying the moni-
toring load of a monitoring entity for QoE prediction depending only on uplink traffic,
compared to a full packet trace consisting of uplink and downlink traffic. A visualization
of the model is given in Figure 2.

Figure 1. Streaming simulation results. (a) Uplink and downlink streaming data to be processed for monitoring or prediction. (b) Data effort to be monitored or processed with the time-window approach.
3.2.1. General Modeling Approach
During traffic flow monitoring, all packets of a video stream arrive at the monitoring
instance, wait until the monitoring entity is free, and are subsequently processed. This is
described by a Markov model with a packet arrival process A, an unlimited queue, and
a processing unit B. The arrival process can be described as a Poisson process with rate
λ when a sufficient amount of flows are monitored in parallel [33]. This holds true for
this approach, since studying a small number of flows does not require this effort. The
processing unit B processes the packets in the system with rate µ. To analyze the model, a
general independent processing time distribution with expected average processing time
E[B] = 1/µ can be assumed. This holds true, for example, for a deterministic processing time distribution. Then, the system can be described as an M/GI/1 − ∞ system. For a stable system, λ < µ must hold, which means the arrival rate must not exceed the processing rate,
in order to not overload the system. Furthermore, for real-time monitoring or to satisfy
specific network management service level agreements (SLA), a sojourn time or system
response time t < x can be defined, while x describes the delay in the system. For reasons
of simplicity and since the model is only used as an illustration, the following model is
analyzed as an M/M/1 − ∞ system with an exponential processing time distribution.
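The model can be illustrated with a small discrete-event sketch (illustrative parameters; for M/M/1, the simulated mean sojourn time should approach the analytic E[T] = 1/(µ − λ)):

```python
import random

def simulate_queue(arrival_rate, service_rate, n_packets=50000, seed=1):
    """Discrete-event sketch of the M/M/1 monitoring model: Poisson packet
    arrivals (rate lambda), one processing unit with exponential service
    times (rate mu), unlimited queue. Returns the mean sojourn time."""
    rng = random.Random(seed)
    t_arrival = 0.0
    server_free_at = 0.0
    total_sojourn = 0.0
    for _ in range(n_packets):
        t_arrival += rng.expovariate(arrival_rate)
        start = max(t_arrival, server_free_at)   # wait if the unit is busy
        service = rng.expovariate(service_rate)
        server_free_at = start + service
        total_sojourn += server_free_at - t_arrival
    return total_sojourn / n_packets
```

At λ = 0.5/s and µ = 1/s (utilization 0.5), the simulated mean sojourn time approaches the analytic value of 2 s.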
Figure 2. Visualization of the queuing model to quantify the monitoring load for the presented prediction approach: packets to be analyzed arrive at A, are queued, and are processed at B.
3.2.2. Modeling Results
For the usability study of the approach in a real environment, two different questions
are answered: (1) What is the average sojourn time or average response time of the system
E[ T ] dependent on the processing time of the processing unit within the monitoring entity;
and (2) How many video streams, or streaming flows, can be monitored in parallel without
exceeding a target delay with a probability q? Both questions are answered based on the arrival rates for uplink traffic (λ_ul,sim = 0.32/s) and the complete packet trace containing uplink and downlink traffic (λ_all,sim = 108.22/s) obtained from the streaming simulation above. Furthermore, the uplink traffic rate (λ_ul = 0.22/s) and the complete uplink and downlink traffic rate (λ_all = 79.65/s) of the complete dataset used for this work are used.
A detailed overview of the dataset follows, in Section 5.
To answer the first question, Little's law [34], E[X] = λ · E[T], is used to obtain the average response time, with E[X] = ρ/(1 − ρ) as the long-term average number of customers in a stationary system and ρ = λ/µ as the utilization of the processing unit. From this, E[T] is obtained as E[T] = E[B]/(1 − ρ), with E[B] = 1/µ as the average processing time of the processing
unit within the monitoring entity. As processing time, a value range between 1 µs and
1 s is studied in Figure 3a at the x-axis. The y-axis shows the resulting response time of
the system. The yellow line shows the result when monitoring uplink and downlink data
considering the streaming simulation, whereas the orange line is the result considering the
full packet trace received by the dataset. The results are shown in brown when monitoring
only uplink traffic for the streaming simulation, and in black for the dataset. It is evident
that the average response time is similar up to a processing time of 10 ms per packet. For
slower systems, the processing entity of the monitoring system is not able to analyze all
packets of the complete stream anymore. Thus, the response time increases drastically.
When only the uplink data is to be monitored, the processing entity can process all data up
to a processing time of more than 1 s per packet. Thus, slower and cheaper hardware could
be used.
Figure 3. Modeling results overview. (a) System response time comparison dependent on the
processing time of the monitoring entity for uplink and all traffic. (b) Possible number of parallel
supported flows dependent on a target delay of the monitoring entity for uplink and all traffic.
To answer the second question of how many streaming flows can be supported in
parallel, E[T] < x or P(T < x) = q is introduced, with x as the target processing time or
target delay in seconds. Furthermore, q is the target probability describing the percentage
of packets that must be analyzed faster than x to guarantee specific SLAs. For positive
t-values, the distribution function of the response time is given by F(t) = 1 − e^(−(µ−λ)t).
Solving this, the supported arrival rate λ depends on the target probability q according to
λ ≤ (ln(1 − q) + µt)/t. Figure 3b shows the number of flows supported in parallel
by a 1 G switch for monitoring only uplink traffic, compared to all uplink and downlink
packets, based on a target delay x in milliseconds. The assumption is that the switch can
process 1 Gbps and that packets always fill the common Internet maximum transmission
unit (MTU) of 1500 B. The target probability q is set to 90% and 99.9%, respectively. It is
obvious that focusing on only uplink-based monitoring allows 100 to 1000 times more
flows compared to full packet trace monitoring. Furthermore, since there is an active
trend towards increasing data rates with higher resolutions and qualities, uplink-based
monitoring is a valuable methodology to deal with current and future video streams while
saving hardware costs or distributing monitoring entities.
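The sizing argument above can be reproduced numerically; the following is a back-of-the-envelope sketch assuming the 1 Gbps switch and 1500 B MTU model from the text, with the per-flow packet rates of the dataset (λul = 0.22/s, λall = 79.65/s). The function name is illustrative.

```python
import math

def max_arrival_rate(mu, q, t):
    """Largest supported arrival rate: lambda <= (ln(1 - q) + mu * t) / t."""
    return (math.log(1.0 - q) + mu * t) / t

mu = 1e9 / (1500 * 8)                 # packets/s a 1 Gbps switch can forward at MTU size
t = 1e-3                              # target delay x = 1 ms
for q in (0.9, 0.999):
    lam = max_arrival_rate(mu, q, t)
    flows_uplink = lam / 0.22         # per-flow uplink packet rate (dataset)
    flows_all = lam / 79.65           # per-flow uplink + downlink packet rate (dataset)
    print(f"q={q}: {flows_uplink:.0f} uplink-only flows vs. {flows_all:.0f} full-trace flows")
```

The resulting ratio of roughly 79.65/0.22 ≈ 362 is consistent with the stated 100 to 1000 times more flows for uplink-only monitoring.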
4. Measurement Process
In-depth QoE monitoring of YouTube adaptive streaming is difficult, especially for
a mobile device. For that reason, this section summarizes the measurement approach by
introducing the testbed and the measured scenarios.
4.1. Testbed
The testbed used for data collection in this work is installed at the University of
Würzburg. It consists of a Linux control PC running an Appium server, several Android
smartphones running an Appium client, and a Wi-Fi dongle. On the smartphones, videos
are streamed via the native YouTube application for Android. Besides the challenge of
encrypted network traffic used for YouTube streaming [35], the major challenge is to access
the internal client data of the mobile app. In contrast to custom-made apps like the
YoMoApp [36], the approach presented in this work directly measures the native YouTube
app behavior with the Wrapper App [37]. This is done by accessing and controlling an
Android smartphone via USB and the Android Debugging Bridge (adb). In order to
manipulate the effective bandwidth for a measurement run, the Linux tools tc and NetEm
are used to shape the traffic on the Internet link of the PC, which acts as a proxy for the
mobile devices. The bandwidth for the smartphones is controlled accordingly. Note that
previous example measurements show a maximum incoming bandwidth of 400 Mbit/s and
always sufficient memory and CPU. For that reason, the control PC is not the bottleneck in
the streaming process. A dummy video is played between consecutive runs to ensure a full reset of
the bandwidth conditions. The quality setting in the YouTube App is set to automatic mode
to investigate the adaptive video streaming mechanisms. There are two important sources
of data, the network and transport layer traces, which are captured with the tcpdump
tool and contain all the network data for a measurement run with a smartphone, and the
application layer data, which are obtained by repeatedly polling YouTube’s Stats for nerds.
More information about all available application parameters and the detailed Wrapper
App behavior are available in [37].
4.2. Measurement Scenarios
More than 75 different bandwidth scenarios are applied to study the streaming be-
havior of the native YouTube app. First, 38 constant bandwidth scenarios are measured to
analyze the general streaming and requesting behavior with the app. There, the bandwidth
ranges from 200 kbit/s to 2 Mbit/s in 100 kbit/s steps and from 2 Mbit/s to 10 Mbit/s in
500 kbit/s steps. Additionally, constant bandwidth settings of 20 Mbit/s, 25 Mbit/s, and
100 Mbit/s are studied to get insights into best case streaming situations. Second, specific
quality change and stalling scenarios are defined, where the bandwidth is lowered during
video playback. To force the player to request different qualities, several smaller and
larger variations in the bandwidth setting are applied, while scenarios with very abrupt
bandwidth degradation to 200 kbit/s or lower are used to trigger stalling in case of a
buffer underrun. Third, real bandwidth traces from mobile video streaming scenarios
according to [38,39] are used. The measurements show that the 4G traces of [38] are most
likely sufficient to stream the video in consistently high quality. Furthermore, no stalling
events could be detected when applying these scenarios. In contrast, the traces of [39] contain lower bandwidth
scenarios that most likely trigger quality changes. In most measurements, the quality is
automatically chosen by the YouTube App so as not to influence the common behavior
of the data requesting process. In all, 243 YouTube videos are measured for evaluation.
In addition to the 18 videos listed in [26], 225 additional random YouTube videos are
measured to cover a broad range of typical content. All data are aggregated to a large
dataset that serves as the basis for the machine learning process. The dataset is presented
in detail in the following paragraphs.
5. Dataset
For an in-depth analysis of the streaming behavior, a dataset is required that covers
specific and regular behavior for all QoE relevant metrics: the initial delay, the streaming
quality (for YouTube, the video resolution), quality changes, and stalling. For that reason,
a dataset is created in this work using the above-mentioned testbed and measurement
scenarios, which covers 13,759 valid video runs and more than 65 days of total playback
time. Parts of the dataset are already published for the research community [8,20]. In the
following subsection, first the dataset postprocessing is explained, giving details on
network and application data. Afterwards, a dataset overview is presented, with the main
focus being on streaming relevant metrics.
5.1. Postprocessing
The postprocessing of the application data and network traces in this work is twofold.
First, the network traces are analyzed and all video related flows are extracted. Afterwards,
the application data are postprocessed and combined information from the application and
the network is logged for further usage.
5.1.1. Network Data
The network data are captured in the pcap tcpdump format. By following the IP
addresses and ports resolved for googlevideo.com via DNS, all traffic associated with the
YouTube video can be separated from cross traffic like advertising, additional video
recommendations, or pictures. For each packet of this traffic, the packet arrival time, the
source and destination IP addresses and ports, and the payload size in bytes are extracted.
Consecutive download of video data is required in video streaming to guarantee
smooth playback. Therefore, requests asking for video data from the YouTube server are
detected in the uplink. Thus, all packets containing payload sent to the YouTube server are
extracted from the received packets and referred to as uplink chunk requests. Here, we assume
that the video is downloaded in a consecutive manner according to Figure 4. This means
request x is always requested before request x + 1 and downloaded completely. Thus,
all downlink traffic following a request in the same flow is marked as associated to that
specific request, before the next one is sent to the server. In this way, information such as the
amount of uplink and downlink traffic in bytes, amount of uplink and downlink packets,
and protocol associated with one request can be extracted. Furthermore, the last downlink
packet of request x, the end of downlink duration x in Figure 4, need not necessarily match
with the next uplink request timestamp.
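The consecutive-download assumption can be sketched as a simple single-pass association of downlink packets to the most recent uplink request; the packet tuple layout (timestamp, direction, payload bytes) and field names here are our own illustration, not the paper's tooling.

```python
def group_by_request(packets):
    """Assign every downlink packet to the most recent uplink chunk request.
    packets: time-ordered list of (timestamp, direction, payload_bytes),
    direction 'up' (towards the server) or 'down'."""
    requests = []
    for ts, direction, size in packets:
        if direction == "up" and size > 0:        # uplink chunk request
            requests.append({"t_req": ts, "dl_bytes": 0,
                             "dl_packets": 0, "t_last_dl": ts})
        elif direction == "down" and requests:    # downlink data for latest request
            r = requests[-1]
            r["dl_bytes"] += size
            r["dl_packets"] += 1
            r["t_last_dl"] = ts                   # end of downlink duration

    return requests

trace = [(0.0, "up", 310), (0.1, "down", 1400), (0.2, "down", 1400),
         (2.0, "up", 305), (2.1, "down", 1400)]
recs = group_by_request(trace)
print(recs[0]["dl_bytes"], recs[0]["t_last_dl"])   # → 2800 0.2
```

From such records, the per-request features used later (uplink/downlink bytes and packets, downlink duration, inter-request time) follow directly.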
Figure 4. Timeline of YouTube chunk requests and associated data download: uplink request x is
followed by chunk downlink x, spanning downlink duration x; the inter-request time (x, x + 1)
spans from request x to request x + 1.
5.1.2. Application Data
The application information Stats for nerds is available as a JSON-based file. There,
several application specific parameters are logged together with a timestamp in 1 s
granularity. Table 3 gives an overview of all parameters relevant for this work. The postprocessed
application data are added to the network data for further usage. The full dataset with all
information for each request used in the learning process in this work is available in [31].
Further details about data extraction are presented in [8,20].
Table 3. Application data overview.
Value Explanation
timestamp timestamp of measurement log
fmt video format ID (itag)
fps frames per second
bh buffer health
df dropped frames and played out frames
videoID ID of played video
5.2. Data Overview
After data postprocessing, a more detailed overview of the dataset can be given, as shown
in Table 4. The table shows that 7015 of all videos, or 51.0%, have at least one change of the
fmt-value in Stats for nerds. This change is a playback resolution change according to the
YouTube HAS algorithm, also used as quality change. Thus, in the following paragraphs,
the terms resolution change and quality change are used interchangeably. The dataset
shows 22,015 quality changes in all. This is detectable by having different video format ID
values (fmt-values) in a single video run. Furthermore, 2961 videos or 21.5% have at least
one stalling event, accounting for a total of 5934 stalling instances and more than 32 h of
stalling time. The stalling is detected by analyzing the buffer health and the played out
frames. If the timestamp of the log is increasing faster than the frames are played out and
the buffer health is close to zero, the player is assumed to stall.
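The detection rule just described can be sketched as follows; the slack and buffer-health thresholds are our own assumptions, since the text states no concrete values.

```python
def detect_stalls(log, bh_threshold=0.5, slack=0.5):
    """Illustrative stall detector on Stats for nerds logs (1 s granularity).
    log: list of (timestamp_s, played_frames, fps, buffer_health_s) entries."""
    stalls = []
    for (t0, f0, fps, _), (t1, f1, _, bh1) in zip(log, log[1:]):
        wall = t1 - t0                 # wall-clock time between log entries
        played = (f1 - f0) / fps       # playback time actually elapsed
        if played < wall - slack and bh1 < bh_threshold:
            stalls.append(t1)          # playback fell behind with an empty buffer
    return stalls

# Frame counter freezes at t=2 s while the buffer health drops towards zero.
log = [(0, 0, 30, 5.0), (1, 30, 30, 4.0), (2, 35, 30, 0.1), (3, 35, 30, 0.0)]
print(detect_stalls(log))   # → [2, 3]
```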
Table 4. General dataset overview.
Parameter Value
all runs 13,759
total runs quality change 7015
amount quality changes 22,015
total runs stalling 2961
total stalling events 5934
average video length 6 min 45 s
total number requests 1,142,635
The stalling duration varies from less than one second to 117 s across all videos in the
dataset. Short stalling of less than 5 s occurs in 20% of all stalling cases. The median
stalling duration is 10.5 s and the mean is 19.87 s. So as to have no conflict with the
initial playback delay, stalling within the first 5 s of video playback time is not considered.
Furthermore, for a learning approach with historic information, as investigated in this
work, stalling prediction at the very beginning of the video is not possible.
The dataset contains videos streamed with TCP and UDP, visible in the network
data. Among all runs, 22.9% are streamed with TCP, and the remainder with UDP.
A more detailed investigation of the main QoE degradation factors initial delay,
playback quality, quality changes, and stalling in the dataset follows in the respective
sections of the methodology.
To acquire more information about the parameters influencing specific streaming relevant
metrics, a dataset study is conducted first. Since no video specific information like the video
ID or resolution is available from encrypted network traffic, this is not studied in this work.
Furthermore, the available downlink bandwidth to stream the videos is not analyzed, since it
is not directly detectable from the traffic trace. Note that there is no explicit analysis of the
amount of totally downloaded and uploaded bytes for specific QoE relevant metrics, since
this is not the main focus of this work and is highly dependent on the video resolution. The
focus in the following paragraphs is on the most relevant influencing parameters with an
impact on the prediction result. Additional data set analysis is presented in Appendix B.
5.3. Initial Delay
The first non-zero played out frames value indicates when playback is first logged
in the Stats for nerds data. Since the logging granularity there is about 1 s, the exact
playback start cannot be read directly from the data. Using the number of played out
frames and the video fps, which states how many frames are played out within one
second, the initial playback start can be calculated. Furthermore, the initial video chunk
request timestamp is logged in the network data file; it is subtracted from the playback
start timestamp to obtain the ground truth initial delay. The analysis shows that the mean
ground truth initial delay is 2.64 s.
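The refinement described above can be sketched in a few lines; the helper function, its argument names, and the example values are illustrative, not part of the measurement toolchain.

```python
def initial_delay(first_request_ts, first_log_ts, played_frames, fps):
    """Refine the 1 s log granularity via the frame counter: playback started
    played_frames / fps seconds before the first non-zero log entry."""
    playback_start = first_log_ts - played_frames / fps
    return playback_start - first_request_ts

# First request at t = 0.4 s, first log entry at t = 3.0 s with 12 played frames at 30 fps.
print(round(initial_delay(0.4, 3.0, 12, 30.0), 3))   # → 2.2
```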
Request Study
Since the main objective of this work is estimating QoE relevant metrics based on
uplink requests, the number of uplink requests required to start the video playback is
studied for initial delay prediction. Therefore, all detected chunk requests before the
playback starts are summed up. Figure 5a shows in black the number of requests sent to
the YouTube server before the playback starts for all videos as CDF. The brown line shows
the number of requests completely downloaded according to the network data analysis
presented above. The figure shows that playback can start when only a single request
has been sent, and for a very small percentage of all videos even before the first request is
downloaded completely. Furthermore, for more than 60% of all measured runs, five requests
must be sent to start the video, and in total up to ten requests can be required until the initial
video playback starts. Since, according to Figure 5a, the number of chunk request downloads
started before playback start can be higher than the number of chunk request downloads
ended, video playback can start in the middle of an open chunk request download. This
happened in 69.72% of all runs. However, no additional information is received by further
analyzing this behavior.
Figure 5. Dataset study: initial delay and video resolution. (a) CDF of the number of requests sent,
and completely downloaded, before the video starts. (b) CDF of the inter-request time by resolution
(144p to 1080p).
5.4. Video Quality
For an in-depth analysis of the streaming behavior, next to the initial delay, the video
quality is an important metric. The video quality is described by the overall playback
quality and the extent of quality changes. Since this work deals with the analysis of
playback quality with network data without studying the effect of packet loss, no frame
errors like blur or fragments in single frames are observed. Thus, only the YouTube format
ID as quality is taken into consideration in this work.
In total, more than 1.1 million requests are measured, as summarized in Table 5.
The table shows that, except for the 1440p resolution, more than 80,000 requests are
measured for each resolution. The highest representative is 720p, which is the target
resolution for many videos on a smartphone without manually triggering a higher quality.
Since only 841 requests are measured for the 1440p video resolution and YouTube never
automatically triggered 1440p quality, only qualities up to 1080p are studied in the
following sections. Furthermore,
except for newest generation smartphones, which increasingly also support 4K videos,
most smartphones do not support a quality higher than 1080p. A screen resolution of 1440p
or higher was used at the end of 2019 in less than 15% of all phones [40]. For that reason, 1440p
video resolutions are omitted from further evaluation.
Table 5. Video quality overview.
Resolution Total Requests Percentage
144p 156,031 13.93
240p 85,221 7.61
360p 109,223 9.75
480p 184,963 16.51
720p 487,717 43.53
1080p 96,516 8.61
1440p 841 0.08
To gain greater insight into the requesting behavior of the different quality
representations within the dataset, the inter-request time of specific resolutions is studied
in the paragraphs that follow.
Inter-Request Time Study
Inter-request time is the main metric to describe the streaming process in this work.
An overview of the inter-request time for all available video resolutions is given in Figure 5b
as CDF. The x-axis shows the inter-request time in seconds, whereas the colors of the lines
represent the different resolutions, starting at 144p and ending with 1080p, with lighter
colors for higher resolutions.
The figure shows that the inter-request time for 10% to 20% of all requests, regardless
of resolution, is very small. The measurement shows that this behavior is visible most
likely at the beginning of the video, where many requests are sent. For lower resolution,
in the best effort phase at the beginning of the video, the data for a single request can be
downloaded very fast when enough bandwidth is available, since it is rather small. For that
reason, the next request is sent with only a short delay, leading to a short inter-request time.
Another explanation for this behavior is the parallel requesting of audio and video data.
A more detailed explanation of this is given in [26], although this plays no further part in
this work.
Figure 5b shows, for all resolutions, that more than 70% of all inter-request times are
between 0.1 s and 12 s. In particular, a rather linear distribution is visible for the
resolutions between 240p and 720p in that time interval, with 240p having the lowest
gradient. This means that the inter-request times for all quality levels between 240p and
720p, discussed first in the following paragraphs, show a similar distribution, with 240p
having slightly fewer small inter-request times compared to the others. For 240p, nearly
60% of all requests show an inter-request time of less than 6 s, while for the others it is
65% to 70%.
The 144p and 1080p resolutions show a different behavior. While 144p shows more
instances of larger inter-request times, with only 38% smaller than 6 s, more instances of
smaller times are visible for 1080p, with 88% being below 6 s. This observation shows
the tendency of having larger inter-request times for poorer quality. One reason for this
behavior is the different amounts of data that must be downloaded for specific resolutions,
with less data for smaller ones. To minimize the overhead, more seconds of video playback
can be requested with a single request for lower resolution. Another reason for this
behavior is that lower resolution is used during phases with bandwidth issues. There,
the next request can be delayed because the download of the previous one is not finished.
Additionally, in the course of this work, the downlink duration and the downlink size
per request are studied. The results are similar to the inter-request time study, and are
thus presented in Appendix C.
5.5. Quality Change
The other video quality based metric influencing the received streaming QoE of an
end user is quality change. According to the literature, the duration and the volume of
quality change are of interest [3]. Furthermore, the difference in quality before and after
the quality change is important; that is, whether the next higher or lower resolution is
chosen after a quality change or if available resolutions are skipped [41]. A more detailed
overview of quality change detected in the dataset is provided in the following lines.
In total, 22,123 instances of quality change are detected. By mapping these quality
changes to requests, 1.94% of all requests are marked as requests with changing quality.
A quality change is mapped to request x, if it occurs between the request of x and x + 1
according to Figure 4. An in-depth analysis of the quality before and after change is added
in Appendix D for interested readers.
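The mapping rule above can be sketched with a binary search over the request start timestamps; the helper name and example values are illustrative.

```python
import bisect

def map_change_to_request(request_times, change_time):
    """A quality change at time t is assigned to request x if
    t_req[x] <= t < t_req[x+1]; request_times must be sorted."""
    return max(bisect.bisect_right(request_times, change_time) - 1, 0)

request_times = [0.0, 2.0, 4.5, 7.0]
print(map_change_to_request(request_times, 5.1))   # → 2
```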
5.6. Stalling
An overview of the stalling behavior is provided next, as the most important metric
influencing streaming quality degradation. Out of the 13,759 valid video runs in the
dataset, stalling is detected in 3328 runs, totaling 6450 stalls. In about 50% of the stalling
videos, one stalling event is detected, and a second event is detected in another 25% of the
runs. Only in less than 3% of all stalling videos are more than 5 stalling events measured.
Furthermore, in 40% of all videos where stalling is detected, two or more different qualities
were played out when stalling occurred. This means that, although the quality is adapted
to the currently available bandwidth, a stalling event occurred. With further regard to the
network parameters investigated in cases of stalling, the data show that stalling occurred
while the downloads of between one and 30 requests were started. This means that stalling
continues although further video chunk requests are sent. For that reason, it might not be
sufficient to only monitor high inter-request times or delayed requests for stalling detection.
More details about stalling position, duration, and distribution across the qualities in the
dataset are available in Appendix E.
6. Methodology
This section presents the general methodology of this work. For each streaming quality
relevant metric discussed in this work, the prediction approach is presented. Initially, an
overview of all used features, the general ML approach used for all target metrics, and the general
data postprocessing is given. Afterwards, the initial delay and the quality prediction
are discussed in detail to begin with, since the estimation is done without additional
information of the streaming behavior. Then, the approach for quality change and stalling
prediction is introduced by presenting the idea with video phase detection, prediction, and
the correlation to specific uplink request patterns. Prediction specific information for these
approaches is given at the end of this section.
6.1. Feature Overview
To obtain satisfactory and correct results in the ML process, a thorough feature
selection is required. The goal of this work is to keep the volume of network layer data, and
thus the processing time required for the analysis, as low as possible without losing prediction
accuracy. Thus, only uplink traffic and aggregated downlink traffic are used, instead
of analyzing each downlink packet. Based on this, an overview of all features used for
prediction, with an explanation, is given in Table 6.
The request start timestamp f1 describes the moment in time when a request is sent to
the server relative to the initial request. Thus, the initial request is always set to zero. The
request size feature f4 describes the request size of the single uplink packet requesting a new
chunk. Note that a description of the other non-trivial features is given in Figure 4. For correct
ML behavior, all categorical features are one-hot-encoded, especially to avoid unwanted
behavior with TCP and QUIC. The ML models for the important streaming quality metrics
are presented based on the features, in the paragraphs that follow.
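The one-hot encoding of the categorical protocol feature (TCP vs. QUIC) can be sketched with plain Python; the category order and helper name are our own choice.

```python
def one_hot(values, categories=("TCP", "QUIC")):
    """Encode each categorical value as a binary indicator vector."""
    return [[1 if v == c else 0 for c in categories] for v in values]

print(one_hot(["TCP", "QUIC", "TCP"]))   # → [[1, 0], [0, 1], [1, 0]]
```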
6.2. Feature Set Creation
The prediction quality is highly influenced by, among other things, a thorough feature
selection process. For that reason, several feature sets are created for each prediction goal,
with three main objectives: The full feature set for prediction with all available features,
the selected feature set using the features with the highest importance scores after an initial
importance ranking, and the uplink feature set with only uplink information. An overview
of all feature sets for all predicted metrics and the associated features is given in Table 7.
6.2.1. Full Feature Set
The full feature set contains all relevant features for each prediction case; thus, all
features are used for the initial delay prediction. To predict the other metrics, the request
start and the port are excluded. For these tasks, the duration is more important than the
actual timestamps of the request starts. While the port information could be important for
detecting server changes, the feature importance of the port feature is investigated, but
achieves a low score. Therefore, the port information is discarded, as the port is usually
randomly selected and the ML model may therefore overfit with this feature.
6.2.2. Selected Feature Set
To create the selected feature set, SelectKBest [42] is used to obtain a feature importance
score with the full feature set as input. The result of the SelectKBest algorithm for all
predicted metrics is presented in Table 8. The mutual_info_regression score function is
used for the regression based initial delay prediction, whereas for all classification based
predictions, the mutual_info_classif score function is used.
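Assuming scikit-learn's SelectKBest, the ranking step might look as follows for a classification target; the feature matrix here is random stand-in data, not the paper's dataset.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)
X = rng.random((300, 8))              # stand-in for features f1-f8 per request
y = (X[:, 2] > 0.5).astype(int)       # toy target fully determined by feature index 2

# Keep the k features with the highest mutual information scores.
selector = SelectKBest(score_func=mutual_info_classif, k=3).fit(X, y)
print(selector.get_support(indices=True))   # indices of the 3 best-scoring features
```

For the regression based initial delay prediction, mutual_info_regression would replace mutual_info_classif as the score function.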
Table 6. Feature overview.
Index Feature Explanation
f1 request start relative request timestamp
f2 inter-request time time between two requests
f3 downlink duration request to last downlink packet
f4 request size [byte] size of request packet
f5 downlink [byte] cum. downlink for request
f6 downlink packets packets downloaded for request
f7 uplink [byte] cumulated uplink for request
f8 uplink packets uplink packets for request
f9 port server port of video flow
f10 protocol used network layer protocol
Table 7. Overview of all feature sets.
Metric Full Selected Uplink
initial delay f1–f10 f1–f8 f1, f2, f4, f7
quality f2–f8, f10 f2–f8 f2–f4, f7, f8, f10
streaming phase f2–f8, f10 f2, f3, f5 f2, f4, f7, f8, f10
quality change f2–f8, f10 f2, f3, f5 f2, f4, f7, f8, f10
stalling f2–f8, f10 f2, f3, f5 f2, f4, f7, f8, f10