UPTEC IT 21016
Examensarbete 30 hp
Juni 2021

Application failure predictions from neural networks analyzing telemetry data

Filip Hultgren
Max Rylander

Institutionen för informationsteknologi
Department of Information Technology
Abstract

Application failure predictions from neural networks analyzing telemetry data
Filip Hultgren, Max Rylander

With the revolution of the internet, new applications have emerged in our daily lives.
People depend on services for transportation, banking, and communication. Service
availability is crucial for survival and for competing against other service providers.
Achieving good availability is a challenging task. The latest trend is migrating systems
to the cloud. The cloud provides numerous methods to prevent downtime, such as
auto-scaling, continuous deployment, continuous monitoring, and more. However,
failures can still occur even when the preemptive techniques fulfill their purpose.
Monitoring the system gives insights into the system's actual state, but it is up to the
maintainer to interpret these insights. This thesis investigates how machine learning can
predict future crashes of Kubernetes pods based on the metrics collected from them. At
the start of the project, there was no available data on pod crashes, and the solution was
to simulate a 10-tier microservice system in a Kubernetes cluster to create generic data.
The project applies two different models, a Random Forest model and a Temporal
Convolutional Networks model, where the former acted as a baseline model. They
predict whether a failure will occur within a given prediction time window based upon
15 minutes of data. The project evaluated three different prediction time windows. The
five-minute prediction time window resulted in the best foresight based on the models'
accuracy. The Random Forest model achieved an accuracy of 73.4 %, while the TCN
model achieved an accuracy of 77.7 %. The models' predictions can act as an early alert
of an incoming failure, which the system or a maintainer can act upon to improve the
availability of the system.

Supervisor: Johan Hernefeldt
Subject reader: Filip Malmberg
Examiner: Lars-Åke Nordén
ISSN: 1401-5749, UPTEC IT 21016
Printed by: Reprocentralen ITC
Sammanfattning

With the revolution of the internet, new applications arise in our daily lives. People
depend on services such as transportation, banking, and communication. The availability
of these services is decisive for an application's survival and its competition against
rivals. Achieving good availability is a challenge. The latest trend is the migration of
systems to the cloud. The cloud offers several methods for avoiding downtime, such as
auto-scaling, continuous deployment, continuous monitoring, and more. However,
failures in the system can occur regardless of whether the preventive techniques fulfill
their purpose. Monitoring a system gives insight into its state, but it is up to the
maintainer to interpret this information. This thesis investigates how machine learning
can predict future Kubernetes pod crashes based on the monitoring metrics of each pod.
No data on pod crashes was available at the start of the project. The solution was to
simulate a 10-tier microservice system in a Kubernetes cluster to generate generic data.
This project applies two different models, a Random Forest model and a Temporal
Convolutional Networks model, of which the former is a baseline model. The models
predict whether a crash occurs within a time interval based on a 15-minute interval of
data. The thesis evaluated three different prediction intervals. The 5-minute prediction
interval resulted in the best forecast based on the models' precision. The Random Forest
model achieved an accuracy of 73.4%, while the TCN model achieved an accuracy of
77.7%. This prediction can be used as an early warning of an incoming failure that the
system or the maintainer can act on to improve the system's availability.

Acknowledgement

Throughout the writing of this thesis, we have received a great deal of support and
shared knowledge. We would like to thank our supervisor Johan Hernefeldt at Telia,
whose guidance was invaluable for the result of the thesis. Your expertise in the area and
your ability to share and exhibit its value have given us invaluable insight and knowledge
to understand modern technologies.

Contents

1   Introduction

2   Background
    2.1   Cloud Native's origin
    2.2   Orchestration Framework - Kubernetes
    2.3   Monitoring in Cloud Environment
    2.4   Metrics in a Cloud Environment
    2.5   Resilience Engineering
    2.6   Chaos Engineering

3   Theory
    3.1   Artificial Neural Network
    3.2   Convolution Neural Networks
          3.2.1   Temporal Convolutional Networks
          3.2.2   Transfer learning
    3.3   Decision trees
          3.3.1   Ensemble learning - Random Forest
    3.4   Regularization techniques
          3.4.1   Label Smoothing
          3.4.2   One cycle training
          3.4.3   Dropout
          3.4.4   Layer normalization

4   Related work
    4.1   Predicting Node Failure in Cloud Service Systems
    4.2   Failure Prediction in Hardware Systems
    4.3   System-level hardware failure prediction using deep learning
    4.4   Predicting Software Anomalies using Machine Learning Techniques
    4.5   Netflix ChAP - The Chaos Automation Platform

5   Methodology
    5.1   AWS Cluster
          5.1.1   Boutique - Sample Application By Google
          5.1.2   System monitoring - Prometheus
          5.1.3   Locust - load framework
          5.1.4   Provoking pod failures
    5.2   Data
          5.2.1   Collecting data
          5.2.2   Prometheus queries
          5.2.3   Data formatting
    5.3   Classification
          5.3.1   Data loaders
          5.3.2   Random Forest model
          5.3.3   Temporal Convolutional Networks model
    5.4   Evaluation

6   Result
    6.1   Data set
    6.2   Classification

7   Discussion
    7.1   Data generation
    7.2   Choice of monitoring metrics
    7.3   Data pipeline
    7.4   Classification
          7.4.1   Random Forest model
          7.4.2   Temporal Convolutional Networks model
    7.5   Improving resilience with failure prediction

8   Conclusion

9   Future work

A   Data set
    A.1   Value distribution
          A.1.1   Crash data samples
          A.1.2   Healthy data samples

1    Introduction

The requirements for modern distributed systems continue to increase, and to manage
the change, systems are migrated to cloud-native architecture to achieve numerous ad-
vantages such as low cost, scalability, and robustness [25]. Achieving full availability of
the system is a desired characteristic but is no easy task. Full availability means that a
system can serve a customer at any given point in time. In other words, there will be no
downtime. In recent years, resilience engineering and chaos engineering have changed
how we think about building distributed enterprise solutions. These methodologies
encourage developers to continuously build systems that are robust, scalable, and
safe. However, there are more primitive techniques to improve availability, such as
scaling the system as the traffic increases [25]. Scaling includes increasing the number
of server instances, increasing resource limits, and more [30].
Another method is to use telemetry data to improve the observability of the system [30].
The information from telemetry data provides insights into the current state of the sys-
tem. The insights need to be interpreted correctly in order to know if an availability
degradation will occur. A system's telemetry data could consist of metrics such as
requests per second (RPS), error rate, request latency, and request duration [30]. However,
the value provided by the metric depends on the behavior and architecture of the sys-
tem. The Four Golden Signals, RED Method, and USE Method are three generic sets
of metrics to monitor a system and will suffice in many situations [2][3][14].
A maintainer of the system interprets the telemetry data, and it is up to that individ-
ual to determine the state of the system. Human error can occur in the interpretation
of telemetry data, which could lead to availability degradation. Manual monitoring re-
quires extensive human resources as it scales with the system’s size and the number
of instances. Autonomous techniques free up human resources, but auto-scaling and
similar preemptive techniques use constant thresholds to determine if a service is being
exhausted. So instead of interpreting the data, the maintainer has to understand and
maintain the system's thresholds, which merely shifts the original manual effort.
This thesis investigates the research question: Is it possible for a machine learning model
to predict Kubernetes pod failures based on telemetry data? This prediction could then
warn the maintainer and aid in preventing the predicted pod failure.
At the start of this project, no relevant data was freely available. Simulation of a
10-tier microservice solution over an extensive period to provoke pod crashes resulted
in more than 1270 crashes. The simulation took place in a Kubernetes cluster with the
help of Prometheus, and the telemetry data were constantly stored, consisting of the
metrics presented in Section 5.1.2. This data was then formatted to time windows and
labeled as either crash or healthy data samples.
The project applies two different models, a Random Forest model and a Temporal
Convolutional Networks model, where the former acted as a baseline model. The
Random Forest model needs minimal hyperparameter optimization and requires little
setup, making it well suited as a baseline. The baseline model's performance was the
target for the more complex Temporal Convolutional Networks model. The input to these
models is a 15-minute time series of telemetry data with a 15-second interval between
each metric measurement. The models then predict whether a failure will occur within
the prediction time window. The thesis explores different prediction time windows to
analyze how they impact the accuracy and precision of the predictions.
As healthy data samples occur throughout the whole day, the collected data is highly
imbalanced. Undersampling rebalanced the data set to a 50/50 balance between the
classes.
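
To make the data formatting and rebalancing concrete, the sketch below shows one way
such a pipeline could look in Python. It is illustrative only and not the thesis's actual
code: the names make_samples and undersample, the array layout, and the use of NumPy
are assumptions. A window of 60 steps corresponds to 15 minutes of metrics at
15-second resolution, and a horizon of 20 steps to a five-minute prediction time window.

    import numpy as np

    WINDOW = 60    # 15 minutes of history at 15-second resolution
    HORIZON = 20   # 5-minute prediction time window (20 * 15 s)

    def make_samples(metrics, crash_indices):
        """metrics: (T, F) array of telemetry samples; crash_indices: crash time steps."""
        X, y = [], []
        crash_set = set(crash_indices)
        for t in range(WINDOW, len(metrics) - HORIZON):
            X.append(metrics[t - WINDOW:t])
            # Label 1 if any crash falls inside the prediction time window.
            y.append(int(any(t + k in crash_set for k in range(1, HORIZON + 1))))
        return np.array(X), np.array(y)

    def undersample(X, y, seed=0):
        """Randomly drop healthy samples until the classes are balanced 50/50."""
        rng = np.random.default_rng(seed)
        pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
        keep_neg = rng.choice(neg, size=len(pos), replace=False)
        idx = rng.permutation(np.concatenate([pos, keep_neg]))
        return X[idx], y[idx]
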
The five minutes prediction time window resulted in the best accuracy and precision for
the two models. The Random Forest model achieved an accuracy of 73.4 %, and the
temporal convolutional model achieved an accuracy of 77.7 %. However, the difference
in precision was more noticeable, where the Random Forest model achieved a precision
of 78.9 %, and the Temporal Convolutional Networks model achieved a precision of
96.4 %.

2     Background

Telia offers enterprise customers a contact center solution, designed for seamless cus-
tomer meetings and intelligent conversations, called ACE. ACE is market-leading in
the Nordic and the Baltic states. This solution allows enterprises to communicate with
their customers over a variety of communication channels. This unique feature gives
the operator of an event a richer picture of the current state, as the system aggregates
information from multiple channels into a single source of truth, an omnichannel [8].
To be competitive in this industry, the product has to provide full availability, where the
system is available to a user at any given time. Unavailability may cause the customer
to experience an interruption in the service and, in the worst case, a complete outage of the
service. Therefore, in recent years Telia ACE has started a migration to a cloud-native
microservice architecture to improve the system’s availability and resilience.
The transition to the cloud has provided solutions for updating services without downtime,
but avoiding failing services is still an issue. Kubernetes and related orchestration
frameworks provide continuous health checks to analyze the state of the services, but
there is still no standard approach to predict a failing service and its cause of failure.

2.1    Cloud Native’s origin

The current understanding of the term cloud-native originates back to 2012 and from
research rather than industry [27]. There exists no precise definition of what a cloud-
native application (CNA) is. However, there are characteristics of a CNA that are
commonly understood by researchers. A base characteristic is that a CNA must be
operated on an automated platform with migration and interoperability in mind. Achieving
the previously mentioned characteristic enables a CNA architecture that consists
of service-based applications. These service-based applications possess characteristics
such as horizontal scalability, elasticity, and resiliency. There are pattern-based
methods, such as Kubernetes [1], to achieve a CNA architecture. These pattern-based
methods are used to automate the process of delivery and infrastructure changes [27].

2.2    Orchestration Framework - Kubernetes

The orchestration framework Kubernetes provides support to run distributed systems
resiliently. Kubernetes offers features similar to PaaS (Platform as a Service), such as
deployment, scaling, and load balancing, and allows users to integrate logging, monitoring,
and alerting solutions. The distinction from PaaS is that these default solutions
are optional and pluggable. Kubernetes yields the fundamental base for the platform's
development and preserves user choice and flexibility where it is crucial [1].
A Kubernetes cluster consists of nodes (worker machines) that run the containerized
environment. The nodes host the pods, which are the components of the application
workload. A pod can host multiple containerized applications, and pods can be scheduled
across multiple physical machines, enabling scaling with a dynamically changing workload [1].

Figure 1 Illustration of a Kubernetes cluster

2.3    Monitoring in Cloud Environment

Monitoring complex distributed systems plays a crucial role in every aspect of a software-
oriented organization [47]. Obtaining, storing, and processing information about a system is a
difficult task. The raw data may not yield any insights on its own; its value comes
when it is processed. If a system achieves effective monitoring, it can help eliminate
performance bottlenecks and security flaws and aid the engineers in making informed
decisions to improve the system. The design of monitoring tools for cloud computing is
still an under-researched area. There is no standard monitoring technique in cloud
environments. All existing techniques have their benefits and drawbacks [47].
Figure 2 visualizes a three-stage process on how to monitor. The system collects relevant
data from the target’s current state. Another component analyzes the data, produces
visualizations and alerts to the operator, and sends the results to a decision engine that
executes preemptive actions to sustain a healthy environment [47]. This automated
decision engine can handle faults that it has been engineered to detect but stumbles on
out-of-domain faults. The most common analysis technique in monitoring systems is
threshold analysis. Threshold analysis consists of continuously comparing metric values
with their respective predefined conditions. If a metric violates its predefined condition,
the monitoring system raises an alert. The alerts the decision engine can resolve are
primarily trivial issues, as more complex issues may have an underlying origin outside
the analyzed metrics. The most trivial automatic recovery mechanism is to
terminate the faulty virtual machine (VM) and initialize a replacement [47].
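
As an illustration of threshold analysis, the sketch below compares metric values against
fixed limits and raises alerts. The metric names and threshold values are invented for the
example; they are not taken from the thesis or from any specific monitoring system.

    # Minimal sketch of threshold analysis with hand-picked limits (assumptions).
    THRESHOLDS = {"cpu_usage": 0.90, "error_rate": 0.05, "latency_p99": 1.5}

    def check_thresholds(metrics):
        alerts = []
        for name, limit in THRESHOLDS.items():
            value = metrics.get(name)
            if value is not None and value > limit:
                alerts.append(f"ALERT: {name}={value} exceeds threshold {limit}")
        return alerts

    print(check_thresholds({"cpu_usage": 0.95, "error_rate": 0.01, "latency_p99": 0.4}))
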
The other option, and the most common, is that the engineers will manually respond to
the alert with an informed decision. Finding an appropriate response is a challenging
task. The engineers have to analyze the current state, which requires them to consider
all known states, identify the issue, and then form a set of actions to resolve it, which
requires significant operations personnel [47].

Figure 2 Monitoring Process

Though monitoring plays a crucial role in maintaining distributed systems, developing
an effective monitoring strategy is extremely difficult - a monitoring strategy states how
to gather and interpret information from the system. The vast majority of monitoring
strategies are devised during the design phase and revised throughout the development
of the system. The monitoring strategy is thus always one step behind as it adapts to
an ever-changing environment. As long as the system changes before the monitoring
strategy, the strategy will always be an insufficient one, derived from a system that
no longer exists in its original form. Many high-profile outages were possible because
monitoring strategies failed to detect anomalies and thus prevented engineers from
acting preventively to avoid the outage. Monitoring strategies detect only the anomalies
that the engineers have predefined and thus suffer from a limited range of detectable
anomalies [47].

In recent years, two disciplines have appeared that address this issue. Resilience and
chaos engineering advocate continuous verification of the system to limit and unveil the
system’s anomalies. These two approaches are further explained in sections 2.5 and 2.6.

2.4    Metrics in a Cloud Environment

As mentioned before, choosing metrics is a challenging task, and regardless of the chosen
strategy, there is no assurance that it can detect future issues. However, some
sets of metrics have been popular among monitoring strategies and proven to be useful
in many systems. These are the Four Golden Signals, the RED Method, and the USE
Method.
Google’s SRE teams introduced the four golden signals. The metrics are: latency, traffic,
errors, and saturation [14].

    • Latency is the time it takes to serve a request, from when the client sends the
      request to when the response is received at the client [14].

    • Traffic is a measurement of the current demand on the system. The observed
      metric depends on the system's functionality. For an HTTP service, the metric is
      usually requests per minute. For a streaming system, the measurement might be
      the network I/O rate [14].

    • Errors are the rate of requests that fail. The engineering team needs to define what
      an error is. It could be an HTTP response code or perhaps a policy for requests,
      e.g., requests with latency over 1 second are classified as errors [14].

    • Saturation is a measurement of how full the service is. It provides information
      on how much load the system can handle. This measurement usually observes
      the extra work that the service cannot handle, which results in errors or delayed
      responses [14].

The next set of metrics is the RED Method. The RED Method takes inspiration from the
four golden signals but excludes saturation. The author of the RED Method excludes
saturation with the motivation that it only applies to advanced use cases. The three metrics
that the RED Method consists of are rate, errors, and duration, where rate is
equivalent to traffic and duration is equivalent to latency. The RED Method predefines that
these three metrics observe HTTP requests and is thereby only applicable to request-
driven services. These three metrics combined are sufficient to monitor a vast majority
of services [2].
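
For illustration, the sketch below derives the three RED metrics from a small list of HTTP
request records. The record fields and the 60-second window are assumptions made only
for this example.

    # Sketch: deriving rate, errors, and duration from HTTP request records.
    requests = [
        {"status": 200, "duration_s": 0.12},
        {"status": 500, "duration_s": 0.80},
        {"status": 200, "duration_s": 0.25},
    ]
    window_s = 60.0

    rate = len(requests) / window_s                                      # requests per second
    errors = sum(r["status"] >= 500 for r in requests) / len(requests)   # failing fraction
    duration = sum(r["duration_s"] for r in requests) / len(requests)    # mean latency

    print(f"rate={rate:.3f} rps, errors={errors:.2%}, mean duration={duration:.2f} s")
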

The USE Method contains the metrics utilization, saturation, and errors. This methodology
analyzes the performance of a system. In comparison with the RED Method, which
emphasizes the client experience, the USE Method emphasizes the performance of the
service's underlying hardware [3].

    • Utilization is the average time that the service was busy servicing work [3].

    • Saturation is the degree to which the service has extra work that it cannot
      service, often queued [3].

    • Errors is the count of error events [3].

2.5    Resilience Engineering

The book [48] defines resilience engineering as ”Resilience engineering is a paradigm
for safety management with a focus on how to help people cope with complexity under
pressure to achieve success”. In other words, resilience describes
how well the system can manage unpredicted issues. Robustness, on the other hand,
refers to system designs that handle predicted issues [10].
Resilience engineering appears in numerous industries in diverse forms, e.g., aviation,
medicine, space flight, nuclear power, and rail. Resilience is critical to have in these
industries to avoid catastrophic failures, or even casualties [10]. However, resilience is
viewed differently within cloud engineering: the mentioned industries require personnel
to follow strict procedures to prevent issues, which is uncommon within cloud
engineering. One aspect of resilience engineering shared throughout the industries
is the increased adoption of automation. Automation introduces challenges, and they
are the topics of numerous resilience engineering papers [10].
Resilience engineering stretches not just over systems but also over organizations. Resilience
concerns the ability to identify and adapt to unpredicted issues by changing
any relevant factor. This factor could be a software change or changes within the
organization, such as modifications to processes, strategies, and coordination [48].

2.6    Chaos Engineering

Chaos engineering is a relatively new discipline within software development that has
emerged to improve robustness and resilience within systems. This discipline origi-
nates back to Netflix in 2008 when they moved from the data center to the cloud. At
that time, Amazon Web Services (AWS) was considerably less sophisticated than now
[26]. Cloud computing suffered from various defects and failures, such as instances that
would blink out of existence without warning. Therefore, a system had to be resilient
to cope with these failures. The vanishing instances gave rise to numerous practices to
deal with them automatically, but Netflix could not adopt these practices due to its unique
management philosophy, where the engineering teams were highly aligned and loosely
coupled. There was no mechanism to issue an edict to the entire engineering organization
demanding that it follow these practices [26].
Netflix then introduced Chaos Monkey [26]. Chaos Monkey is an application that shuts
down a random instance without warning, one in each cluster, during business hours,
and the process repeats every day. This feature proactively tested all engineering team’s
resilience, and each team had to apply methods to adapt to these unexpected failures
[26]. After a devastating region failure within AWS that affected Netflix and others,
Netflix introduced Chaos Kong, in response, that disables a whole region, and thus each
engineering team was required to adapt to this kind of failure [26].
The Chaos Engineering team at Netflix created the Principles of Chaos Engineering
which is the fundamental core of Chaos Engineering [26][9]. The definition of Chaos
Engineering follows: ”Chaos Engineering is the discipline of experimenting on a system
to build confidence in the system's capability to withstand turbulent conditions in
production” [9]. Chaos Engineering builds on the premise that some form of failure can
always be provoked within a system.

3     Theory

In the following sections, the theory required to perform the methods of the project is
presented.

3.1    Artificial Neural Network

Neural networks originate back to 1958, when Frank Rosenblatt invented what he called
a Perceptron [35]. The Perceptron is based upon the concept of artificial neurons called
linear threshold units (LTU) [19]. A neuron comprises several inputs, weights, and an
output, where each input is associated with a weight. The LTU calculates the weighted
sum of its inputs as follows:

    z = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n = \sum_{i=1}^{n} w_i x_i

The LTU's output is the weighted sum passed through a step function:

    h_w(x) = \mathrm{step}(z) = \mathrm{step}\left( \sum_{i=1}^{n} w_i x_i \right)


See Figure 3 for a graphical overview of an LTU.

Figure 3 Linear Threshold Unit (LTU)

The Heaviside step function is the most common step function used in the perceptron:

    \mathrm{heaviside}(z) = \begin{cases} 0, & \text{if } z < 0 \\ 1, & \text{if } z \geq 0 \end{cases}

In other cases, the sign function is used:

    \mathrm{sgn}(z) = \begin{cases} -1, & \text{if } z < 0 \\ 0, & \text{if } z = 0 \\ 1, & \text{if } z > 0 \end{cases}

The linear threshold unit computes a linear relationship between the inputs, and the
perceptron can solve trivial linear binary classification tasks when it utilizes multiple
LTUs [19]. A perceptron consists of two fully connected layers, an input layer and a
layer of LTUs. The input layer consists of input neurons and a bias neuron. The input
neuron forwards the model’s input to the LTUs, and the bias neuron constantly forwards
a 1 to shift the activation function of the LTUs by a constant factor [19]. The LTU layer
consists of multiple units, with weights for the inputs and the bias neuron. See Figure
4 for an overview of the Perceptron. The Perceptron in the figure can classify three
different binary classes based on two inputs.

Figure 4 Perceptron

The Perceptron training algorithm, proposed by Frank Rosenblatt, fits the weights of
the LTUs for a given training instance [35]. The algorithm feeds the perceptron with an
input and compares the output to the target for that input. The weights are then
re-calibrated based on the difference between the output and the target [19]. The equation
for updating a weight follows (a small code sketch is given after the list below):

    w_{i,j}^{(\text{next})} = w_{i,j} + \eta (y_j - \hat{y}_j) x_i

    • w_{i,j} is the weight between the ith input neuron and the jth output neuron.

    • \eta is the learning rate.

    • \hat{y}_j is the output of the jth output neuron for the current training instance.

    • x_i is the ith input value of the current training instance.

    • y_j is the target output of the jth output neuron for the current training instance.
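
A minimal sketch of one such training step, assuming a Heaviside step activation and
NumPy arrays; the function names and shapes are illustrative, not taken from [19] or [35].

    import numpy as np

    def heaviside(z):
        return (z >= 0).astype(float)

    def train_step(W, b, x, y, lr=0.1):
        """W: (n_outputs, n_inputs), b: (n_outputs,), x: (n_inputs,), y: (n_outputs,)."""
        y_hat = heaviside(W @ x + b)
        W += lr * np.outer(y - y_hat, x)   # w_ij += eta * (y_j - y_hat_j) * x_i
        b += lr * (y - y_hat)              # the bias neuron forwards a constant 1
        return W, b
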

Perceptrons are incapable of solving even trivial problems such as the Exclusive OR (XOR)
classification problem [19]. The Multi-Layer Perceptron (MLP) solves numerous problems
that Perceptrons have. The architecture of the MLP consists of stacking layers on top
of each other; one layer's output is another layer's input. The MLP consists of
one input layer, one or more hidden layers of LTUs, and one final output layer of LTUs.
See Figure 5 for a graphical overview. If the number of hidden layers exceeds one, the
architecture is called a deep neural network (DNN). The MLP is capable of solving more
complex classification problems than a single Perceptron [19].

Figure 5 Multi-Layer Perceptron

The Multi-Layer Perceptron suffered from insufficient training algorithms, which resulted
in bad performance. In 1986, D. E. Rumelhart et al. [36] published a paper introducing the
backpropagation training algorithm. The algorithm measures the network's output error,
the difference between the network's desired output and its actual output. The measurement
of the output error consists of computing how much each neuron contributed to each
output neuron's error. These computations are recursive: they begin by computing the
output error contribution of each neuron in the last hidden layer and proceed back to the
input layer. This establishes a measurement of the error gradient across all weights.
Finally, a Gradient Descent step is applied with the previously measured error gradients
to tweak the connection weights and reduce the output error [19].

With backpropagation, the step function was replaced by new activation functions. There
is no gradient to work with in the step function because it consists only of flat surfaces.
The new activation functions consist of surfaces with gradients, and these functions can
be either linear or non-linear. Two popular activation functions are the hyperbolic tangent
function and the ReLU function [19].
When a Multi-Layer Perceptron is applied to classification problems, the output layer's
activation function is a shared softmax function. The softmax function transforms the
input values so that the sum of the values equals one. Each output value represents the
probability of the corresponding class [19].
Loss functions enable neural networks to learn and are used within the backpropagation
algorithm to determine how faulty an output is [37]. The purpose of training networks
is to find connection weights that lead to the lowest value of the loss function. In other
words, the loss function indicates the poorness of a neural network's ability. One common
loss function is the Mean Squared Error (MSE) function:

    E = \frac{1}{2} \sum_{k} (y_k - t_k)^2

where y_k is the output of the network, t_k is the labeled data, and k runs over the
dimensions of the data. The result of the loss function determines how much the
connection weights to the activated neurons should be adjusted during training [37].
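
A small, purely illustrative sketch of the MSE loss above in Python:

    import numpy as np

    # E = 0.5 * sum_k (y_k - t_k)^2
    def mse_loss(y, t):
        return 0.5 * np.sum((np.asarray(y) - np.asarray(t)) ** 2)

    print(mse_loss([0.8, 0.2], [1.0, 0.0]))  # 0.5 * (0.04 + 0.04) = 0.04
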

3.2    Convolution Neural Networks

Convolutional Neural Networks (CNN) are an advancement of the Multi-Layer Perceptron.
The evolution is a result of image analysis. An MLP model for image analysis
would need input neurons for each pixel and each of its dimensions, which would
result in large networks, even for small images [32]. For example, a 25x25 RGB image
consists of 625 pixels, and each pixel has three dimensions for the colors red, green, and
blue. The required number of weights for each hidden neuron is 1875, and each hidden
layer consists of 1875 neurons. The number of weight variables increases by the number
of layers as they are fully connected, making deep neural networks unsustainable.
Convolutional neural networks solve this by weight sharing. To enable weight sharing,
CNN utilizes a different architecture than MLP. A CNN model consists of layers. These
layers transform the input and forward the output to the following layer. Common
transformations of a CNN are convolutions, activation layers, pooling layers, linear
layers, and layers that apply regularization techniques [32]. Regularization counteracts
overfitting, and common layers are batch normalization and dropout.
Convolutional layers apply a kernel to the input and output an activation map. The
activation map consists of the sums of multiple element-wise multiplications between the
input and the kernel [32]. The kernel slides over the input, as seen in Figure 6. The size of
the kernel, stride (how far the kernel slides in each step), and padding determine the
shape and receptive field of the output. The receptive field is the input values that are
multiplied with the kernel to obtain the output value. The trainable parameters of a
convolutional layer are the variables of the kernel. The kernel’s variables are shared for
the whole input (weight sharing) [21], thereby making the model size depend on the
architecture and not the input.

Figure 6 The figure shows two element-wise multiplications and summations under
convolution with a 2-dimensional input. The convolution involves an input of shape
5x5 and a 3x3 kernel with no padding. The activation map’s shape is 3x3.
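
The convolution in Figure 6 can be sketched in a few lines of Python. The implementation
below, with an arbitrary input and an all-ones kernel, is purely illustrative and assumes
stride 1 and no padding.

    import numpy as np

    def conv2d(inp, kernel):
        """Slide the kernel over the input and sum element-wise products."""
        kh, kw = kernel.shape
        oh, ow = inp.shape[0] - kh + 1, inp.shape[1] - kw + 1
        out = np.zeros((oh, ow))
        for i in range(oh):
            for j in range(ow):
                out[i, j] = np.sum(inp[i:i + kh, j:j + kw] * kernel)
        return out

    activation_map = conv2d(np.arange(25).reshape(5, 5).astype(float), np.ones((3, 3)))
    print(activation_map.shape)  # (3, 3), as in Figure 6
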

Activation and pooling layers are parameter-free - no optimization applies in the layers.
The layers transform the input by a fixed function that only depends on hyperparameters.
Rectified linear activation and sigmoid activation are two examples of activation layers,
and the transformation is element-wise. Pooling layers reduce the shape of the input
[32]. For example, the max-pooling layer applies a filter that extracts the maximum
value in its receptive field, and the filter slides over the input like a kernel. Reducing the
shape of the output counteracts the overfitting of the model [32].
Linear layers are fully connected layers and act similar to hidden layers in MLPs. They
have weights associated with each input, a bias, and an output with a specified shape
[32]. Linear layers can act as a final layer of the model, as its output size is adjustable.
The output size can represent the classes the model should predict.
The inputs of CNN models are matrices of values. The number of dimensions depends
on the data and if batch learning is applied. An image, for example, would have height,
width, and color spectrum as dimensions, seen in Figure 7. Batch learning increases the
input dimension as it feeds the model with a mini-batch (multiple data samples) at once
and calculates the output in parallel with the help of linear algebra, resulting in improved
training time. The previous example’s dimensions would be batch size, height, width,
color spectrum, seen in Figure 7.

Figure 7 Two input examples. To the left in the figure, a three-dimensional image with
5 pixels in height, 5 pixels in width, and RGB as the color spectrum. To the right in the
figure, a four-dimensional input that is n images in a mini-batch.

3.2.1   Temporal Convolutional Networks

Temporal Convolutional Networks (TCN) is an architecture designed for sequential
modeling. TCN emphasizes a flexible receptive field, stable gradients, and parallelism [28,
12]. It distinguishes itself in that its convolutions are causal and that the model can take
an input of any length and produce an output of the same size [12]. TCN makes use of
residual blocks similar to ResNet [20], with the distinction that each block applies dilated
causal convolution.
Causal convolution is convolution with an input sequence i_0, . . . , i_t and an output
y_0, . . . , y_t, where y_i only depends on the inputs i_0, . . . , i_i for 0 ≤ i ≤ t.
Causal convolution leads to no data leakage across the time dimension, i.e., no look-ahead
bias [46, 12], which would happen if standard convolution were applied. The two
convolutions can be seen in Figure 8.

Figure 8 To the left: standard convolution with 1 padding. To the right: causal convolution
with 2 padding. Both use a kernel of size 3 × 1 and have an input and output size
of 8 × 1. In the standard convolution, the output at y_1 depends on i_0, i_1, i_2, thus leading
to data leakage over the time dimension.

Dilated convolution is convolution with a dynamic receptive field. The receptive field
depends on the dilation factor d and the kernel size k [46]. Convolution with a dilation
factor d = 1 is equal to standard convolution, and its receptive field is k. When d > 1,
there is a space of d − 1 between each input to the kernel, and thus the receptive field is
(k − 1) · d + 1. In Figure 9, dilated convolution occurs with a dilation factor of 2, and each
output y_i has a receptive field of 5.

Figure 9 Dilated convolution with k = 3, d = 2, and an input and output size of 8 × 1.
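
A minimal sketch of dilated causal convolution using PyTorch, assuming the common
approach of left-padding the input by (k − 1) · d so that the output has the same length as
the input and no output depends on future time steps. The layer sizes are illustrative.

    import torch
    import torch.nn as nn

    k, d = 3, 2
    conv = nn.Conv1d(in_channels=1, out_channels=1, kernel_size=k, dilation=d)

    x = torch.randn(1, 1, 8)                            # (batch, channels, time)
    x_padded = nn.functional.pad(x, ((k - 1) * d, 0))   # pad only on the left
    y = conv(x_padded)
    print(y.shape)                                      # torch.Size([1, 1, 8])
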

Residual blocks were conceived from the empirical finding that adding more layers to
a model leads to degradation of its performance, even if the layers are identity mappings
[20]. The purpose of residual blocks is to learn modifications to the identity mapping
[12] by splitting the input x into an identity branch and a transformation branch. The
identity branch forwards x to the end of the block, and the transformation branch applies
a set of transformations F = {f_0, . . . , f_n} to x. If the shapes of F(x) and x differ,
the identity branch applies a 1 × 1 convolution to make them compatible when summed
in the output [12, 21]. The branches are joined and passed through an activation function
σ, such as the Rectified Linear Unit. The output is y = σ(x + F(x)),
and thereby the weights of F are updated to minimize the difference between x and y.
Dilated causal convolution in combination with residual blocks makes the TCN model
stable when handling long time series. A common practice for dilated convolution is to
increase the dilation factor exponentially for each layer (block for TCN) [12, 21], and
with the stability of residual blocks, the network’s receptive field can include large time
series.
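
The sketch below combines the two ideas into a TCN-style residual block. The exact
block structure (two dilated causal convolutions with ReLU and dropout, plus a 1 × 1
convolution on the identity branch when channel counts differ) is an assumption based
on the description above, not the thesis's implementation.

    import torch
    import torch.nn as nn

    class CausalConv1d(nn.Module):
        def __init__(self, c_in, c_out, k, d):
            super().__init__()
            self.pad = (k - 1) * d
            self.conv = nn.Conv1d(c_in, c_out, k, dilation=d)

        def forward(self, x):
            # Left-pad so the output keeps the input length and stays causal.
            return self.conv(nn.functional.pad(x, (self.pad, 0)))

    class ResidualBlock(nn.Module):
        def __init__(self, c_in, c_out, k=3, d=1, p_drop=0.1):
            super().__init__()
            self.branch = nn.Sequential(
                CausalConv1d(c_in, c_out, k, d), nn.ReLU(), nn.Dropout(p_drop),
                CausalConv1d(c_out, c_out, k, d), nn.ReLU(), nn.Dropout(p_drop),
            )
            self.downsample = nn.Conv1d(c_in, c_out, 1) if c_in != c_out else nn.Identity()
            self.relu = nn.ReLU()

        def forward(self, x):
            return self.relu(self.downsample(x) + self.branch(x))  # y = act(x + F(x))

    block = ResidualBlock(c_in=16, c_out=32, k=3, d=2)
    print(block(torch.randn(4, 16, 60)).shape)  # torch.Size([4, 32, 60])
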

3.2.2   Transfer learning

Transfer learning is the use of an existing model for a new task. There are three possible
advantages of using a transfer learning model compared to a model trained from
scratch: higher initial performance of the model, lower training time, and higher final
performance [31].
Bozinovski et al. [15] defined the concept of transfer learning in the mid-1970s, but
the method's breakthrough came in recent years as neural networks got deeper and more
expensive to train. Transfer learning is widely adopted in image recognition, as it is
favorable for convolutional neural networks with a high number of trainable parameters.
During transfer learning, a model M is trained on a task T_o with data set D_o. M consists
of a body and a head. The body is the model architecture, and the head is the last layer(s)
of the model. The head is unique to the data set, as its output shape depends on the data
set's class cardinality.
Model M can be applied to a new task T_n with data set D_n. Before applying the model
to the new task, the head has to be removed, as it predicts the classes of D_o and not D_n
[22]. A new layer L with output shape B × C replaces the head, where B is the batch
size and C is the cardinality of classes in D_n. In the early stage of training, M's body is
frozen, i.e., its parameters are not changed as they are already fit to recognize patterns
in a similar task. The body can be unfrozen when the newly added layer(s) converge, to
make the model more coherent with the new task [21].
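
A minimal sketch of this procedure in PyTorch, using a torchvision ResNet-18 purely as
an example of a pretrained body; the model choice, layer names, and two-class head are
assumptions made for illustration.

    import torch.nn as nn
    from torchvision import models

    num_new_classes = 2
    model = models.resnet18(pretrained=True)

    for param in model.parameters():          # freeze the body
        param.requires_grad = False

    # Replace the head with a new layer sized for the new task's classes.
    model.fc = nn.Linear(model.fc.in_features, num_new_classes)

    # ... train the head until it converges, then unfreeze the body:
    for param in model.parameters():
        param.requires_grad = True
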

3.3     Decision trees

Decision tree algorithms generate trees by evaluating data sets. A generated decision
tree consists of decision nodes and leaves. Each decision node splits the data set and
partitions it into descending decision nodes or leaves. Decision trees predict by the label
of the reached leaf when traversing the tree from an input of attributes [34].
The data set S consists of n samples. A sample is a tuple of attributes and a label, and the
data set is denoted by S = {T_1, . . . , T_n}, where T_i = {A_i, y_i} and A = {a_1, . . . , a_n}.
The attribute a_j's type is either qualitative or quantitative [34]. In classification,
the label y_i is the class instance, and in regression, it is the target value.
There are two groups of decision tree algorithms, top-down and bottom-up algorithms.
The most common decision tree algorithm type is top-down and includes ID3, C4.5,
and CART [49]. Top-down algorithms are greedy algorithms and consist of two phases,
growing and pruning. The growing phase builds the decision tree, and the pruning phase
counteracts the downsides of greedy algorithms, where an optimal split only depends
on the information in the node [34]. Pruning decreases the complexity of the tree and
lowers the chance of overfitting.
The growing phase can be generalized as follows: start at the root node with the whole
data set S. Iterate over the attributes A of S to find the attribute a_j that results in the best
split of S, according to the algorithm's splitting criterion. The split results in m data set
partitions, and m descending nodes are created, one for each of these partitions. Repeat
the splitting procedure in the newly created nodes until the node satisfies one of the
algorithm's stopping criteria. Stopped nodes (leaves) are then labeled based on their
partition of the data set. For classification, the label is the class instance with the highest
occurrence, and for regression, it is a statistical measure such as the mean.
The pruning phase traverses the decision tree from the root node and searches for
branches of succeeding nodes that result in a worse distribution of the samples in the
leaves than the upper node. Found branches are pruned from the tree.
The decision tree classifies an attribute set A by traversing the tree according to its
splitting rules, and the label of the reached leaf is the prediction for the sample.
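
As a usage sketch, the snippet below grows a top-down (CART-style) decision tree with
scikit-learn and classifies by traversing it to a leaf; the data set and hyperparameters are
illustrative, not those used in the thesis.

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    tree = DecisionTreeClassifier(criterion="gini", max_depth=3)  # splitting and stopping criteria
    tree.fit(X, y)                  # growing phase
    print(tree.predict(X[:2]))      # classify by traversing the tree to a leaf
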

3.3.1   Ensemble learning - Random Forest

Condorcet's jury theorem states that if there are n voters and each voter's probability p
of making the correct decision satisfies p > 0.5, then increasing the number of voters leads
to a higher probability that the majority makes the correct prediction. Ensemble learning
makes use of this theorem by combining multiple machine learning models to achieve
higher predictive performance. The ensemble decision is obtained by the majority vote
or by the weighted majority vote for classification models. The weighted majority vote
favors selected models' predictions over others and thus increases the magnitude of
their votes. If all models have the same error rate p and the errors are uncorrelated, the
ensemble's error rate p_ens can be defined as:

    p_{\text{ens}} = \sum_{k = \lfloor T/2 \rfloor + 1}^{T} \binom{T}{k} \, p^k (1 - p)^{T - k}    (1)

T is the number of models, and k denotes the minimum number of models whose
prediction has to be true [50].
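
The sketch below evaluates Equation (1) numerically to show the effect the theorem
describes; the chosen values of T and p are arbitrary examples.

    from math import comb

    # Probability that a majority of T independent models, each with error rate p,
    # are wrong at the same time (Equation 1).
    def ensemble_error(T, p):
        return sum(comb(T, k) * p**k * (1 - p)**(T - k) for k in range(T // 2 + 1, T + 1))

    print(ensemble_error(T=1, p=0.3))    # 0.3   (single model)
    print(ensemble_error(T=11, p=0.3))   # ~0.078, majority voting helps when p < 0.5
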
Random Forest is an ensemble learning method consisting of decision trees. Tin Kam
Ho introduced the Random Forest model in the paper [44] and proposed that the usage
of multiple decision trees could lead to higher predictive accuracy and lower variance.
The paper compared how the splitting rule affected the decision tree's complexity and
proposed creating each decision tree from a random subset of the data set's features [44].
Random feature selection leads to better generalization in the ensemble, because a
decision tree's predictions are invariant to variations in the features excluded from its
feature subset [44], thus making the decision trees more decoupled. Breiman
continued developing Random Forest models and proposed using bagging in combination
with random feature selection [17]. Bagging samples the data set for each
decision tree with replacement [16], which decreases the variance of each model without
affecting the ensemble's bias.
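
For illustration, the snippet below builds a Random Forest with scikit-learn using bagging
and random feature selection; the data set and parameter values are placeholders, not
those used in this thesis.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)
    forest = RandomForestClassifier(
        n_estimators=100,      # T decision trees voting by majority
        bootstrap=True,        # bagging: sample the data set with replacement
        max_features="sqrt",   # random feature subset considered at each split
        random_state=0,
    )
    forest.fit(X, y)
    print(forest.predict(X[:3]))
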

3.4     Regularization techniques

Regularization techniques are methods to counteract overfitting and improve the gener-
alization of the model. Regularization works by injecting noise or randomness to the
model’s pipeline or weights during training.

3.4.1   Label Smoothing

Label Smoothing is a regularization method that applies to classification models. When
training a classification model, the weights are updated based on the loss. The loss
function depends on the difference between the model's output v = {v_1, . . . , v_K} and
the target t = {t_1, . . . , t_K}, where K is the number of classes, v_K is the activation for
class K, and t is the one-hot encoding of the input's class. In the loss function, v
passes through a softmax function

    \sigma(v)_i = \frac{e^{v_i}}{\sum_{j=1}^{K} e^{v_j}}, \quad i = 1, \ldots, K

that transforms v so that

    \sum_{i=1}^{K} \sigma(v)_i = 1, \quad \sigma(v)_i \in [0, 1]

The boundary values 0 and 1 are only obtainable if v_i \rightarrow \pm\infty for an input of class t_i. As
an effect, the model will update its weights to make the output approach infinity and
thereby overfit [43].
Label smoothing solves this issue by modifying the targets the model predicts during
training. The modifications are 0 \rightarrow \varepsilon/N and 1 \rightarrow 1 - \varepsilon/N, where \varepsilon states the uncertainty
of a prediction and N is the number of classes [21].
When training with label smoothing, the model generalizes, as no targets are 0 or 1.
The model will never predict a class with a probability of 0 or 1, and thereby the result of
the softmax function can approach the targets without v_i \rightarrow \pm\infty.
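
A small sketch of the target modification above, assuming one-hot-encoded targets and
the mapping 0 → ε/N, 1 → 1 − ε/N as stated; the names and example values are
illustrative.

    import numpy as np

    def smooth_labels(one_hot, eps=0.1):
        """Replace 0 with eps/N and 1 with 1 - eps/N in a one-hot target."""
        one_hot = np.asarray(one_hot, dtype=float)
        N = one_hot.shape[-1]
        return np.where(one_hot == 1.0, 1.0 - eps / N, eps / N)

    print(smooth_labels([0.0, 1.0]))  # [0.05 0.95] for eps = 0.1 and N = 2
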

3.4.2   One cycle training

1cycle training is a regularization technique that divides the training phase into two
phases, warmup and annealing. In the warmup phase, the learning rate r increases, and
in the annealing phase r decreases back to its original level. Smith stated in [40] that using
too high an r at the start makes the loss diverge, as the randomly initiated weights have not
converged to the given task. Using too high an r at the end makes the model's optimization
method miss local minima. However, using a higher learning rate throughout
the rest of the process makes the model's optimization method converge faster.
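
A minimal sketch of 1cycle training with PyTorch's OneCycleLR scheduler; the model,
optimizer, and values below are placeholders chosen only to show where the warmup and
annealing of the learning rate happen.

    import torch
    from torch import nn, optim

    model = nn.Linear(10, 2)
    optimizer = optim.SGD(model.parameters(), lr=0.01)
    steps_per_epoch, epochs = 100, 5
    scheduler = optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr=0.1, steps_per_epoch=steps_per_epoch, epochs=epochs
    )

    for _ in range(steps_per_epoch * epochs):
        optimizer.zero_grad()
        loss = model(torch.randn(8, 10)).sum()   # dummy loss for illustration
        loss.backward()
        optimizer.step()
        scheduler.step()   # adjusts the learning rate each batch (warmup, then annealing)
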

3.4.3   Dropout

Dropout is a regularization method that randomly drops inputs of the units in the neural
network during training. Dropout multiplies a unit's input by independent variables
from a Bernoulli distribution, where the variables are one with probability p. The
randomness makes the units produce outputs that are generalized and thereby decreases
the variance of the model [41]. When testing the neural network, the units' weights are
multiplied by p to account for the output difference from not dropping inputs.
In convolutional neural networks, the architecture consists of layers rather than units,
and dropout is available through a dropout layer. The layer drops variables in the input
with a probability p during training. When the dropout layer is frozen, it only forwards
its input.
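
A small NumPy sketch of the mechanism described above, assuming p is the keep
probability; it scales the outputs (rather than the weights) by p at test time, which has
the same expected effect.

    import numpy as np

    def dropout(x, p=0.8, training=True, seed=0):
        if not training:
            return x * p                           # account for the undropped inputs at test time
        rng = np.random.default_rng(seed)
        mask = rng.binomial(1, p, size=x.shape)    # 1 with probability p, else 0
        return x * mask

    x = np.ones(10)
    print(dropout(x, p=0.8))                   # some inputs dropped
    print(dropout(x, p=0.8, training=False))   # all outputs scaled by p
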

3.4.4   Layer normalization

Batch normalization and weight normalization are two regularization techniques for
normalizing the layers in a convolutional neural network.
Batch normalization normalizes the layers’ activation values before the non-linearity
function in a neural network. Batch normalization uses statistical values derived from
the activation values for layer i for all samples in the mini-batch [24]. This method does
not comply with correlated time series as the standard deviation (one of the statistical
values) is only linear when the data is uncorrelated.
Weight normalization takes inspiration from batch normalization without introducing a
dependence between the samples in the mini-batches. Weight normalization improves
the training speed, robustness and thus enables higher learning rates. Weight normaliza-
tion reparameterizes the weight matrices to decouple the weight vector’s norm from its
direction. The optimization method is applied to the new parameters, thereby achieving
a speedup in its convergence [38].
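
For illustration, weight normalization can be applied to a layer in PyTorch as sketched
below; the layer sizes are arbitrary. The weight is reparameterized into a direction
('weight_v') and a norm ('weight_g'), which are optimized separately.

    import torch
    from torch import nn
    from torch.nn.utils import weight_norm

    conv = weight_norm(nn.Conv1d(16, 32, kernel_size=3))
    print(sorted(name for name, _ in conv.named_parameters()))  # ['bias', 'weight_g', 'weight_v']

    out = conv(torch.randn(4, 16, 60))
    print(out.shape)  # torch.Size([4, 32, 58])
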

4     Related work

The related work consists of papers on predicting failures in hardware and systems,
as no paper was found that targets Kubernetes pod failures. The following articles inspired the
decisions concerning data representation, how to generate data, and which machine
learning models are suitable for the thesis.

4.1    Predicting Node Failure in Cloud Service Systems

In 2018, Lin et al. [29] published an article that proposed a model for predicting node
failures in a cloud service system. The study discusses how node failures impact a cloud
provider’s availability, and even if the cloud provider has over 99.999 % availability,
the virtual machines suffer from 26 seconds of downtime per month [29]. The paper
presented the model MING, which achieved an accuracy of above 60 % in predicting
a failure in the next day; as a result, the affected virtual machines could be migrated to
healthy nodes, reducing the downtime.
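
For context, the quoted downtime follows directly from the availability figure (assuming
a 30-day month):

    (1 - 0.99999) \times 30 \times 24 \times 3600 \,\text{s} \approx 25.9 \,\text{s} \approx 26 \,\text{s of downtime per month}
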
The study emphasizes the challenge of a highly imbalanced data set and of complex
failure-indicating signals [29]. The data set in the study has an imbalance of 1:1000
between failing and healthy nodes. One conventional method to handle an imbalanced
data set is to re-balance it with sampling techniques. Re-balancing a data set
could increase the recall of a model, but a potential side effect is decreased precision
due to an increase in false-positive predictions [29]. Therefore, the study proposes a
ranking model to address the imbalanced data set. A ranking model ranks a node's
failure-proneness and focuses on optimizing the top k results instead of trying to
separate all positive and negative samples [29], as conventional classifiers do. It is
thereby less affected by an imbalanced data set.
As a node can fail in over a hundred different ways [29] the signals indicating a failure
are complex and might have their origin in multiple sources. The study addresses this
problem by including both temporal and spatial data in the model. Temporal data is
time-series data from the individual node such as performance counters, logs, sensor
data, and OS events [29]. Spatial data includes a node’s relation to other nodes, such as
shared routers, rack location, resource family, load balance group, and more [29]. The
temporal and spatial data are the inputs to the model's sub-models, a Long Short-Term
Memory model and a Random Forest model. The two models' outputs are concatenated
and fed to the ranking model, which generates the output of the whole model. In
conclusion, a model consisting of sub-models that handle data from different sources and
uses a ranking model for its output performs overall better than individual models [29].

4.2    Failure Prediction in Hardware Systems

In 2003, Doug Turnbull and Neil Alldrin published an article [45] that proposed a clas-
sification model that predicts hardware failures. The effect of hardware failures could
be fatal for a vendor. It could lead to high expenses related to damage to equipment,
loss of client confidence, violation of the service agreement between vendor and client,
and loss of mission-critical data [23].
The goal of the model was to obtain a high true-positive rate (tpr) and a low false-positive
rate (fpr). A high tpr could assist the vendor and potentially prevent failures. A low
fpr is vital as failures are rare, and false-positive predictions could trigger expensive
actions by the vendor to handle a failure that does not exist.
The article defines two abstractions for the data: the Sensor Window and the Potential
Failure Window. The Sensor Window is the input of the model, and the Potential Failure
Window is the target of the model. A Sensor Window consists of n entries, and each
entry is an aggregation of raw or processed data from the sensors. The Potential Failure
Window likewise consists of n entries, and each entry states whether a failure occurred in the
hardware during its time period. The number of entries in each window sets its time
period, and each entry represents one minute. The two windows combined form either
a positive feature vector (if a failure occurs) or a negative feature vector (if no failure
occurs).
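The Python sketch below illustrates how such feature vectors could be assembled from
per-minute metrics; the window length of ten entries, the synthetic metric values, and the helper
name build_feature_vectors are illustrative assumptions, not details from the article.

# Minimal sketch of the sensor-window / potential-failure-window scheme.
import numpy as np

SENSOR_WINDOW = 10      # entries (one per minute) used as model input
FAILURE_WINDOW = 10     # entries (one per minute) checked for a failure

def build_feature_vectors(metrics, failure_minutes):
    """metrics: array of shape (T, n_sensors), one row per minute.
    failure_minutes: set of minute indices at which a failure occurred.
    Returns (X, y): sensor-window features and their failure labels."""
    X, y = [], []
    for t in range(len(metrics) - SENSOR_WINDOW - FAILURE_WINDOW):
        sensor = metrics[t : t + SENSOR_WINDOW]                   # model input
        failure_span = range(t + SENSOR_WINDOW,
                             t + SENSOR_WINDOW + FAILURE_WINDOW)  # model target span
        X.append(sensor.flatten())
        # Positive feature vector if any failure falls inside the failure window.
        y.append(int(any(m in failure_minutes for m in failure_span)))
    return np.array(X), np.array(y)

# Example: 8 hours of per-minute data for 5 sensors, with a failure at minute 300.
metrics = np.random.rand(480, 5)
X, y = build_feature_vectors(metrics, failure_minutes={300})
print(X.shape, y.sum(), "positive windows")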
The article concludes with a benchmark of the model on different data sets. The data
sets vary in how many entries the sensor window consists of and whether the sensor data is
raw or processed. The model achieves the best result with a sensor window of ten minutes
of processed sensor data.

4.3    System-level hardware failure prediction using deep learning

Sun et al. [42] conducted a study comparing different machine learning models for predicting
disk and memory failures based on safety parameters. The study developed techniques to
normalize the data (needed because there is no standard for how the attributes should be
logged), a method to train a neural network on an imbalanced data set (hardware failures are
rare events), and a base model for transfer learning that is trained on the normalized data
set and then fine-tuned on data from a specific manufacturer.
The “Self-Monitoring, Analysis and Reporting Technology” (SMART) is a standard that
requires disk manufacturers to record specific attributes throughout a disk’s lifetime.

However, the standard does not state how manufacturers should record the attributes.
Thereby, data from different manufacturers cannot be combined into one data set without
normalization. The study normalizes the data to a unified distribution based on
historical data from both healthy and failed samples. The result is a generic data set
independent of the manufacturer.
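One simple way to map heterogeneous attributes onto a unified distribution is rank-based
normalization against a manufacturer's historical samples; the sketch below illustrates that
idea. The exact normalization used in [42] may differ, so this is an illustrative assumption
only, and the vendor values are invented.

# Sketch: rank-normalize a raw attribute to [0, 1] using historical samples.
import numpy as np

def rank_normalize(train_values, values):
    """Map raw attribute values to [0, 1] by their rank in the historical data."""
    sorted_train = np.sort(train_values)
    ranks = np.searchsorted(sorted_train, values, side="right")
    return ranks / len(sorted_train)

# Example: the same attribute logged on different scales by two manufacturers.
vendor_a_history = np.array([10, 20, 30, 40, 50])
vendor_b_history = np.array([1000, 2000, 3000, 4000, 5000])
print(rank_normalize(vendor_a_history, np.array([35])))    # 0.6
print(rank_normalize(vendor_b_history, np.array([3500])))  # 0.6, same scale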
The study creates a loss function for imbalanced data sets. The loss function is an extension
of the binary cross-entropy loss function: it reduces the loss for correct predictions
and magnifies the loss for misclassified inputs [42]. The extension multiplies the loss with a
coefficient whose value resembles a sign function and depends on the classification of the
input [42].
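A minimal sketch of this idea is shown below: the binary cross-entropy loss of each sample is
scaled down when the prediction is correct and scaled up when it is misclassified. The
coefficient values (0.5 and 2.0), the 0.5 decision threshold, and the function name weighted_bce
are assumptions for illustration; the paper defines its own coefficient.

# Sketch of a coefficient-weighted binary cross-entropy loss.
import numpy as np

def weighted_bce(y_true, y_pred, down=0.5, up=2.0, eps=1e-7):
    y_pred = np.clip(y_pred, eps, 1 - eps)
    bce = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    correct = (y_pred >= 0.5) == (y_true == 1)
    # Shrink the loss when the prediction is correct, magnify it when misclassified.
    coeff = np.where(correct, down, up)
    return np.mean(coeff * bce)

y_true = np.array([1, 0, 0, 1])
y_pred = np.array([0.9, 0.2, 0.7, 0.3])
print(weighted_bce(y_true, y_pred))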
The study provides two benchmarks: first, how different models trained on the normalized
data set perform, and second, how a transfer learning model performs on a specific
manufacturer’s data set. The transfer learning model trains on the normalized data set and
is then fine-tuned on the target manufacturer’s data set.
The study concludes that a temporal convolutional neural network achieves the best
overall performance [42]. The model achieves this when it is trained on the normalized data
set and then fine-tuned on a specific manufacturer’s data set, using the created loss
function during both training phases.

4.4    Predicting Software Anomalies using Machine Learning
       Techniques

This paper introduces a detailed evaluation of a set of machine learning classifiers for
predicting dynamic and non-deterministic software anomalies [11]. The predictions are
based upon monitoring metrics, similar to this thesis. The machine learning methods
evaluated in the paper were: Rpart (Decision Tree), Naive Bayes (NB), Support Vector
Machine Classifiers (SVM-C), K-nearest neighbors (knn), Random Forest (RF), and
LDA/QDA (linear and quadratic discriminant analyses).
The study conducted three scenarios with different failure injections, resulting in
three separate data sets that were used to train the models. The system setup used
to provoke the failures is a multi-tier e-commerce site that simulates an online
book store, written in Java with a MySQL database. Apache Tomcat was utilized as
the application server. The parameters that were altered during the simulations were the
threads and the memory, individually or in combination. Three different definitions were
used to classify the current state of the system [11]:

    • Red zone: the system is at most 5 minutes away from a crash.

    • Orange zone: the system is within 5 minutes of the red zone.

    • Green zone: the rest of the timeline.

These zones are the labels of the time windows used in the data set and thus the
targets of the models. The paper concluded that the Random Forest algorithm had the
lowest validation error rate (less than 1 %) in all three scenarios [11].
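As an illustration of this labeling scheme, the Python sketch below assigns a zone to every
minute of a timeline given a known crash time. The helper name zone_label and the assumption
of one sample per minute are illustrative and not taken from [11].

# Sketch: label each minute of a timeline with its zone relative to a crash.
def zone_label(minute, crash_minute):
    """Label a minute of the timeline as 'red', 'orange', or 'green'."""
    distance = crash_minute - minute
    if 0 <= distance <= 5:          # last 5 minutes before the crash
        return "red"
    if 5 < distance <= 10:          # the 5 minutes preceding the red zone
        return "orange"
    return "green"                  # the rest of the timeline

# Example: label a 30-minute timeline with a crash at minute 25.
labels = [zone_label(m, crash_minute=25) for m in range(30)]
print(labels)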

4.5    Netflix ChAP - The Chaos Automation Platform

Netflix conducts several deployments a day, and each change affects the resilience of
the system [13]. Therefore, relying on previous experiments to verify the service’s
resilience in an ever-changing environment is not a good strategy. From this insight,
the Chaos Automation Platform (ChAP) was born. ChAP is a continuous verification tool
that is integrated with Netflix’s continuous integration and deployment pipeline [13].
It also complies with the advanced principles of chaos engineering [26][9].
ChAP conducts a chaos experiment by creating a new cluster of the service in scope. To
retain availability, it creates hundreds of instances of that service [13]. The experiment
allocates two instances, and the load balancer redirects one percent of the requests to
them. The first instance is exposed to the chaos experiment, and the second acts as a
reference for the steady state of the service [26]. After a completed experiment, ChAP
informs the service owner about its performance. The report may contain required
changes to improve the quality of the service.

5       Methodology

This project touches on a wide range of fields in computer science. In the following
sections, the implementation procedures are defined and described.

5.1     AWS Cluster

Data is generated by exposing a demo application to different levels of load over time.
For this project, the demo application has been deployed on a Kubernetes cluster together
with the following services: Prometheus, Istio, and Grafana. These services provide
observability and an inter-communication abstraction for the application.
The cluster is deployed on Amazon Web Services and has 24 CPU cores and 91 GiB of memory.

5.1.1    Boutique - Sample Application By Google

Online Boutique is Google’s show-case application for cloud-native technologies [4].
It is a 10-tier microservice system written in Python, Golang, Node.js, Java, and C#,
and it utilizes technologies such as Kubernetes, Istio, gRPC, and OpenCensus. Boutique
mimics a modern e-commerce site and includes a load generator written in Locust.
The Boutique is treated as a black box during the generation of data, and crashes are
initiated by increasing the load or injecting latency into the system, not by exploiting
weaknesses of the system. See Figure 10 for an overview of the services and Figure 11
for an overview of the pods and containers of each service. The server is the container
that hosts the application.
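As an illustration of how load can be generated with Locust, the sketch below defines a simple
user that browses the shop front end. The request paths, task weights, and host address are
assumptions made here and do not reproduce the actual Boutique load generator.

# locustfile.py - a minimal Locust sketch for generating load against the front end.
from locust import HttpUser, task, between

class ShopperUser(HttpUser):
    # Wait 1-3 seconds between tasks; lowering this range increases the load.
    wait_time = between(1, 3)

    @task(3)
    def browse_home(self):
        self.client.get("/")

    @task(1)
    def view_cart(self):
        self.client.get("/cart")

# Run with, e.g.:  locust -f locustfile.py --host http://<frontend-address>
# and scale the number of simulated users to raise the load until pods fail.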
