Head Pose Estimation using Deep Learning

                          Master’s thesis

                  in fulfillment of the requirements
                          for the degree of

                        Master of Science

                in Computing in the Humanities
Faculty of Information Systems and Applied Computer Sciences
                      University of Bamberg

 Author:                                                 Supervisor:
 Ines Sophia Rieger                           Prof. Dr. Ute Schmid
 (Matr. No. 1838490)

                        in cooperation with
        Fraunhofer-Institute for Integrated Circuits IIS
                   Group for Intelligent Systems
     supervised by Thomas Hauenstein and Sebastian Hettenkofer

                      Bamberg, April 16, 2018
Abstract
Head poses are an important means of non-verbal human communication and thus a
crucial element in human-computer interaction. While computational systems have
been trained with various methods for head pose estimation in recent years, ap-
proaches based on convolutional neural networks (CNNs) for image processing have
so far proven to be among the most promising. This master's thesis starts off by
improving head pose estimation by reimplementing a recent CNN approach based on
the shallow LeNet-5. As a new approach in head pose estimation, this thesis focuses
on residual networks (ResNets), a subgroup of CNNs specifically optimized for very
deep networks. To train and test the approaches, the Annotated Facial Landmarks in
the Wild (AFLW) dataset and the Annotated Faces in the Wild (AFW) benchmark
dataset were used. The performance of the reimplemented network and the imple-
mented ResNets of various architectures was evaluated on the AFLW dataset. The
performance is measured in terms of mean absolute error and accuracy. Furthermore,
the ResNets with a depth of 18 layers were tested on the AFW dataset. The best
performance of all implemented ResNets was achieved by the 18-layer ResNet adapted
for an input size of 112 x 112 pixels. In comparison with the reimplemented network
and other state-of-the-art approaches, the best ResNet performs equally well or better on the
AFLW dataset and outperforms them on the AFW dataset.

Contents
List of Abbreviations                                                                       iv

List of Figures                                                                              v

List of Tables                                                                             viii

1 Introduction                                                                               1

2 Head Pose Estimation                                                                       3
   2.1   Representation of Head Poses by Euler Angles . . . . . . . . . . . . . .            3
   2.2   Head Pose Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . .        5
   2.3   Methods for Head Pose Estimation . . . . . . . . . . . . . . . . . . . .           11

3 Convolutional Neural Networks (CNNs)                                                      19
   3.1   Training with CNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . .         19
         3.1.1   Forward Propagation . . . . . . . . . . . . . . . . . . . . . . . .        20
         3.1.2   Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . .        23
   3.2   Pre-Processing Image Data . . . . . . . . . . . . . . . . . . . . . . . . .        28
   3.3   Regularization Measures . . . . . . . . . . . . . . . . . . . . . . . . . .        30
   3.4   Residual Networks (ResNets) . . . . . . . . . . . . . . . . . . . . . . .          33

4 Experiments                                                                               39
   4.1   Pre-Processing    . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    39
   4.2   Evaluation Methods      . . . . . . . . . . . . . . . . . . . . . . . . . . . .    41
   4.3   Reimplementation of Patacchiola and Cangelosi [49] . . . . . . . . . . .           43
         4.3.1   Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .      44
         4.3.2   Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    47
   4.4   Implementation of ResNets . . . . . . . . . . . . . . . . . . . . . . . . .        48
         4.4.1   Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .       49
         4.4.2   Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    53

5 Comparison of Approaches                                                                  61
   5.1   Comparison of Results . . . . . . . . . . . . . . . . . . . . . . . . . . .        61

  5.2   Comparison of the Number of Trainable Variable Parameters . . . . . .   64

6 Conclusion                                                                    65

References                                                                      69

A Dataset Histograms                                                            77

B System Specifications                                                         85

C Training Loss                                                                 86

List of Abbreviations
   AI    Artificial Intelligence
 AFLW    Annotated Facial Landmarks in the Wild (dataset)
 AFW     Annotated Faces in the Wild (dataset)
   BN    Batch Normalization
  CNN    Convolutional Neural Network
H-CNN    Heatmap-Convolutional Neural Network
  LRN    Local Response Normalization
  MAE    Mean Absolute Error
  MLP    Multilayer Perceptron
  PCA    Principal Component Analysis
  POS    Pose from Orthography and Scaling
 RELU    Rectified Linear Unit
ResNet   Residual Network
   SD    Stochastic Descent
  SGD    Stochastic Gradient Descent
  SOP    Scaled Orthographic Projection

List of Figures
  1    Tait-Bryan angles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .      4
  2    Head orientation with yaw, pitch and roll angle [3] . . . . . . . . . . . .        4
  3    Example of cropped and scaled Prima dataset images . . . . . . . . . .             6
  4    Prima dataset measurement setting . . . . . . . . . . . . . . . . . . . .          7
  5    Examples of cropped and scaled AFLW dataset images in grayscale . .                9
  7    Overview of head pose methods . . . . . . . . . . . . . . . . . . . . . .         12
  8    Standard CNN architecture [36] . . . . . . . . . . . . . . . . . . . . . .        20
  9    Representation of convolutional layer during forward propagation . . .            21
  10   Hyperbolic tangent function . . . . . . . . . . . . . . . . . . . . . . . .       22
  11   Representation of max pooling during forward propagation . . . . . . .            23
  12   Representation of a gate embedded in a circuit during backpropagation             24
  13   Representation of convolutional layer during forward propagation . . .            25
  14   Derivative of hyperbolic tangent function . . . . . . . . . . . . . . . . .       27
  15   Representation of dropout in a neural network with two hidden layers [61] 31
  16   Training and testing error in plain networks [25] . . . . . . . . . . . . .       34
  18   ResNets of different depths [25] . . . . . . . . . . . . . . . . . . . . . .      36
  19   Left: original, right: pre-activated residual block [26] . . . . . . . . . .      37
  20   Architecture of reimplemented network . . . . . . . . . . . . . . . . . .         45
  21   Training losses of five-fold cross-validation on AFLW-64 dataset, reim-
       plementation of Patacchiola and Cangelosi [49], yaw angle . . . . . . .           48
  22   Implemented residual building block       . . . . . . . . . . . . . . . . . . .   50
  23   Implemented ResNet18 . . . . . . . . . . . . . . . . . . . . . . . . . . .        50
  24   Confusion matrix of the ResNet18-112 as heatmap, yaw angle . . . . .              55
  25   Confusion matrix of the ResNet18-112 as heatmap, pitch angle . . . . .            55
  26   Confusion matrix of the ResNet18-112 as heatmap, roll angle . . . . . .           56
  27   Training losses of five-fold cross-validation on AFLW-64 dataset, ResNet18-
       64, yaw angle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .     57
  28   Training losses of five-fold cross-validation on AFLW-112 dataset, ResNet18-
       112, yaw angle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    57

29   Training losses of five-fold cross-validation on AFLW-112 dataset, ResNet34-
     112, yaw angle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   58
30   Confusion matrix of the ResNet18-112 as heatmap, tested on the AFW-
     112 dataset, yaw angle . . . . . . . . . . . . . . . . . . . . . . . . . . .     59
31   Training loss on entire AFLW-64 dataset, ResNet18-64 . . . . . . . . .           60
32   Training loss on entire AFLW-112 dataset, ResNet18-112 . . . . . . . .           60
33   AFLW histogram, yaw angle with entire label range of −125◦ to 168◦ ,
     with plotted mean (solid line) and std. dev. (dashed lines) . . . . . . .        77
34   AFLW histogram, pitch angle with entire label range of ±90◦ , with
     plotted mean (solid line) and std. dev. (dashed lines) . . . . . . . . . .       78
35   AFLW histogram, roll angle with entire label range of −178◦ to 179◦ ,
     with plotted mean (solid line) and std. dev. (dashed lines) . . . . . . .        78
36   AFLW-64 histogram, yaw angle with restricted label range of ±100◦ ,
     with plotted mean (solid line) and std. dev. (dashed lines) . . . . . . .        79
37   AFLW-64 histogram, pitch angle with restricted label range of ±45◦ ,
     with plotted mean (solid line) and std. dev. (dashed lines) . . . . . . .        79
38   AFLW-64 histogram, roll angle with restricted label range of ±25◦ , with
     plotted mean (solid line) and std. dev. (dashed lines) . . . . . . . . . .       80
39   AFLW-112 histogram, yaw angle with restricted label range of ±100◦ ,
     with plotted mean (solid line) and std. dev. (dashed lines) . . . . . . .        80
40   AFLW-112 histogram, pitch angle with restricted label range of ±45◦ ,
     with plotted mean (solid line) and std. dev. (dashed lines) . . . . . . .        81
41   AFLW-112 histogram, roll angle with restricted label range of ±25◦ , with
     plotted mean (solid line) and std. dev. (dashed lines) . . . . . . . . . .       81
42   AFW histogram, yaw angle with entire label range of −105◦ to 90◦           . .   82
43   AFW-64 histogram, yaw angle with restricted label range of ±100◦ . .             82
44   AFW-112 histogram, yaw angle with restricted label range of ±100◦ . .            83
45   AFLW-112 histograms with training data distribution of the five-fold
     cross-validation for ResNet18-112 . . . . . . . . . . . . . . . . . . . . .      84
46   Training losses of five-fold cross-validation on AFLW-64 dataset, reim-
     plementation of Patacchiola and Cangelosi [49], pitch angle . . . . . . .        86

47   Training losses of five-fold cross-validation on AFLW-64 dataset, reim-
     plementation of Patacchiola and Cangelosi [49], roll angle . . . . . . . .         87
48   Training losses of five-fold cross-validation on AFLW-64 dataset, ResNet18-
     64, pitch angle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    87
49   Training losses of five-fold cross-validation on AFLW-64 dataset, ResNet18-
     64, roll angle   . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   88
50   Training losses of five-fold cross-validation on AFLW-112 dataset, ResNet18-
     112, pitch angle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .     88
51   Training losses of five-fold cross-validation on AFLW-112 dataset, ResNet18-
     112, roll angle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    89
52   Training losses of five-fold cross-validation on AFLW-112 dataset, ResNet34-
     112, pitch angle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .     89
53   Training losses of five-fold cross-validation on AFLW-112 dataset, ResNet34-
     112, roll angle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    90

List of Tables
  1    Overview of head pose datasets . . . . . . . . . . . . . . . . . . . . . .          5
  2    Comparison of human performance to a nonlinear regression approach
       with CNNs on the Prima dataset, results are the MAEs . . . . . . . . .             13
  3    Overview of recent CNN approaches on the AFLW Dataset, results are
       the MAEs, sorted by the yaw angle . . . . . . . . . . . . . . . . . . . .          18
  4    Input datasets with a restricted label range: yaw (±100◦ ), pitch (±45◦ ),
       roll (±25◦ ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   40
  5    Size of the AFLW dataset during training and testing . . . . . . . . . .           42
  6    Five-fold cross-validation: mean and standard deviation of the AFLW-
       112 training datasets, yaw angle . . . . . . . . . . . . . . . . . . . . . .       43
  7    Parameters of reimplemented network . . . . . . . . . . . . . . . . . . .          46
  8    Results of reimplementation and original approach [49] . . . . . . . . .           47
  9    Parameters of ResNet implementation for convolutional layers . . . . .             51
  10   Parameters of ResNet implementation, see also Table 9 . . . . . . . . .            52
  11   Results of ResNets tested on the AFLW-64 and AFLW-112 datasets . .                 54
  12   Results of ResNets tested on AFW dataset . . . . . . . . . . . . . . . .           58
  13   Results of the ResNets with 18 layers and of the approach of Patacchiola
       and Cangelosi [49] on the AFLW and AFW datasets, results are the
       MAEs, sorted by result on the AFW dataset . . . . . . . . . . . . . . .            62
  14   Comparison of results achieved by different methods on the AFLW and
       AFW dataset, results are the MAEs, sorted by result on the AFW dataset 63
  15   Results of the ResNets and the pre-trained networks on the AFLW
       dataset, results are the MAEs, sorted by result of the yaw angle . . . .           63
  16   Results of the self-implemented networks on the AFLW dataset in MAE
       and the number of trainable variable parameters, sorted by parameter
       number     . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   64
  17   System specifications . . . . . . . . . . . . . . . . . . . . . . . . . . . .      85
1     Introduction
Head poses are a key element of human bodily communication. When humans interact,
a lot of the communication happens non-verbally through gestures, facial expressions,
or gazes. Head poses are an integral part of gestures that can serve to give content-
related feedback, indicate the focus of attention, or express emotions. A common
content-related feedback is head-nodding or shaking, which is, depending on the
cultural context, usually interpreted as "yes" or "no" [4, p. 111]. One can also convey
spatial information by indicating the location of objects with the head's orientation.
The focus of attention can be revealed by a person's head orientation [5] or by their
gaze [30]. Furthermore, head poses help to interpret emotions. Shame, for example, is
often expressed by a lowered head and an averted gaze [30, p. 417].
Since head poses are a core principle of non-verbal human communication, they are
also important in various contexts of human-computer interaction: In human-robot
interaction, multimodal humanoid robots, which are for example used in domestic en-
vironments [73], are not only trained for abilities like speech recognition or face- and
hand-tracking, but also for head pose estimation to provide a natural interaction with
their users [65]. In the context of driver assistance systems, one of the common use cases
of head pose estimation is monitoring the driver’s field of view: By observing the head
pose, the system can estimate the driver's level of attention and prompt the driver to keep
their eyes on the road [34]. Driver assistance systems also monitor the surrounding
pedestrians’ head poses regarding their focus of attention. Together with the detection
of their position around the car, this helps to avoid collisions [18, 6]. In the field of
behavioral studies, head pose estimation allows systems to monitor social interactions
[6, 10], detect social groups via surveillance cameras [35], and observe a person’s target
of interest [38]. In combination with the orientation of the eyes, head poses are used for
gaze prediction [46, 80, 16].
When it comes to estimating head poses with computational systems, among the most
promising methods are convolutional neural networks (CNNs) [49]. CNNs are a spe-
cialized kind of feed-forward neural network for deep learning in the field of machine
learning [20], applied for processing images, videos, speech or audio [36]. CNNs are
successfully used for object recognition (i. e. classification of objects) in images, yet
there are only a few approaches that use them to estimate head poses (Sec. 2.3). One of
the best approaches using CNNs for head pose estimation is from Patacchiola and Can-
gelosi [49], who train networks of various depths on in-the-wild datasets. In-the-wild
datasets ensure real-world applicability and thus, the following datasets consisting of
images from the image hosting website Flickr1 are used for training and testing: The
Annotated Facial Landmarks in the Wild (AFLW) dataset and the Annotated Faces
in the Wild (AFW) benchmark dataset. As a starting point for the experiments, one
of Patacchiola and Cangelosi’s [49] networks, the one performing best on the AFLW
dataset, is reimplemented and evaluated on the AFLW dataset. Because Patacchiola
and Cangelosi [49] examined shallow CNNs based on the LeNet-5 [37], this thesis ex-
plores the deeper residual networks (ResNets) with the following research questions:

(1) How do ResNets of different depths perform on images with differ-
ent resolutions of the AFLW and AFW datasets?
(2) How do the implemented ResNets perform in comparison with the reim-
plemented network based on the LeNet-5?

For the reimplemented network and the ResNets, the same pre-processing and eval-
uation methods are applied. The performance is measured in mean absolute error
(MAE) and accuracy. Additionally, the number of parameters is compared for the self-
implemented networks. In the conducted experiments, ResNets of various depths and
adapted for different input sizes are implemented and then evaluated on the AFLW
dataset with a five-fold cross-validation. Furthermore, ResNets with a depth of 18
layers are trained on the entire AFLW dataset and then tested on the AFW dataset.
The results of the ResNets are further compared to other competing approaches.
The thesis is organized as follows. In Chapter 2, fundamental background informa-
tion about head pose estimation, i. e. the representation of head poses by Euler Angles,
the most commonly used datasets and an overview of head pose estimation methods, is introduced.
Chapter 3 explains the training of CNNs including pre-processing and regularization
methods as well as the functionality of ResNets. The realization and results of the
conducted experiments are described in Chapter 4, followed by a comparison of approaches
in Chapter 5, and the conclusion in Chapter 6.

  1
      https://www.flickr.com/, accessed 16.04.2018.

2     Head Pose Estimation
This chapter provides relevant background information on head pose estimation by
computational systems and is outlined as follows: First, the notation of the Euler
Angles is explained and the most commonly used datasets are introduced. Then, an overview of
head pose estimation methods including a reference to the human performance is given.

2.1    Representation of Head Poses by Euler Angles

Euler Angles generally measure the orientation of a rigid body in a fixed coordinate
system [13], whereby the Tait-Bryan angle notation, a notation form of the Euler Angles
commonly used in the aerospace context, is applied to formally define head poses. In
the Tait-Bryan notation, the three angles describing the object's pose are called yaw,
pitch and roll, commonly represented by ψ, θ and φ as in Figure 1. These three angles
can be defined by a rotation sequence of three elemental rotations. Figure 1 shows the
status of a rotated coordinate system (red: X, Y, Z) after a common intrinsic rotation
sequence of Z−Y′−X [72]. The axes x, y, and z of the blue coordinate system thereby
remain fixed as a reference coordinate system. An intrinsic rotation is about the local
axes, which are at the geometric center of the object. Thus, the object rotates about
the axes of the rotating system, in this sequence first about the Z-axis, then about the
former Y-axis, now N(y′), and lastly about the X-axis, thereby changing the axes of
the system themselves after each rotation. While the green axis N(y′) represents the
position of the Y-axis after the rotation about the Z-axis, the green axis N⊥ represents
the X-axis after the rotation about the N(y′)-axis. After the three elemental rotations,
the yaw angle ψ lies between the y-axis and the N(y′)-axis, the pitch angle θ between
the N⊥-axis and the current X-axis, and the roll angle φ between the N(y′)-axis and
the current Y-axis. Figure 2 depicts the head as an intrinsically rotated object with
orientations of the yaw, pitch and roll angles.

Figure 1: Tait-Bryan angles   2

                 Figure 2: Head orientation with yaw, pitch and roll angle [3]
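As an illustration (not part of the cited works), the following Python/NumPy sketch composes the three elemental rotations of the intrinsic Z−Y′−X sequence into one rotation matrix; the function name and the assumption that the angles are given in radians are illustrative choices.

import numpy as np

def head_rotation_matrix(yaw, pitch, roll):
    """Compose the intrinsic Z-Y'-X'' rotation sequence described above from the
    yaw (psi), pitch (theta) and roll (phi) angles, given in radians."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Rz = np.array([[cy, -sy, 0.0], [sy, cy, 0.0], [0.0, 0.0, 1.0]])   # yaw about Z
    Ry = np.array([[cp, 0.0, sp], [0.0, 1.0, 0.0], [-sp, 0.0, cp]])   # pitch about Y'
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cr, -sr], [0.0, sr, cr]])   # roll about X''
    # Intrinsic rotations compose by right-multiplication: R = Rz * Ry * Rx.
    return Rz @ Ry @ Rx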

Ferrario et al.’s [15] research shows the average head movement range of healthy young
adults. The mean ranges calculated from the data of thirty men and thirty women are
the following:

   • yaw angle: −79.8◦ to +75.3◦

   • pitch angle: −60.4◦ to +69.6◦

   • roll angle: −40.9◦ to +36.3◦
  2
Title: Taitbrianzyx.svg, Author: Juansempere, Source: https://commons.wikimedia.org/
wiki/File:Taitbrianzyx.svg, accessed 25.02.2018, Licence: Creative Commons Attribution 3.0 Un-
ported Licence (https://creativecommons.org/licenses/by/3.0/deed.en, accessed 25.02.2018).

The numbers indicate that the flexibility of the head seems to vary depending on the
direction of the movement. Since the yaw angle has the widest range, it is often used
to compare various head pose estimation approaches.

2.2    Head Pose Datasets

The quality of the dataset is one of the most critical aspects when training deep neural
networks. In the following, the datasets most recently used for head pose estimation
with CNNs are described. Table 1 shows a summarized overview of these datasets.
Since the trained systems should be robust enough for real-life situations, the trend is
to use real-life images for training and testing. Two of the described datasets (AFLW
and AFW), the ones used in this thesis’ approach, are of such nature and thus called
in-the-wild datasets.

                           Table 1: Overview of head pose datasets

 Name                  Yaw            Pitch          Roll           Number     Annotation
                                                                    of Faces   Process

 Prima                 ±90◦,          ±90◦,          not            2,790      subjects looked at
                       15◦ steps      15◦ steps      annotated                 degree markers

 Biwi Kinect           ±75◦,          ±60◦,          not            15,678     faceshift software
                       cont. values   cont. values   annotated

 Annotated Facial      −125◦ to       ±90◦,          −178◦ to       25,993     POSIT algorithm
 Landmarks in the      168◦,          cont. values   179◦,                     (manually annotated
 Wild (AFLW)           cont. values                  cont. values              landmarks)

 Annotated Faces in    −105◦ to       −45◦ to        −15◦ to        468        manually annotated
 the Wild (AFW)        90◦,           30◦,           15◦,
                       15◦ steps      15◦ steps      15◦ steps

Prima dataset [21] The Prima dataset consists of 2,790 monocular facial images
extracted from videos. The dataset is downloadable online and can be used for any
purpose, provided a source reference is given.3 Face coordinates for each image are
stored in an extra text file. The dataset consists of 15 subjects of ages 20 to 40. Five
subjects have facial hair and seven subjects wear glasses. Every image is annotated
with the angles yaw and pitch in the range of ±90◦ . The yaw angle is annotated
throughout the range in 15◦ steps. The pitch angle’s annotation is split: The range
±30◦ is annotated in 15◦ steps, while the rest of the range is annotated in 30◦ steps.
There are no images where the pitch angle is +90◦ or −90◦ , except in cases where
the yaw angle is 0◦ . Consequently, there are 93 head poses available for each person.
Figure 3 shows example images of one person with a yaw angle ranging from 0◦ to 90◦
in 15◦ steps and a constant pitch angle of 0◦ .


               Figure 3: Example of cropped and scaled Prima dataset images

A second series with different lighting makes it possible to test on known and unknown
faces.4 For testing on known faces, a two-fold cross-validation is suggested: One fold
contains the original images, while the other fold contains the images with changed
lighting. For testing on unknown faces, the Jack-Knife (also called Leave One Out)
algorithm is suggested: all images of one person are completely left out for training
and used for testing only. Thus, for this dataset the Jack-Knife algorithm amounts to
   3
       http://www-prima.inrialpes.fr/perso/Gourier/Faces/HPDatabase.html,
accessed 22.01.2018.
   4
     http://www-prima.inrialpes.fr/perso/Gourier/Faces/HPDatabase.html,
accessed 22.01.2018.

15 training iterations in total.
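A minimal sketch of such a Jack-Knife split in Python is given below; the per-image subject IDs and the function name are illustrative assumptions, not part of the Prima dataset tools.

import numpy as np

def leave_one_subject_out(subject_ids):
    """Yield (train_indices, test_indices) pairs, holding out all images of one
    subject per fold; for the 15 Prima subjects this gives 15 folds."""
    subject_ids = np.asarray(subject_ids)
    for subject in np.unique(subject_ids):
        test_idx = np.where(subject_ids == subject)[0]
        train_idx = np.where(subject_ids != subject)[0]
        yield train_idx, test_idx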
The measurement of the different poses in the Prima dataset was achieved through
the participants successively focusing on distributed post-its in the room that marked
the different yaw and pitch angles in 15◦ steps (Fig. 4). The participants were filmed
in front of a neutral background from a distance of two meters and were asked not
to move their eyes. Fixating one marker after another in a very exact
manner is a challenging task, but no further evaluation methods are mentioned
that would confirm the accuracy of the directional measurement. Thus, it can be assumed that
the dataset is error-prone, presumably with poor uniformity across different subjects
in the same pose.

              Figure 4: Prima dataset measurement setting: (a) yaw markers, (b) pitch markers

The Prima dataset is used as a benchmark in various head pose estimation approaches
[71, 76, 50, 22]. In addition, the results of persons estimating the head poses of this
dataset [22] are provided in Section 2.3.

Biwi Kinect dataset [14] This dataset contains 15,678 images of 20 people with a
yaw range of ±75◦ and a pitch range of ±60◦ . The dataset can be downloaded freely

for research purposes.5 The images provide depth and RGB data. The people were
filmed while turning their head freely in the yaw and pitch angles.
The frames of the video were annotated with the yaw and pitch angles by using the au-
tomatic system faceshift.6 Faceshift is a facial motion capture software that can capture
and describe a person's facial movement, head pose and eye gaze. This information is
for example used to animate virtual characters in movies and games.
The Biwi Kinect database is often used in approaches that consider depth data
[44, 75]. Some of the approaches are covered in Section 2.3.

Annotated Facial Landmarks in the Wild (AFLW) dataset [31] The focus
of the AFLW dataset is to provide a large variety of different faces (ethnicity, pose,
expression, age, gender, occlusion) in front of natural backgrounds and lighting condi-
tions. The images in the dataset were extracted from the image hosting website Flickr.7
The dataset contains 25,993 annotated faces in 21,997 images (Fig. 5). The 25,993
faces are annotated with up to 21 facial landmarks (Fig. 6) and head pose informa-
tion. The facial landmarks were marked manually where visible. Face coordinates
for cutting out the faces are provided. 56% of the faces are tagged as female and 44%
are tagged as male. Koestinger et al. [31] state that the rate of non-frontal faces of
66% is higher than in any other dataset. The dataset is well suited for multi-view face
detection, facial landmark localization and head pose estimation. The metadata for
the images are stored in a SQL database. For downloading the dataset and database
a registration via email is required.8
The distribution of poses of the AFLW dataset is not uniform, with very few images
for the lower and higher degrees. The label range for the yaw angle is from −125.1◦
to 168.0◦ , for the pitch angle ±90◦ and for the roll angle from −178.2◦ to 179.0◦ . The
mean of the yaw (1.9◦ ) and roll (1.0◦ ) angles are close to 0◦ , while the mean of the
   5
       https://data.vision.ee.ethz.ch/cvl/gfanelli/head_pose/head_forest.html,
accessed 02.02.2018.
   6
     http://faceshift.com/studio/2015.2/, accessed 07.04.2018.
   7
     https://www.flickr.com/, accessed 07.04.2018.
   8
     https://www.tugraz.at/institute/icg/research/team-bischof/lrs/downloads/aflw/,
accessed 22.01.2018.


      Figure 5: Examples of cropped and scaled AFLW dataset images in grayscale

                 Figure 6: Side and frontal view of facial landmarks [31]

pitch angle (−8.1◦ ) shows that this angle’s distribution is shifted to the left side. The
yaw angle has a standard deviation of ±41.8◦ , the pitch angle of ±13.4◦ and the roll
angle of ±14.0◦ . Figures 33, 34 and 35 in Appendix A depict the distribution of the
yaw, pitch and roll angle in histograms.
The head poses of the AFLW dataset were computed with the POSIT algorithm. This
algorithm takes as input an initial 3D-model and 2D images annotated with facial land-
marks. Koestinger et al. [31] use a 3D mean model [66] of the front face as the initial
3D-model. For pose estimation the facial landmarks of the 2D-model are fitted onto
the 3D-model and the error between the 3D-model points and the 2D image points is
minimized with the help of the POSIT algorithm [12]. The POSIT algorithm needs at
least four non-coplanar reference points on the 3D-model, their corresponding points
on the 2D-model and the focal length of the camera as an input. The POSIT algorithm
itself consists of two steps. The first step is called Pose from Orthography and Scaling
(POS) and computes the approximate scaled projection and the scaled orthographic
projection (SOP). The rotation matrix and the translation matrix are computed from
the linear system. The two rows of the rotation matrix, i and j, and the z-coordinate of
the translation provide the angles of the pose in continuous values. The second
step of the algorithm consists of a few iterations to improve the approximate pose of
the first step. The algorithm computes the next SOPs out of the pose of the previous
step. Consequently, the feature points are shifted closer towards the correct position
every time. These new SOPs are then used again as input to the POS algorithm and
so on. The algorithm usually ends after four or five iterations.
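For illustration only, the following rough Python sketch follows the POSIT procedure as described above (object matrix and its pseudoinverse, POS step, iterative refinement of the SOP); it is not the implementation used by Koestinger et al. [31], and it assumes image coordinates relative to the image center and a known focal length in pixels.

import numpy as np

def posit(object_points, image_points, focal_length, n_iter=5):
    """Rough sketch of the POSIT algorithm. Assumptions: object_points are
    N >= 4 non-coplanar 3D model points with the first point as reference,
    image_points are the corresponding 2D points relative to the image center,
    and focal_length is given in pixels."""
    P = np.asarray(object_points, dtype=float)
    p = np.asarray(image_points, dtype=float)
    A = P[1:] - P[0]                       # object vectors M0Mi, shape (N-1, 3)
    B = np.linalg.pinv(A)                  # pseudoinverse used to solve the linear system
    w = np.ones(len(P) - 1)                # homogeneous correction terms (1 + eps_i)

    for _ in range(n_iter):
        # POS step: solve for the scaled rotation rows from the current SOP
        I = B @ (p[1:, 0] * w - p[0, 0])
        J = B @ (p[1:, 1] * w - p[0, 1])
        s = (np.linalg.norm(I) + np.linalg.norm(J)) / 2.0   # scale s = f / Tz
        i, j = I / np.linalg.norm(I), J / np.linalg.norm(J)
        k = np.cross(i, j)                 # third row of the rotation matrix
        Tz = focal_length / s
        w = 1.0 + (A @ k) / Tz             # refine the SOP for the next iteration

    R = np.vstack([i, j, k])               # rotation matrix; Euler angles follow from R
    t = np.array([p[0, 0] / s, p[0, 1] / s, Tz])
    return R, t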
Koestinger et al. [31] note in their paper that the resulting poses from the POSIT
algorithm are not manually verified and that they see the poses rather as a rough
estimation. The angle distribution of the dataset could be an indicator for errors in
the pose annotation: As described in this section, the AFLW dataset has extremely
wide ranges for all angles, which far exceed the average head movement ranges [15]
described in Section 2.1. Thus, the poses near the borders of the ranges are not realistic
head orientations. It is pointed out on the website9 that the dataset is continuously
improved. Upon email inquiry, the administrator stated that the image
annotation underwent a major improvement at the beginning of 2012. Unfortunately, the
administrator could not provide further details on the improvement, and to the
best of my knowledge no further papers were released on that matter.
The AFLW dataset is often used in combination with the AFW dataset as a benchmark
in the challenging field of head pose estimation in the wild [49, 33, 55, 67, 27, 70, 1, 82].
This dataset is also used for the following experiments (Ch. 4).

Annotated Faces in the Wild (AFW) dataset [82] The AFW dataset has simi-
lar features to the AFLW dataset, since its images are also extracted from the image
hosting website Flickr.10 It was proposed by Zhu and Ramanan to validate their face
detection and head pose estimation methods with natural images [82]. The download
of the dataset includes the code and proposed model of Zhu and Ramanan.11 The
AFW dataset shows a wide variety of ethnicity, pose, expression, age, gender and oc-
clusion. The faces are positioned in front of natural cluttered backgrounds. There are
  9
      https://www.tugraz.at/institute/icg/research/team-bischof/lrs/downloads/aflw/,
accessed 22.01.2018.
  10
     https://www.flickr.com/, accessed 07.04.2018.
  11
     https://www.ics.uci.edu/~xzhu/face/, accessed 22.01.2018.

468 faces in 205 images. Each of the faces is labeled with a bounding box so the faces
can be extracted from the images. They are manually annotated with six landmarks
(centers of the eyes, tip of nose, the two corners and center of mouth) and the head
pose angles pitch, yaw and roll. The yaw angle has a range from −105◦ to 90◦ , the
pitch angle from −45◦ to 30◦ and the roll angle from −15◦ to 15◦ , all annotated in
15◦ steps. Compared to the Prima and AFLW dataset, the negative and positive head
pose labels are interchanged.
Since the yaw angle has the widest range, this angle is normally used when testing with
this dataset. The yaw angle of the dataset has a mean of 2.6◦ , which shows that the
dataset is centered closely around zero degrees, and a standard deviation of ±38.4◦ .
The yaw’s mean and standard deviation are similar to the mean and standard deviation
(1.9◦ ± 41.8◦ ) of the AFLW dataset. A visualization of the image distribution is given
in Figure 42 in Appendix A.

2.3    Methods for Head Pose Estimation

This section presents the most commonly used computational methods for head pose
estimation (Fig. 7b-d) and the human performance in head pose estimation (Fig. 7a).
The human performance is presented as a real-life reference for the computational
methods. The different computational methods used for head pose estimation can be
categorized into appearance-based methods (Fig. 7b), model-based methods (Fig. 7c)
and nonlinear regression methods (Fig. 7d).

Human performance To the best of my knowledge, Gourier et al. [22] are the only
ones who measured human performance on a dataset that is also used for computational
head pose estimation. For their experiment, they employ a nonlinear regression method
trained on the Prima dataset and compare these results with human performance on
the same dataset. Gourier et al. [22] worked with greyscale images scaled to 23 x 30
pixels. They measured the human performance with a group of 72 subjects, consisting
equally of female and male subjects in the age range from 15 to 60 years. Images

Figure 7: Overview of head pose methods: (a) human performance (adapted from [46]), (b) appearance-based [46], (c) model-based [46], (d) nonlinear regression [46]

from the Prima dataset were presented to the subjects for 7 seconds in a random
order and the subjects had to choose the right yaw and pitch angle from a set of
preselected answers. Gourier et al. [22] note that the psycho-physical basis for humans
to estimate head poses from static images is unknown. Hence, to find out whether humans
need training to recognize head poses from static images, they split the subjects
into two groups, where one group was trained to recognize head poses from the static
images beforehand and the other group was not. The evaluation of the two groups
shows a significant performance difference in the pitch angle but not in the yaw angle.
A possible explanation is that the yaw angle provides a sufficient indication of the
other person's focus when engaging in a conversation, whereas the pitch angle is less
important, even more so when people are sitting with their heads at approximately
the same height [63]. In summary, humans seem to be better at recognizing the
yaw angle. The results for the pitch angle show that the group with training
had a 3.2◦ lower MAE than the group without training. For both angles, the front
and profile angles were recognized best. Table 2 provides a comparison of the human
performance and the main competing approach [49] of this thesis. Patacchiola and
Cangelosi [49] employed a CNN for training, but with a higher image resolution (64 x
64 pixels) than Gourier et al. [22]. The results for the human performance (Table 2)
measured by Gourier et al. [22] are the average values of the two groups. Considering
the constraints of the static images and their low image resolution, the results of the
human performance can only be seen as an indication, not as a measure of humans'
general estimation ability.

Table 2: Comparison of human performance to a nonlinear regression approach with CNNs
on the Prima dataset, results are the MAEs

              Approach           Network          Yaw        Pitch
              Gourier et al.     human per-       11.9◦      11◦
              [22]               formance

              Patacchiola        based on         7.74◦      10.57◦
              and Cangelosi      LeNet-5
              [49]

Appearance-based methods This method compares head images with already an-
notated exemplary heads. The exemplary heads are annotated with discrete poses and
ordered in a systematic structure. To estimate the head pose of a new image, the
head’s view is compared to the pool of heads in order to find the most similar view
[46] (Fig. 7b). There are different metrics for comparing a new head’s view to the
set of examples. Beymer uses a mean squared error over different subportions of the
image [7]. In his approach, the hierarchical feature finder first detects the two eyes
and at least one nose feature in the input image. On the basis of the feature positions, the
pose is searched hierarchically in a coarse-to-fine strategy in the pyramid-ordered head
pool. The possible poses need to pass a thresholding correlation test, where different
subportions of the image are evaluated and thus lead to the supposedly best fitting
pose. Niyogi and Freeman [47] learn to map input images non-linearly to parametric
descriptions. To estimate the pose of a new input image they use a non-parametric
estimation technique, the so-called Vector Quantization, which is closely related to
the nearest neighbor algorithm, but with a vector as output.
These appearance-based methods have the advantage of a relatively simple implemen-
tation. Moreover, the set of templates for comparison can be easily expanded at any time and
therefore allows the system to be adapted easily to new conditions. The main drawback
of the appearance-based methods is that the whole concept is based on the premise
that similar images also have similar head poses. However, when images of
different persons display similar head poses, the influence of identity can outweigh
the influence of a different pose, which makes template matching difficult [46].
Some researchers address the problem of incorrect template matching. One approach is
to use Gabor wavelets [58, 78]. The Gabor wavelets extract the global and local features
from the image while preserving the structural frequency information [42]. The images
are then convolved with related kernels. Wu and Trivedi [78] use the Gabor wavelets for
feature detection on their input images. They use the extracted features for a nearest
prototype matching by measuring the Euclidean distance. Another strategy developed
to overcome this problem is the use of Laplacian-of-Gaussian filters. For preparation, the image
is first smoothed with a Gaussian kernel to decrease its noise sensitivity. The Laplacian
filters then detect areas of rapid intensity change and thus can identify edges in images.
The Laplacian filters bring out common facial features and diminish distinctive facial
features [19].
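As a small illustration of this strategy, the following sketch applies a Laplacian-of-Gaussian filter with SciPy; the chosen sigma is an assumed example value and is not taken from the cited approaches.

import numpy as np
from scipy import ndimage

def laplacian_of_gaussian(image, sigma=2.0):
    """Smooth the image with a Gaussian kernel (controlled by sigma) and apply the
    Laplacian in one step; the filter responds to areas of rapid intensity change,
    i.e. edges. The value of sigma is an assumed example, not taken from [19]."""
    return ndimage.gaussian_laplace(np.asarray(image, dtype=float), sigma=sigma)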
However, due to the serious limitations of appearance-based methods, recent approaches
in head pose estimation tend to use other methods.

Model-based methods This method uses geometric measurements, facial key-point
locations and averaged models of the face. The model-based approach is based on the
assumption that humans estimate head poses from symmetry deviation. For humans,
the most important indications are the movement of the head from one side to the
other and the upward and downward movement of the nose [77].

One common model-based approach is to estimate the head pose by computing the ge-
ometric relation of facial key-points to a mapped facial model [43, 17]. It is important
to choose a few reliable key-points that can be reused across a range of different faces.
The facial model is created by averaging the structure of faces. Gee and Cipolla [17]
identify the eyes, the nose and the corners of the mouth as reliable facial key-points
(Fig. 7c). The symmetry axis of these five points is the line connecting the midpoint
between the eyes and the center of the mouth. They estimate the head pose
with the 3D-information of the nose position by assuming a fixed nose length. This is
similar to the POSIT algorithm (Sec. 2.2) that operates with key-points of a 2D-model
and an averaged 3D-model, where the key-points of the 2D-model are fitted onto the
3D-model to estimate the head pose.
Another common model-based approach is to compare the facial key-points with an
underlying geometric coordinate system instead of an averaged facial model. The head
pose is estimated from the distance of the facial key-points to the reference coordinate
system [39]. The drawback of this method is that it requires a very high precision for
certain angles of the face in order to have an accurate head pose estimation.
The overall advantage of model-based methods is that the head pose can be obtained
with only a few facial key-points. The drawback is that the accuracy of the key-point
detection is closely linked to the accuracy of the head pose estimation. Especially oc-
clusions can interfere with the detection of facial key-points. For example, if a person
wears glasses the eyes cannot be detected properly, but the position of the eyes is im-
portant for head pose estimation with model-based methods.
In contrast to the appearance-based approach, where the whole face is used for compar-
ison, the model-based approach only uses the key-points. The model-based approach
is therefore in general more sensitive to occlusions and less robust to detection failures
than the appearance-based approach.

Nonlinear regression methods There are different techniques to learn a nonlinear
mapping from head images to pose angles (Fig. 7d). The prerequisite for good accu-
racy in this method is the training with a consistent and labeled dataset of sufficient
size. Examples of datasets used with nonlinear regression methods are introduced in
Section 2.2.

In earlier years multilayer perceptrons (MLPs) and support vector regression were
mainly used as nonlinear regression methods. In 1995, Shiele and Waibel [57] realized
one of the first approaches with MLPs for learning the yaw angle of the head orienta-
tion. They used the angle for identifying the focus of attention. Their MLP network
had one hidden layer (50 neurons) and was trained with standard back-propagation,
with which they achieved good results. Rainer Stiefelhagen [64] also used an MLP net-
work with one hidden layer for learning the pitch and yaw angle. He achieved good
results on the Prima dataset (Sec. 2.2). As another regression method, support vector
regression was successfully used in combination with dimensionality reducing methods
such as principal component analysis [40] and localized gradient orientation histograms
[45].

Nonlinear regression with CNNs Around 2007, research papers using CNNs
emerged in the field of head pose estimation. The advantage of CNNs in contrast
to MLPs is their high tolerance to shift and distortion variance. One of the first ap-
proaches for head pose estimation with CNNs was proposed by Osadchy et al. [48].
They describe a real-time method for simultaneous face detection and head pose esti-
mation, as these two tasks are closely related. Their approach is based on energy-based
models and CNNs. The CNN involved is similar to LeNet-5 [37], but has more feature
maps. LeNet-5 was first published by LeCun [37], who is also one of the authors of
the aforementioned paper [48]. Osadchy et al. [48] collected and annotated 30,000 im-
ages under laboratory conditions and trained the yaw angle and the in-plane rotation.
Allowing an error of 15◦ , they report over 80% accuracy for the yaw angle
and 95% accuracy for the in-plane rotation when testing on standard datasets of
that time. Unfortunately, they do not compare their results to other approaches. Ad-
ditionally, they do not share the mean absolute error, which is a more accurate metric
for comparison. In 2014, Ahn et al. [2] reported the best result until then on the Biwi
Kinect head pose dataset (Sec. 2.2). They used a CNN of four convolutional layers and
two fully connected layers, which did not consider depth data, but outperformed in speed
and accuracy other approaches that used the depth information, such as Fanelli et
al. [14], who used random forest regression.

Other CNN approaches considered depth information in addition to RGB images. Mukherjee
[44] and Venturelli et al. [75] use images from in-car scenarios. Mukherjee [44] trained
the RGB and depth data separately on GoogleLeNet [69] models. Venturelli et al.
[75] trained with a shallower network of five convolutional layers and three fully
connected layers and directly included the depth data. Both approaches used one net-
work to train all three angles and tested it on the Biwi Kinect dataset for comparison.
Venturelli et al. [75] outperformed Mukherjee’s [44] approach.
There are three recent approaches that tested their CNNs on the AFLW and AFW
datasets [55, 33, 49] (Table 3). The first approach is from Ruiz et al. [55]. Their ar-
chitecture is based on the ResNet50 architecture, combined with three losses, one for each
head pose angle, each consisting of a mean squared error and a cross-entropy term. For testing on the AFLW
dataset, they use a ResNet with 50 layers (ResNet50), which they pre-trained on the
ImageNet dataset12 and finetuned on the AFLW dataset. However, it is not clear
what fraction of the AFLW dataset they used for testing. They do not mention cross-
validation, which makes it less comparable to the approach of this thesis. They also test on
the AFW dataset, with a ResNet50 model trained on the synthesized 300W-LP dataset
[81]. According to Ruiz et al. [55], the pre-trained ResNet50 achieves very good results
on the AFW dataset. However, these results are not comparable to the approach of this thesis
(Ch. 4), where the AFW dataset is used to test networks trained on the entire AFLW
dataset.
Kumar et al. [33] present a novel architecture called Heatmap-CNN (H-CNN), which
can learn local and global structural dependencies. They use it for learning key-points
and also learn the head poses as a by-product. The H-CNN architecture contains In-
ception modules [69], which are based on a similar architectural idea to the residual
building blocks of the ResNet (Sec. 3.4). Kumar et al. [33] mention finetuning on
the AFLW dataset, but provide no further information on the pre-training. As in the
approach of Ruiz et al. [55], it is not clear what fraction of the AFLW dataset they
use for testing and whether they employed cross-validation. This makes it less compa-
rable to the approach of this thesis on the AFW dataset. Kumar et al. [33] also test their
trained network on the AFW dataset. They display the result in a cumulative error
 12
      http://www.image-net.org/, accessed 22.01.2018.

distribution, which shows very good results for their approach. Unfortunately, they
do not provide the result as mean absolute error (MAE), so their result on the AFW
dataset is not comparable to the following experiments (Ch. 4).
The last of the three recent approaches stems from Patacchiola and Cangelosi [49].
They evaluate different adaptive gradient methods in combination with four architec-
tures of varied layer complexity. Patacchiola and Cangelosi [49] trained on the Prima
dataset as well as on the AFLW dataset and used the AFW as a test dataset for a
model trained on the whole AFLW dataset. Their best network trained on the AFLW
dataset has three convolutional layers, two fully connected layers and is structurally
based on the LeNet-5 architecture [37]. This work includes a reimplementation of their
best network trained on the AFLW dataset (Sec. 4.3).

Table 3: Overview of recent CNN approaches on the AFLW Dataset, results are the MAEs,
sorted by the yaw angle

                                                                  AFLW
  Approach                      Network
                                                         Yaw      Pitch      Roll
  Patacchiola and Can-          based on LeNet-5         9.51◦    6.8◦      4.15◦
  gelosi [49]

  Kumar et al. [33]             Pre-trained H-CNN        6.45◦    5.85◦     8.75◦

  Ruiz et al. [55]              Pre-trained multi-loss   6.26◦    5.89◦     3.82◦
                                ResNet50

3     Convolutional Neural Networks (CNNs)
Focusing on object recognition in images, this chapter explains the architecture and
training behavior of classic CNNs and introduces various pre-processing and regular-
ization techniques. Furthermore, the residual network architecture and its differences
from classic CNN architectures are characterized.

3.1    Training with CNNs

There are two information flows while training CNNs: One from the input to the out-
put of the network called forward propagation (Sec. 3.1.1) and one from the output
to the input called backpropagation (Sec. 3.1.2). The forward propagation and back-
propagation are explained in the following with the help of a classic CNN architecture
as shown in Figure 8.
The architecture of a classic CNN includes an input layer, an output layer and several
hidden layers in between (Fig. 8). The layers between the input layer and the output
layer are called hidden layers, because they are not accessible from the outside of the
network. The three basic types of hidden layers are the convolutional layer, the activa-
tion layer and the pooling layer [20, p. 336]. Normalization layers like batch normal-
ization or local response normalization are also often used as hidden layers (Sec. 3.3).
The classic CNN depicted in Figure 8 has a convolutional layer as input layer, sets
consisting of convolutional, activation and pooling layers as hidden layers and a fully
connected layer as output layer. The output of the last layer is used for classification.
The input of the CNN is an RGB image of a Samoyed dog. Because the image has three
color channels (red, green and blue), the input image has a depth of three. The output
class Samoyed has the highest probability, so the image is classified correctly. The
layers and their functionalities are explained in the following.

Figure 8: Standard CNN architecture [36]

3.1.1   Forward Propagation

The images are fed into the network as 2D arrays, i. e. matrices, which contain the
values of the image pixels. Most networks require that all matrices are of the same size
with the height equalling the width.
The convolutional layers have a defined number of kernels, the so-called filters, which
are convolved with the 2D input to extract features. The weights of the filters are initialized at
the beginning of the training and adjusted during training through backpropagation.
The filters need to be of smaller size than the input, but with the same depth. They
slide like a window with a defined stride over the input. Hereby, the weights stay the
same at every position and are therefore also called shared weights [36]. The stride defines the number of
pixels the filter moves horizontally or vertically in one step. A neural network can also
use a bias in addition to the weights [36]. Then, a single bias with a constant value is
added to all nodes of each layer. This helps to shift the activation functions to the left
or right during training. The bias is also updated in backpropagation [56].
Figure 9 shows the convolution of the input matrix X (size 3 x 3) with the filter W
(size 2 x 2) to output H (size 2 x 2). To simplify the representation the bias is omit-
ted. As required, the filter W is smaller than the input X. The filter W slides over
X with a defined stride of one pixel, as indicated with the red and blue rectangle in

Figure 9: Representation of convolutional layer during forward propagation (adapted from
[20, p. 330])

the representation of the input X. Each pixel of the 2 x 2 pixel units of input X is
multiplied with the corresponding weight of the filter W . The multiplied pixel values
are added up as shown in the fields of output H. In this example, input X has a depth
of one. If the input is deeper, the filter has the same depth [36]. Figure 9 depicts an
example, where the input size is not preserved during the convolution. The output
size is smaller, because not all pixels can be covered with the sliding filter W of stride
one. Both the size of the filter and the stride affect the output size. To preserve the
input size throughout convolution, it is common to use zero padding, which expands
the matrix by adding zeros [20, p. 349].
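The convolution described above can be sketched in a few lines of Python/NumPy; this is a didactic single-channel version without zero padding, not an optimized implementation.

import numpy as np

def conv2d_valid(x, w, stride=1, bias=0.0):
    """Forward pass of a convolutional layer for a single-channel input x and one
    filter w, without zero padding ('valid' convolution); the filter slides over
    the input with the given stride, and each window is multiplied element-wise
    with the shared weights and summed up."""
    out_h = (x.shape[0] - w.shape[0]) // stride + 1
    out_w = (x.shape[1] - w.shape[1]) // stride + 1
    h = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            patch = x[r * stride:r * stride + w.shape[0],
                      c * stride:c * stride + w.shape[1]]
            h[r, c] = np.sum(patch * w) + bias
    return h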
One filter produces one output matrix, which is called a feature map [36]. Feature
maps of lower convolutional layers detect mostly low level features like edges. Feature
maps of subsequent layers can detect more complex features like combinations of edges
and even combinations of features that resemble object parts. The objects are then
classified as a learned combination of these parts.
The layer following the convolutional layer contains an activation function. Most of
these functions introduce non-linearity by applying a fixed mathematical function to
each value. The graph in Figure 10 shows the hyperbolic tangent activation function
tanh(x), which will also be used in the following experiments (Ch. 4). "Hyperbolic"
refers to plane geometry, where a hyperbola is a special curve consisting of two
symmetrical branches extending into infinity. tanh(x) is a continuous function that
computes values in the range of [−1, 1]. The function is defined as the ratio of the
hyperbolic sine and cosine functions (Eq. 1).

    tanh(x) = \frac{\sinh(x)}{\cosh(x)} = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}    (1)

                        Figure 10: Hyperbolic tangent function
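As a small illustration of Equation 1, the activation can be applied element-wise to an output matrix; the values below are placeholders:

    import numpy as np

    H = np.array([[-3.0, -0.5],
                  [ 0.5,  3.0]])   # placeholder feature-map values
    A = np.tanh(H)                 # applied element-wise, all outputs lie in [-1, 1]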

The pooling layer is placed after the activation layer and reduces the size of the feature
map. Similar features are hereby merged into one feature [36]. Figure 11 shows the
max pooling method, which picks the highest value from each position of the window.
The window of size 2 x 2 slides over input X with a stride of two. The red and the blue
square depict the first and second position of the window. Hence, the max value of
the red square is the first value in the output matrix H. The output H is half the size
of the input X, because a stride of two is used. Another widely used pooling method
is average pooling, where the average value of the pixels in the respective window is
calculated.

Figure 11: Representation of max pooling during forward propagation
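The max pooling of Figure 11 can be sketched in the same NumPy style; the 4 x 4 input values are again placeholders:

    import numpy as np

    def max_pool2d(X, size=2, stride=2):
        """Max pooling: keep the largest value inside each window position."""
        out_h = (X.shape[0] - size) // stride + 1
        out_w = (X.shape[1] - size) // stride + 1
        H = np.zeros((out_h, out_w))
        for i in range(out_h):
            for j in range(out_w):
                H[i, j] = X[i * stride:i * stride + size,
                            j * stride:j * stride + size].max()
        return H

    X = np.array([[1., 3., 2., 4.],
                  [5., 6., 7., 8.],
                  [3., 2., 1., 0.],
                  [1., 2., 3., 4.]])
    H = max_pool2d(X)   # 2 x 2 output: window 2 x 2 with stride 2 halves the input size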

As the pooling layer reduces the size of the feature map, the number of filters usually
increases. This design principle was introduced in the pioneering architecture LeNet-5 [36].
The VGGNet [59] uses and extends this design principle in the following ways: (1) If the
size of the feature maps stays the same, the number of filters also remains unchanged.
(2) If the size of the feature maps is halved, the number of filters doubles. These design
principles are also adopted by residual networks (Sec. 3.4).
The fully connected layers are placed at the end of the network to identify the most
important features for classifying the object. They connect every input neuron to every
neuron of the next layer. This can be compared to the behavior of hidden layers in an
MLP [53]. Fully connected layers need 1D arrays as input. Thus, the 2D outputs of
the pooling layer are flattened into 1D arrays, which also removes the spatial relations.
If more than one fully connected layer is employed, activation functions introduce non-
linearity between them. The last fully connected layer is the output layer, whose size
corresponds to the number of output classes.
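A minimal sketch of the flattening step followed by one fully connected layer is given below; the number of feature maps, their size, and the three output classes are purely illustrative and do not correspond to the networks used later:

    import numpy as np

    feature_maps = np.random.randn(4, 2, 2)   # e.g. 4 pooled feature maps of size 2 x 2

    x = feature_maps.flatten()                # 1D vector of length 16, spatial relations are lost
    W_fc = np.random.randn(3, x.size)         # weights: every input connects to every output neuron
    b_fc = np.zeros(3)                        # one bias per output neuron

    logits = W_fc @ x + b_fc                  # one value per output class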

3.1.2   Backpropagation

Backpropagation is used to minimize the computed loss, i. e. the error of the network,
by updating the filter weights with the gradient information [56]. The goal is to find the
global minimum with the lowest possible error. The gradient vector indicates for each
weight by how much the error of the network would decrease or increase if that weight
were changed. In the following experiments (Ch. 4), the sum of squared differences between
the predicted values ŷ and the target values y is used to calculate the loss E (Eq. 2).
The parameter n refers to the number of training examples in one training iteration.

    E = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2    (2)
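A minimal sketch of this loss for one training iteration, with hypothetical target and predicted values:

    import numpy as np

    def sum_of_squares_loss(y, y_hat):
        """Sum of squared differences over the n examples of one iteration (Eq. 2)."""
        return np.sum((y - y_hat) ** 2)

    y     = np.array([10.0, -5.0, 30.0])   # hypothetical target values
    y_hat = np.array([12.0, -4.0, 27.0])   # hypothetical network predictions
    E = sum_of_squares_loss(y, y_hat)      # (-2)^2 + (-1)^2 + 3^2 = 14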

The gradients for updating the weights are computed with the chain rule. In Figure 12,
the chain rule is illustrated on a logic gate embedded in a circuit. The black
arrows show the forward propagation and the red arrows show the backward propagation.
The inputs x and y are propagated forward. The function f is applied to the
inputs x and y and the output z is computed. The loss is computed at the end of the
network and is used to calculate the gradients from back to front by the chain rule. In
the backpropagation, the gate receives as input the gradient ∂E/∂z, which describes how
the loss changes with respect to z. This gradient is used to compute the gradients of x and
y by the chain rule. To compute the gradient of x, the local gradient ∂z/∂x is multiplied
with ∂E/∂z.
The chain rule is repeatedly applied to each gate in the circuit from back to front.

           Figure 12: Chain rule representation on a gate embedded in a circuit   13

In a CNN, the chain rule is repeatedly applied to compute the gradients of the weight
matrix and the input matrix in the convolutional layers [20, Ch. 9]. Figure 13 shows
the calculation of gradients for the weight matrix. It is the same setting as in Figure 9,
but with the information flowing from back to front. G contains the gradients coming
from the end of the network, which are used to compute the gradients of the loss with
respect to the weights in filter W, using the values of input X. These weight gradients,
shown in the figure inside filter W, are used to update the weights. The gradients with
respect to input X are computed analogously, using the weights in W, and are used to
propagate the error further back through the network.

13 Visualization adapted from: https://kratzert.github.io/2016/02/12/understanding-the-gradient-flow-through-the-batch-normalization-layer.html, accessed 04.03.2018.
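The following sketch continues the forward-propagation example (it reuses conv2d_valid, X, and W from the sketch in Section 3.1.1); the incoming gradients G are placeholder values, and the rule for the input gradient is stated only for completeness:

    # conv2d_valid, X and W are taken from the forward-propagation sketch above
    import numpy as np

    G = np.array([[0.1, -0.2],
                  [0.3,  0.4]])   # gradients arriving from the end of the network

    # gradient of the loss w.r.t. the filter weights: correlate the input X
    # with the incoming gradients G (valid mode, stride one)
    dW = conv2d_valid(X, G)                           # same shape as W (2 x 2)

    # gradient w.r.t. the input: correlate the zero-padded G with the filter
    # rotated by 180 degrees; used to pass the error further back
    dX = conv2d_valid(np.pad(G, 1), np.rot90(W, 2))   # same shape as X (3 x 3)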

          Figure 13: Representation of convolutional layer during backpropagation   14

Updating the weights aims to minimize the error of the network. To minimize the
network error, the weights are shifted in the opposite direction of the gradient (Eq. 3),
as the gradient points in the direction of the greatest increase. η describes the step size
of the weight update and is called the learning rate. The network is updated repeatedly
during training until the global minimum is found, the defined number of steps is reached,
or another stop criterion is met.

    \Delta w_{ij} = -\eta \frac{\partial E}{\partial w_{ij}}    (3)
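Applied to a single filter, the update rule of Equation 3 is one line; the learning rate and the gradient values below are illustrative:

    import numpy as np

    eta = 0.01                                 # learning rate (illustrative value)
    W   = np.array([[1., 0.], [0., 1.]])       # current filter weights
    dW  = np.array([[0.9, 1.1], [1.5, 1.7]])   # gradients of the loss w.r.t. the weights

    W = W - eta * dW                           # shift each weight against its gradient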

There are several gradient algorithms that improve this basic algorithm [54]. Two
of them are used in the following experiments (Ch. 4): SGDMomentum, which is
Stochastic Gradient Descent (SGD) with a momentum term, and RMSProp.
The SGDMomentum method uses a momentum term because, while minimizing the error,
the weights moving towards the global minimum can get stuck in local minima on the
error surface, which resemble ravines in a 3D landscape. The momentum overcomes this
problem by adding a fraction of the previous weight update to the current one [56]. This
smooths out sharp curvature on the error surface of the weight space and influences the
direction of the next weight movement, so ravines can be crossed more easily. SGD
methods in general, and the SGDMomentum method in particular, are often used with a
pre-defined schedule for decaying the learning rate [11, 52]. This means that the step size
of the weight updates is decreased during the learning process.

14 Visualization adapted from: https://becominghuman.ai/back-propagation-in-convolutional-neural-networks-intuition-and-code-714ef1c38199, accessed 04.03.2018.
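A minimal sketch of one SGDMomentum step; the momentum coefficient mu and all parameter values are illustrative assumptions, not the settings used in the experiments:

    import numpy as np

    def sgd_momentum_step(w, grad, velocity, eta=0.01, mu=0.9):
        """One SGD update with momentum: a fraction mu of the previous weight
        update is added to the current one, smoothing the path across ravines."""
        velocity = mu * velocity - eta * grad
        return w + velocity, velocity

    # illustrative usage on a single weight matrix
    w, v = np.zeros((2, 2)), np.zeros((2, 2))
    grad = np.array([[0.9, 1.1], [1.5, 1.7]])
    w, v = sgd_momentum_step(w, grad, v)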
The RMSProp is an adaptive gradient method, which was proposed by Geoffrey Hinton
in his Coursera Class.15 The group of adaptive gradient methods uses gradient infor-
mation for decreasing the learning rate automatically. As shown in Equation 4, RMSProp
keeps a decayed running average of all previous squared gradients as second-order infor-
mation. The running average E[g²]_t combines the previous average E[g²]_{t−1}, decayed
by the parameter γ, with the current squared gradient g_t², weighted by (1 − γ). The
learning rate is then divided by √(E[g²]_t) and thus effectively decreases over time.

    E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2    (4)
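A sketch of one RMSProp step following Equation 4; the small constant eps is added only for numerical stability and is not part of the equation, and all parameter values are illustrative:

    import numpy as np

    def rmsprop_step(w, grad, avg_sq_grad, eta=0.001, gamma=0.9, eps=1e-8):
        """One RMSProp update: keep a decayed running average of the squared
        gradients (Eq. 4) and divide the learning rate by its square root."""
        avg_sq_grad = gamma * avg_sq_grad + (1.0 - gamma) * grad ** 2
        w = w - eta * grad / (np.sqrt(avg_sq_grad) + eps)
        return w, avg_sq_grad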

Usually, all gradient methods use the mini-batch variant to update the trainable net-
work layers after n training examples. The mini-batch update rule is a combination of the
update rules of batch gradient descent (BGD) and SGD [54]. The BGD computes the
gradient based on the complete training dataset, while the SGD updates the network
after each training example. The update rule of the BGD leads to a very slow convergence
towards the global minimum, as the weights are only updated after the entire training
dataset has been processed. On the other hand, the update rule of the SGD can lead to a
high variance during convergence, because it updates the weights after every single training
example. The mini-batch update rule provides a solution for both disadvantages by
updating the network after n training examples, which results in a more stable and faster
convergence with reduced variance.
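A sketch of how training data can be split into mini-batches of n examples; the helper names, the batch size of 32, and the commented training loop are hypothetical and only indicate where the weight update would happen:

    import numpy as np

    def iterate_minibatches(X_train, y_train, batch_size=32):
        """Yield shuffled mini-batches; the weights are updated once per batch."""
        indices = np.random.permutation(len(X_train))
        for start in range(0, len(X_train), batch_size):
            batch = indices[start:start + batch_size]
            yield X_train[batch], y_train[batch]

    # hypothetical training loop (forward_backward and update_weights are placeholders):
    # for X_batch, y_batch in iterate_minibatches(X_train, y_train, batch_size=32):
    #     loss, grads = forward_backward(model, X_batch, y_batch)
    #     model = update_weights(model, grads)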
15 http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf, accessed 07.03.2018.
