# Head Pose Estimation using Deep Learning


Master's thesis in fulfillment of the requirements for the degree of Master of Science in Computing in the Humanities

Faculty of Information Systems and Applied Computer Sciences, University of Bamberg

Author: Ines Sophia Rieger (Matr. No. 1838490)
Supervisor: Prof. Dr. Ute Schmid

In cooperation with the Fraunhofer Institute for Integrated Circuits IIS, Group for Intelligent Systems, supervised by Thomas Hauenstein and Sebastian Hettenkofer

Bamberg, April 16, 2018

## Abstract

Head poses are an important means of non-verbal human communication and thus a crucial element in human-computer interaction. While computational systems have been trained with various methods for head pose estimation in recent years, approaches based on convolutional neural networks (CNNs) for image processing have so far proven to be among the most promising. This master's thesis starts off by improving head pose estimation through reimplementing a recent CNN approach based on the shallow LeNet-5. As a new approach in head pose estimation, this thesis focuses on residual networks (ResNets), a subgroup of CNNs specifically optimized for very deep networks. To train and test the approaches, the Annotated Facial Landmarks in the Wild (AFLW) dataset and the Annotated Faces in the Wild (AFW) benchmark dataset were used. The performance of the reimplemented network and of the implemented ResNets of various architectures was evaluated on the AFLW dataset, measured in mean absolute error and accuracy. Furthermore, the ResNets with a depth of 18 layers were tested on the AFW dataset. The best performance of all implemented ResNets was achieved by the 18-layer ResNet adapted to an input size of 112 x 112 pixels. Compared with the reimplemented network and other state-of-the-art approaches, the best ResNet performs equally well or better on the AFLW dataset and outperforms them on the AFW dataset.

## Contents

- List of Abbreviations
- List of Figures
- List of Tables
- 1 Introduction
- 2 Head Pose Estimation
  - 2.1 Representation of Head Poses by Euler Angles
  - 2.2 Head Pose Datasets
  - 2.3 Methods for Head Pose Estimation
- 3 Convolutional Neural Networks (CNNs)
  - 3.1 Training with CNNs
    - 3.1.1 Forward Propagation
    - 3.1.2 Backpropagation
  - 3.2 Pre-Processing Image Data
  - 3.3 Regularization Measures
  - 3.4 Residual Networks (ResNets)
- 4 Experiments
  - 4.1 Pre-Processing
  - 4.2 Evaluation Methods
  - 4.3 Reimplementation of Patacchiola and Cangelosi [49]
    - 4.3.1 Network
    - 4.3.2 Results
  - 4.4 Implementation of ResNets
    - 4.4.1 Networks
    - 4.4.2 Results
- 5 Comparison of Approaches
  - 5.1 Comparison of Results

  - 5.2 Comparison of the Number of Trainable Variable Parameters
- 6 Conclusion
- References
- A Dataset Histograms
- B System Specifications
- C Training Loss

## List of Abbreviations

- AI: Artificial Intelligence
- AFLW: Annotated Facial Landmarks in the Wild (dataset)
- AFW: Annotated Faces in the Wild (dataset)
- BN: Batch Normalization
- CNN: Convolutional Neural Network
- H-CNN: Heatmap-Convolutional Neural Network
- LRN: Local Response Normalization
- MAE: Mean Absolute Error
- MLP: Multilayer Perceptron
- PCA: Principal Component Analysis
- POS: Pose from Orthography and Scaling
- RELU: Rectified Linear Unit
- ResNet: Residual Network
- SD: Stochastic Descent
- SGD: Stochastic Gradient Descent
- SOP: Scaled Orthographic Projection

## List of Figures

- 1 Tait-Bryan angles
- 2 Head orientation with yaw, pitch and roll angle [3]
- 3 Example of cropped and scaled Prima dataset images
- 4 Prima dataset measurement setting
- 5 Examples of cropped and scaled AFLW dataset images in grayscale
- 6 Side and frontal view of facial landmarks [31]
- 7 Overview of head pose methods
- 8 Standard CNN architecture [36]
- 9 Representation of convolutional layer during forward propagation
- 10 Hyperbolic tangent function
- 11 Representation of max pooling during forward propagation
- 12 Representation of a gate embedded in a circuit during backpropagation
- 13 Representation of convolutional layer during forward propagation
- 14 Derivative of hyperbolic tangent function
- 15 Representation of dropout in a neural network with two hidden layers [61]
- 16 Training and testing error in plain networks [25]
- 18 ResNets of different depths [25]
- 19 Left: original, right: pre-activated residual block [26]
- 20 Architecture of reimplemented network
- 21 Training losses of five-fold cross-validation on AFLW-64 dataset, reimplementation of Patacchiola and Cangelosi [49], yaw angle
- 22 Implemented residual building block
- 23 Implemented ResNet18
- 24 Confusion matrix of the ResNet18-112 as heatmap, yaw angle
- 25 Confusion matrix of the ResNet18-112 as heatmap, pitch angle
- 26 Confusion matrix of the ResNet18-112 as heatmap, roll angle
- 27 Training losses of five-fold cross-validation on AFLW-64 dataset, ResNet18-64, yaw angle
- 28 Training losses of five-fold cross-validation on AFLW-112 dataset, ResNet18-112, yaw angle

- 29 Training losses of five-fold cross-validation on AFLW-112 dataset, ResNet34-112, yaw angle
- 30 Confusion matrix of the ResNet18-112 as heatmap, tested on the AFW-112 dataset, yaw angle
- 31 Training loss on entire AFLW-64 dataset, ResNet18-64
- 32 Training loss on entire AFLW-112 dataset, ResNet18-112
- 33 AFLW histogram, yaw angle with entire label range of −125° to 168°, with plotted mean (solid line) and std. dev. (dashed lines)
- 34 AFLW histogram, pitch angle with entire label range of ±90°, with plotted mean (solid line) and std. dev. (dashed lines)
- 35 AFLW histogram, roll angle with entire label range of −178° to 179°, with plotted mean (solid line) and std. dev. (dashed lines)
- 36 AFLW-64 histogram, yaw angle with restricted label range of ±100°, with plotted mean (solid line) and std. dev. (dashed lines)
- 37 AFLW-64 histogram, pitch angle with restricted label range of ±45°, with plotted mean (solid line) and std. dev. (dashed lines)
- 38 AFLW-64 histogram, roll angle with restricted label range of ±25°, with plotted mean (solid line) and std. dev. (dashed lines)
- 39 AFLW-112 histogram, yaw angle with restricted label range of ±100°, with plotted mean (solid line) and std. dev. (dashed lines)
- 40 AFLW-112 histogram, pitch angle with restricted label range of ±45°, with plotted mean (solid line) and std. dev. (dashed lines)
- 41 AFLW-112 histogram, roll angle with restricted label range of ±25°, with plotted mean (solid line) and std. dev. (dashed lines)
- 42 AFW histogram, yaw angle with entire label range of −105° to 90°
- 43 AFW-64 histogram, yaw angle with restricted label range of ±100°
- 44 AFW-112 histogram, yaw angle with restricted label range of ±100°
- 45 AFLW-112 histograms with training data distribution of the five-fold cross-validation for ResNet18-112
- 46 Training losses of five-fold cross-validation on AFLW-64 dataset, reimplementation of Patacchiola and Cangelosi [49], pitch angle

- 47 Training losses of five-fold cross-validation on AFLW-64 dataset, reimplementation of Patacchiola and Cangelosi [49], roll angle
- 48 Training losses of five-fold cross-validation on AFLW-64 dataset, ResNet18-64, pitch angle
- 49 Training losses of five-fold cross-validation on AFLW-64 dataset, ResNet18-64, roll angle
- 50 Training losses of five-fold cross-validation on AFLW-112 dataset, ResNet18-112, pitch angle
- 51 Training losses of five-fold cross-validation on AFLW-112 dataset, ResNet18-112, roll angle
- 52 Training losses of five-fold cross-validation on AFLW-112 dataset, ResNet34-112, pitch angle
- 53 Training losses of five-fold cross-validation on AFLW-112 dataset, ResNet34-112, roll angle

## List of Tables

- 1 Overview of head pose datasets
- 2 Comparison of human performance to a nonlinear regression approach with CNNs on the Prima dataset, results are the MAEs
- 3 Overview of recent CNN approaches on the AFLW dataset, results are the MAEs, sorted by the yaw angle
- 4 Input datasets with a restricted label range: yaw (±100°), pitch (±45°), roll (±25°)
- 5 Size of the AFLW dataset during training and testing
- 6 Five-fold cross-validation: mean and standard deviation of the AFLW-112 training datasets, yaw angle
- 7 Parameters of reimplemented network
- 8 Results of reimplementation and original approach [49]
- 9 Parameters of ResNet implementation for convolutional layers
- 10 Parameters of ResNet implementation, see also Table 9
- 11 Results of ResNets tested on the AFLW-64 and AFLW-112 datasets
- 12 Results of ResNets tested on AFW dataset
- 13 Results of the ResNets with 18 layers and of the approach of Patacchiola and Cangelosi [49] on the AFLW and AFW datasets, results are the MAEs, sorted by result on the AFW dataset
- 14 Comparison of results achieved by different methods on the AFLW and AFW dataset, results are the MAEs, sorted by result on the AFW dataset
- 15 Results of the ResNets and the pre-trained networks on the AFLW dataset, results are the MAEs, sorted by result of the yaw angle
- 16 Results of the self-implemented networks on the AFLW dataset in MAE and the number of trainable variable parameters, sorted by parameter number
- 17 System specifications

## 1 Introduction

Head poses are a key element of human bodily communication. When humans interact, much of the communication happens non-verbally through gestures, facial expressions, or gazes. Head poses are an integral part of gestures and can serve to give content-related feedback, indicate the focus of attention, or express emotions. A common form of content-related feedback is nodding or shaking the head, which is, depending on the cultural context, usually interpreted as "yes" or "no" [4, p. 111]. One can also convey spatial information by indicating the location of objects with the head's orientation. The focus of attention can be revealed by a person's head orientation [5] or by their gaze [30]. Furthermore, head poses help to interpret emotions: shame, for example, is often expressed by a lowered head and an averted gaze [30, p. 417].

Since head poses are a core element of non-verbal human communication, they are also important in various contexts of human-computer interaction. In human-robot interaction, multimodal humanoid robots, which are for example used in domestic environments [73], are trained not only for abilities like speech recognition or face- and hand-tracking, but also for head pose estimation, to provide a natural interaction with their users [65]. In the context of driver assistance systems, one of the common use cases of head pose estimation is monitoring the driver's field of view: by observing the head pose, the system can estimate the driver's level of attention and encourage them to keep their eyes on the road [34]. Driver assistance systems also monitor the head poses of surrounding pedestrians with regard to their focus of attention; together with the detection of their position around the car, this helps to avoid collisions [18, 6].
In the field of behavioral studies, head pose estimation allows systems to monitor social interactions [6, 10], detect social groups via surveillance cameras [35], and observe a person's target of interest [38]. In combination with the orientation of the eyes, head poses serve as a basis for gaze prediction [46, 80, 16].

When it comes to estimating head poses with computational systems, one of the most promising methods is the convolutional neural network (CNN) [49]. CNNs are a specialized kind of feed-forward neural network for deep learning in the field of machine learning [20], applied to processing images, videos, speech, or audio [36]. CNNs are successfully used for object recognition (i.e., classification of objects) in images, but as yet

there are only a few approaches that use them to estimate head poses (Sec. 2.3). One of the best approaches using CNNs for head pose estimation is that of Patacchiola and Cangelosi [49], who train networks of various depths on in-the-wild datasets. In-the-wild datasets ensure real-world applicability, and thus the following datasets consisting of images from the image hosting website Flickr¹ are used for training and testing: the Annotated Facial Landmarks in the Wild (AFLW) dataset and the Annotated Faces in the Wild (AFW) benchmark dataset. As a starting point for the experiments, the one of Patacchiola and Cangelosi's [49] networks that performs best on the AFLW dataset is reimplemented and evaluated on the AFLW dataset. Because Patacchiola and Cangelosi [49] examined shallow CNNs based on the LeNet-5 [37], this thesis explores the deeper residual networks (ResNets), guided by the following research questions:

1. How do ResNets of different depths perform on images with different resolutions of the AFLW and AFW datasets?
2. How do the implemented ResNets perform in comparison with the reimplemented network based on the LeNet-5?

For the reimplemented network and the ResNets, the same pre-processing and evaluation methods are applied. The performance is measured in mean absolute error (MAE) and accuracy. Additionally, the number of parameters is compared for the self-implemented networks. In the conducted experiments, ResNets of various depths, adapted for different input sizes, are implemented and then evaluated on the AFLW dataset with a five-fold cross-validation. Furthermore, ResNets with a depth of 18 layers are trained on the entire AFLW dataset and then tested on the AFW dataset. The results of the ResNets are further compared to other competing approaches.

The thesis is organized as follows. Chapter 2 introduces fundamental background information about head pose estimation, i.e. the representation of head poses by Euler angles, the most-used datasets, and an overview of head pose estimation methods. Chapter 3 explains the training of CNNs, including pre-processing and regularization methods, as well as the functionality of ResNets. The realization and results of the conducted experiments are described in Chapter 4, followed by a comparison of approaches in Chapter 5 and the conclusion in Chapter 6.

¹ https://www.flickr.com/, accessed 16.04.2018.

## 2 Head Pose Estimation

This chapter provides relevant background information on head pose estimation by computational systems and is outlined as follows: first, the notation of the Euler angles is explained and the most-used datasets are introduced. Then, an overview of head pose estimation methods, including a reference to human performance, is given.

### 2.1 Representation of Head Poses by Euler Angles

Euler angles generally measure the orientation of a rigid body in a fixed coordinate system [13]. The Tait-Bryan angle notation, a form of the Euler angles commonly used in the aerospace context, is applied here to formally define head poses. In the Tait-Bryan notation, the three angles that describe the object's pose are called yaw, pitch and roll, commonly represented by ψ, θ and φ as in Figure 1. These three angles can be defined by a rotation sequence of three elemental rotations. Figure 1 shows the state of a rotated coordinate system (red: X, Y, Z) after a common intrinsic rotation sequence Z-Y′-X [72]. The axes x, y, and z of the blue coordinate system remain fixed as a reference coordinate system. An intrinsic rotation is a rotation about the local axes, which pass through the geometric center of the object. Thus, the object rotates about the axes of the rotating system: in this sequence, first about the Z-axis, then about the former Y-axis, now N (y′), and lastly about the X-axis, the axes of the system themselves changing after each rotation. While the green axis N (y′) represents the position of the Y-axis after the rotation about the Z-axis, the green axis N⊥ represents the X-axis after the rotation about the N (y′)-axis.
After the three elemental rotations, the yaw angle ψ lies between the y-axis and the N (y′)-axis, the pitch angle θ between the N⊥-axis and the current X-axis, and the roll angle φ between the N (y′)-axis and the current Y-axis. Figure 2 depicts the head as an intrinsically rotated object with orientations of the yaw, pitch and roll angles.
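The intrinsic Z-Y′-X sequence described above can be written as a product of three elemental rotation matrices. The following is a minimal sketch of this composition; the function and variable names are my own, not taken from the thesis:

```python
import numpy as np

def head_rotation(yaw, pitch, roll):
    """Compose the intrinsic Z-Y'-X rotation matrix from Tait-Bryan
    angles given in degrees (psi = yaw, theta = pitch, phi = roll)."""
    psi, theta, phi = np.radians([yaw, pitch, roll])
    Rz = np.array([[np.cos(psi), -np.sin(psi), 0.0],
                   [np.sin(psi),  np.cos(psi), 0.0],
                   [0.0, 0.0, 1.0]])                      # about Z (yaw)
    Ry = np.array([[np.cos(theta), 0.0, np.sin(theta)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(theta), 0.0, np.cos(theta)]])  # about Y' (pitch)
    Rx = np.array([[1.0, 0.0, 0.0],
                   [0.0, np.cos(phi), -np.sin(phi)],
                   [0.0, np.sin(phi),  np.cos(phi)]])      # about X'' (roll)
    # Intrinsic rotations compose by right-multiplication.
    return Rz @ Ry @ Rx

R = head_rotation(30.0, 0.0, 0.0)  # pure yaw of 30 degrees
print(np.round(R, 3))
```

Because the rotations are intrinsic (each about the already-rotated axes), the matrices multiply in the order Rz @ Ry @ Rx; the extrinsic sequence about the fixed axes would reverse this order.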

Figure 1: Tait-Bryan angles²

Figure 2: Head orientation with yaw, pitch and roll angle [3]

Ferrario et al.'s [15] research shows the average head movement range of healthy young adults. The mean ranges calculated from the data of thirty men and thirty women are the following:

- yaw angle: −79.8° to +75.3°
- pitch angle: −60.4° to +69.6°
- roll angle: −40.9° to +36.3°

² Title: Taitbrianzyx.svg, Author: Juansempere, Source: https://commons.wikimedia.org/wiki/File:Taitbrianzyx.svg, accessed 25.02.2018, Licence: Creative Commons Attribution 3.0 Unported Licence (https://creativecommons.org/licenses/by/3.0/deed.en, accessed 25.02.2018).
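Reported ranges like these can double as a rough sanity check on pose annotations, which becomes relevant when dataset label ranges are discussed later. A minimal sketch, where the helper and its name are illustrative additions of mine rather than part of the thesis:

```python
# Mean movement ranges (degrees) as reported by Ferrario et al. [15].
PLAUSIBLE_RANGE = {
    "yaw":   (-79.8, 75.3),
    "pitch": (-60.4, 69.6),
    "roll":  (-40.9, 36.3),
}

def is_plausible(angle_name: str, value_deg: float) -> bool:
    """Flag whether an annotated angle falls inside the mean
    movement range for healthy young adults."""
    lo, hi = PLAUSIBLE_RANGE[angle_name]
    return lo <= value_deg <= hi

print(is_plausible("yaw", 60.0))    # True
print(is_plausible("roll", -55.0))  # False: beyond the mean roll range
```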

The numbers indicate that the flexibility of the head seems to vary depending on the direction of the movement. Since the yaw angle has the widest range, it is often used to compare head pose estimation approaches.

### 2.2 Head Pose Datasets

The quality of the dataset is one of the most critical aspects when training deep neural networks. In the following, the datasets most recently used for head pose estimation with CNNs are described; Table 1 shows a summarized overview. Since the trained systems should be robust enough for real-life situations, the trend is to use real-life images for training and testing. Two of the described datasets (AFLW and AFW), the ones used in this thesis' approach, are of this nature and thus called in-the-wild datasets.

Table 1: Overview of head pose datasets

| Name | Yaw | Pitch | Roll | Number of Faces | Annotation Process |
| --- | --- | --- | --- | --- | --- |
| Prima | ±90°, 15° steps | ±90°, 15° steps | not annotated | 2,790 | subjects looked at degree markers |
| Biwi Kinect | ±75°, cont. values | ±60°, cont. values | not annotated | 15,678 | faceshift software |
| Annotated Facial Landmarks in the Wild (AFLW) | −125° to 168°, cont. values | ±90°, cont. values | −178° to 179°, cont. values | 25,993 | POSIT algorithm (manually annotated landmarks) |
| Annotated Faces in the Wild (AFW) | −105° to 90°, 15° steps | −45° to 30°, 15° steps | −15° to 15°, 15° steps | 468 | manually annotated |

**Prima dataset [21]** The Prima dataset consists of 2,790 monocular facial images extracted from videos. The dataset is downloadable online and can be used for any purpose, provided a source reference is given.³ Face coordinates for each image are stored in a separate text file. The dataset covers 15 subjects of ages 20 to 40; five subjects have facial hair and seven wear glasses. Every image is annotated with the yaw and pitch angles in the range of ±90°. The yaw angle is annotated throughout the range in 15° steps. The pitch angle's annotation is split: the range ±30° is annotated in 15° steps, while the rest of the range is annotated in 30° steps. There are no images where the pitch angle is +90° or −90°, except in cases where the yaw angle is 0°. Consequently, there are 93 head poses available for each person. Figure 3 shows example images of one person with a yaw angle ranging from 0° to 90° in 15° steps and a constant pitch angle of 0°.

Figure 3: Example of cropped and scaled Prima dataset images

³ http://www-prima.inrialpes.fr/perso/Gourier/Faces/HPDatabase.html, accessed 22.01.2018.
⁴ http://www-prima.inrialpes.fr/perso/Gourier/Faces/HPDatabase.html, accessed 22.01.2018.

A second series with different lighting makes it possible to test on known and unknown faces.⁴ For testing on known faces, a two-fold cross-validation is suggested: one fold contains the original images, while the other fold contains the images with changed lighting. For testing on unknown faces, the Jack-Knife (also called leave-one-out) algorithm is suggested: all images of one person are completely left out of training and used for testing only. Thus, for this dataset the Jack-Knife algorithm sums up to

15 training iterations in total.

The measurement of the different poses in the Prima dataset was achieved by the participants successively focusing on post-its distributed in the room, which marked the different yaw and pitch angles in 15° steps (Fig. 4). The participants were filmed in front of a neutral background from a distance of two meters and were asked not to move their eyes. Fixating one marker after the other in a very exact manner is a challenging task, but there is no mention of further evaluation methods to confirm the accuracy of the directional measurement. Thus, it can be assumed that the dataset is error-prone, presumably with poor uniformity across subjects in the same pose.

Figure 4: Prima dataset measurement setting (a: yaw markers, b: pitch markers)

The Prima dataset is used as a benchmark in various head pose estimation approaches [71, 76, 50, 22]. In addition, the results of persons estimating the head poses of this dataset [22] are provided in Section 2.3.

**Biwi Kinect dataset [14]** This dataset contains 15,678 images of 20 people with a yaw range of ±75° and a pitch range of ±60°. The dataset can be downloaded freely

for research purposes.⁵ The images provide depth and RGB data. The people were filmed while turning their heads freely in the yaw and pitch angles. The frames of the video were annotated with the yaw and pitch angles using the automatic system faceshift.⁶ Faceshift is facial motion capture software: it can capture and describe a person's facial movement, head pose and eye gaze. This information is used, for example, to animate virtual characters in movies and games. The Biwi Kinect database is often used in approaches that consider depth data [44, 75]. Some of these approaches are covered in Section 2.3.

**Annotated Facial Landmarks in the Wild (AFLW) dataset [31]** The focus of the AFLW dataset is to provide a large variety of different faces (ethnicity, pose, expression, age, gender, occlusion) in front of natural backgrounds and lighting conditions. The images in the dataset were extracted from the image hosting website Flickr.⁷ The dataset contains 25,993 annotated faces in 21,997 images (Fig. 5). The faces are annotated with up to 21 facial landmarks (Fig. 6) and head pose information. The facial landmarks were marked manually, depending on visibility. Face coordinates for cutting out the faces are provided. 56% of the faces are tagged as female and 44% as male. Koestinger et al. [31] state that the rate of non-frontal faces, 66%, is higher than in any other dataset. The dataset is well suited for multi-view face detection, facial landmark localization and head pose estimation. The metadata for the images are stored in a SQL database. For downloading the dataset and database, a registration via email is required.⁸ The distribution of poses in the AFLW dataset is not uniform, with very few images at the lower and higher degrees. The label range for the yaw angle is from −125.1° to 168.0°, for the pitch angle ±90°, and for the roll angle from −178.2° to 179.0°.
⁵ https://data.vision.ee.ethz.ch/cvl/gfanelli/head_pose/head_forest.html, accessed 02.02.2018.
⁶ http://faceshift.com/studio/2015.2/, accessed 07.04.2018.
⁷ https://www.flickr.com/, accessed 07.04.2018.
⁸ https://www.tugraz.at/institute/icg/research/team-bischof/lrs/downloads/aflw/, accessed 22.01.2018.

The means of the yaw (1.9°) and roll (1.0°) angles are close to 0°, while the mean of the

pitch angle (−8.1°) shows that this angle's distribution is shifted to the left. The yaw angle has a standard deviation of ±41.8°, the pitch angle of ±13.4° and the roll angle of ±14.0°. Figures 33, 34 and 35 in Appendix A depict the distributions of the yaw, pitch and roll angles as histograms.

Figure 5: Examples of cropped and scaled AFLW dataset images in grayscale

Figure 6: Side and frontal view of facial landmarks [31]

The head poses of the AFLW dataset were computed with the POSIT algorithm. This algorithm takes as input an initial 3D model and 2D images annotated with facial landmarks. Koestinger et al. [31] use a 3D mean model [66] of the frontal face as the initial 3D model. For pose estimation, the facial landmarks of the 2D image are fitted onto the 3D model, and the error between the 3D model points and the 2D image points is minimized with the help of the POSIT algorithm [12]. The POSIT algorithm needs at least four non-coplanar reference points on the 3D model, their corresponding points in the 2D image, and the focal length of the camera as input. The POSIT algorithm itself consists of two steps. The first step, called Pose from Orthography and Scaling (POS), computes the approximate scaled projection and the scaled orthographic projection (SOP). The rotation matrix and the translation vector are computed from the resulting linear system. The two rows of the rotation matrix, i and j, and the z-coordinate of the translation vector provide the angles of the pose in continuous values. The second

step of the algorithm consists of a few iterations that improve the approximate pose of the first step. The algorithm computes the next SOPs from the pose of the previous step; consequently, the feature points are shifted closer to the correct position each time. These new SOPs are then used again as input to the POS algorithm, and so on. The algorithm usually ends after four or five iterations.

Koestinger et al. [31] note in their paper that the resulting poses from the POSIT algorithm were not manually verified and that they regard the poses rather as rough estimates. The angle distribution of the dataset could be an indicator of errors in the pose annotation: as described in this section, the AFLW dataset has extremely wide ranges for all angles, which far exceed the average head movement ranges [15] described in Section 2.1. Thus, the poses near the borders of the ranges are not realistic head orientations. It is pointed out on the website⁹ that the dataset is continuously improved. In response to an email inquiry, the administrator provided the information that the image annotation underwent a major improvement at the beginning of 2012. Unfortunately, the administrator did not have any further information on the improvement, and to the best of my knowledge no further papers were released on that matter.

The AFLW dataset is often used in combination with the AFW dataset as a benchmark in the challenging field of head pose estimation in the wild [49, 33, 55, 67, 27, 70, 1, 82]. This dataset is also used for the following experiments (Ch. 4).

**Annotated Faces in the Wild (AFW) dataset [82]** The AFW dataset has similar features to the AFLW dataset, since its images are also extracted from the image hosting website Flickr.¹⁰ It was proposed by Zhu and Ramanan to validate their face detection and head pose estimation methods with natural images [82].
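Returning briefly to the POSIT annotation procedure described above: under a purely scaled-orthographic camera, the POS step reduces to two linear least-squares problems whose solutions yield the first two rows of the rotation matrix. The sketch below illustrates this on a synthetic scene; all names and the setup are my own, and this is not the implementation used by Koestinger et al. [31]:

```python
import numpy as np

def pos_step(model_pts, image_pts):
    """Recover rotation and scale from >= 4 non-coplanar 3D model
    points and their 2D projections under the scaled orthographic
    projection (SOP) model."""
    A = model_pts[1:] - model_pts[0]            # 3D offsets from reference point
    bx = image_pts[1:, 0] - image_pts[0, 0]     # image offsets, x
    by = image_pts[1:, 1] - image_pts[0, 1]     # image offsets, y
    I, *_ = np.linalg.lstsq(A, bx, rcond=None)  # solves A @ I = bx
    J, *_ = np.linalg.lstsq(A, by, rcond=None)  # solves A @ J = by
    s = (np.linalg.norm(I) + np.linalg.norm(J)) / 2.0  # SOP scale
    i, j = I / np.linalg.norm(I), J / np.linalg.norm(J)
    R = np.vstack([i, j, np.cross(i, j)])       # third row from orthogonality
    return R, s

# Synthetic ground truth: rotation about the z-axis by 30 degrees.
c, si = np.cos(np.pi / 6), np.sin(np.pi / 6)
R_true = np.array([[c, -si, 0.0], [si, c, 0.0], [0.0, 0.0, 1.0]])
pts3d = np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.],
                  [0., 0., 1.], [1., 1., 1.]])  # non-coplanar model points
proj = 2.0 * (pts3d @ R_true.T)[:, :2] + np.array([5.0, 3.0])  # SOP image

R_est, s_est = pos_step(pts3d, proj)
print(np.allclose(R_est, R_true), round(s_est, 3))
```

In the full POSIT procedure this exact recovery only holds approximately, since real images are perspective projections; the subsequent iterations correct the feature positions toward the true perspective pose.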
The download of the dataset includes the code and the proposed model of Zhu and Ramanan.¹¹ The AFW dataset shows a wide variety of ethnicity, pose, expression, age, gender and occlusion. The faces are positioned in front of natural, cluttered backgrounds.

⁹ https://www.tugraz.at/institute/icg/research/team-bischof/lrs/downloads/aflw/, accessed 22.01.2018.
¹⁰ https://www.flickr.com/, accessed 07.04.2018.
¹¹ https://www.ics.uci.edu/~xzhu/face/, accessed 22.01.2018.

There are

468 faces in 205 images. Each face is labeled with a bounding box so that the faces can be extracted from the images. They are manually annotated with six landmarks (centers of the eyes, tip of the nose, the two corners and the center of the mouth) and the head pose angles pitch, yaw and roll. The yaw angle has a range from −105° to 90°, the pitch angle from −45° to 30°, and the roll angle from −15° to 15°, all annotated in 15° steps. Compared to the Prima and AFLW datasets, the negative and positive head pose labels are interchanged. Since the yaw angle has the widest range, this angle is normally used when testing with this dataset. The yaw angle of the dataset has a mean of 2.6°, which shows that the dataset is centered closely around zero degrees, and a standard deviation of ±38.4°. The yaw's mean and standard deviation are similar to those of the AFLW dataset (1.9° ± 41.8°). A visualization of the image distribution is given in Figure 42 in Appendix A.

### 2.3 Methods for Head Pose Estimation

This section presents the most commonly used computational methods for head pose estimation (Fig. 7b-d) and the human performance in head pose estimation (Fig. 7a). The human performance is presented as a real-life reference for the computational methods. The computational methods used for head pose estimation can be categorized into appearance-based methods (Fig. 7b), model-based methods (Fig. 7c), and nonlinear regression methods (Fig. 7d).

**Human performance** To the best of my knowledge, Gourier et al. [22] are the only ones who have measured human performance on a dataset that is also used for computational head pose estimation. For their experiment, they employ a nonlinear regression method trained on the Prima dataset and compare its results with human performance on the same dataset. Gourier et al. [22] worked with greyscale images scaled to 23 x 30 pixels.
They measured the human performance with a group of 72 subjects, consisting equally of female and male subjects in the age range from 15 to 60 years. Images from the Prima dataset were presented to the subjects for 7 seconds in a random order and the subjects had to choose the right yaw and pitch angle from a set of preselected answers.

(a) Human performance (adapted from [46])  (b) Appearance-based [46]  (c) Model-based [46]  (d) Nonlinear Regression [46]

Figure 7: Overview of head pose methods

Gourier et al. [22] note that the psycho-physical basis for humans to estimate head poses from static images is unknown. Hence, to find out if humans need training to recognize head poses from static images, they split the subjects into two groups, where one group was trained to recognize head poses from static images beforehand and the other group was not. The evaluation of the two groups shows a significant performance difference in the pitch angle but not in the yaw angle. A possible explanation is that the yaw angle provides a sufficient indication of the other person's focus when engaging in a conversation, whereas the pitch angle is less important, even more so when people are sitting with their heads at approximately the same height [63]. In summary, humans seem to be better at recognizing the yaw angle ranges. The results for the pitch angle show that the group with training had a 3.2° lower MAE than the group without training. For both angles, the front and profile angles were recognized best.

Table 2 provides a comparison of the human performance and the main competing approach [49] of this thesis. Patacchiola and Cangelosi [49] employed a CNN for training, but with a higher image resolution (64 x 64 pixels) than Gourier et al. [22]. The results for the human performance (Table 2) measured by Gourier et al. [22] are the average values of the two groups. Considering the constraints of the static images and their low image resolution, the results of the human performance can only be seen as an indication, not as a measure of the general human estimation ability.

Table 2: Comparison of human performance to a nonlinear regression approach with CNNs on the Prima dataset, results are the MAEs

Approach                        Network             Yaw     Pitch
Gourier et al. [22]             human performance   11.9°   11°
Patacchiola and Cangelosi [49]  based on LeNet-5    7.74°   10.57°

Appearance-based methods This method compares head images with already annotated exemplary heads. The exemplary heads are annotated with discrete poses and ordered in a systematic structure. To estimate the head pose of a new image, the head's view is compared to the pool of heads in order to find the most similar view [46] (Fig. 7b). There are different metrics for comparing a new head's view to the set of examples. Beymer uses a mean squared error over different subportions of the image [7]. In his approach, the hierarchical feature finder first detects the two eyes and at least one nose feature in the input image. On the basis of the feature positions, the pose is searched hierarchically in a coarse-to-fine strategy in the pyramid-ordered head pool. The possible poses need to pass a thresholding correlation test, where different subportions of the image are evaluated and thus lead to the supposedly best fitting pose. Niyogi and Freeman [47] learn to map input images non-linearly to parametric descriptions. To estimate the pose of a new input image they use a non-parametric estimation technique, the so-called Vector Quantization, which is closely related to the nearest neighbor algorithm, but with a vector as output.

These appearance-based methods have the advantage of a relatively simple implementation. Moreover, the templates used for comparison can easily be expanded at any time, which allows the system to adapt to new conditions. The main drawback of appearance-based methods is that the whole concept is based on the premise that similar images also have similar head poses. But when there are images of different persons displaying similar head poses, the impact of identity could outweigh the impact of a different pose, which makes template matching difficult [46]. Some researchers address the problem of incorrect template matching. One approach is to use Gabor wavelets [58, 78]. The Gabor wavelets extract the global and local features from the image while preserving the structural frequency information [42]. The images are then convolved with related kernels. Wu and Trivedi [78] use the Gabor wavelets for feature detection on their input images. They use the extracted features for a nearest-prototype matching by measuring the Euclidean distance. Another strategy developed to overcome this problem are Laplacian-of-Gaussian filters. In preparation, the image is first smoothed with a Gaussian kernel to decrease its noise sensitivity. The Laplacian filters then detect areas of rapid intensity change and can thus identify edges in images. The Laplacian filters bring out common facial features and diminish distinctive facial features [19].
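The nearest-prototype matching by Euclidean distance mentioned above can be sketched in a few lines. This is a toy illustration with made-up feature vectors and function names, not the implementation of Wu and Trivedi [78]:

```python
import numpy as np

def estimate_pose_nearest(query, templates, template_poses):
    """Return the pose annotation of the template closest to the query.

    query:          1D feature vector of the new head image
    templates:      2D array, one flattened template feature vector per row
    template_poses: one (yaw, pitch) annotation per template
    """
    distances = np.linalg.norm(templates - query, axis=1)  # Euclidean distances
    return template_poses[int(np.argmin(distances))]

# Toy pool of three annotated "templates" (flattened feature vectors)
templates = np.array([[0.0, 0.0, 0.0],
                      [1.0, 1.0, 1.0],
                      [0.9, 1.1, 1.0]])
poses = [(-90, 0), (0, 0), (15, 0)]
print(estimate_pose_nearest(np.array([0.92, 1.08, 1.0]), templates, poses))  # (15, 0)
```

The same scheme works with any feature extractor; the Gabor-wavelet approaches simply replace the raw pixel vectors with wavelet responses before measuring the distance.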
However, due to the serious limitations of appearance-based methods, recent approaches in head pose estimation tend to use other methods.

Model-based methods This method uses geometric measurements, facial key-point locations and averaged models of the face. The model-based approach is based on the assumption that humans estimate head poses from symmetry deviation. For humans, the most important indications are the movement of the head from one side to the other and the upward and downward movement of the nose [77].

One common model-based approach is to estimate the head pose by computing the geometric relation of facial key-points to a mapped facial model [43, 17]. It is important to choose a few reliable key-points which can be reused over a range of different faces. The facial model is created by averaging the structure of faces. Gee and Cipolla [17] identify the eyes, the nose and the corners of the mouth as reliable facial key-points (Fig. 7c). The symmetry axis of these five points is the connection between the midpoint of the line between the eyes and the center of the mouth. They estimate the head pose with the 3D information of the nose position by assuming a fixed nose length. This is similar to the POSIT algorithm (Sec. 2.2), which operates with key-points of a 2D model and an averaged 3D model, where the key-points of the 2D model are fitted onto the 3D model to estimate the head pose. Another common model-based approach is to compare the facial key-points with an underlying geometric coordinate system instead of an averaged facial model. The head pose is estimated from the distance of the facial key-points to the reference coordinate system [39]. The drawback of this method is that it requires a very high precision for certain angles of the face in order to obtain an accurate head pose estimation. The overall advantage of model-based methods is that the head pose can be obtained with only a few facial key-points. The drawback is that the accuracy of the key-point detection is closely linked to the accuracy of the head pose estimation. Especially occlusions can interfere with the detection of facial key-points. For example, if a person wears glasses, the eyes cannot be detected properly, but the position of the eyes is important for head pose estimation with model-based methods. In contrast to the appearance-based approach, where the whole face is used for comparison, the model-based approach only uses the key-points.
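As a minimal illustration of the key-point idea (not the specific method of [17] or [39]), the roll angle can be read directly off the line connecting the two eye centers. The function name and coordinates below are made up; note that the sign convention depends on whether the image y-axis points up or down:

```python
import math

def roll_from_eye_centers(left_eye, right_eye):
    """Estimate the roll angle in degrees from the two eye-center key-points.

    A level eye line yields a roll of zero; the sign follows the image
    coordinate system (here: y grows downwards).
    """
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.degrees(math.atan2(dy, dx))  # angle of the eye line

print(roll_from_eye_centers((100, 120), (160, 120)))  # level head: 0.0
print(roll_from_eye_centers((100, 120), (160, 150)))  # tilted head: ~26.6
```

This also makes the occlusion problem discussed above concrete: if glasses or hair hide one eye center, this estimate is not computable at all.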
The model-based approach is therefore in general more sensitive to occlusions and less robust to detection failures than the appearance-based approach.

Nonlinear regression methods There are different techniques to learn a nonlinear mapping from head images to pose angles (Fig. 7d). The prerequisite for good accuracy with this method is training on a consistent and labeled dataset of sufficient size. Examples of datasets used with nonlinear regression methods are introduced in Section 2.2.

In earlier years, multilayer perceptrons (MLPs) and support vector regression were mainly used as nonlinear regression methods. Schiele and Waibel [57] realized in 1995 one of the first approaches with MLPs for learning the yaw angle of the head orientation. They used the angle for identifying the focus of attention. Their MLP network had one hidden layer (50 neurons) and was trained with standard backpropagation, with which they achieved good results. Rainer Stiefelhagen [64] also used an MLP network with one hidden layer for learning the pitch and yaw angles. He achieved good results on the Prima dataset (Sec. 2.2). As another regression method, support vector regression was successfully used in combination with dimensionality-reducing methods such as principal component analysis [40] and localized gradient orientation histograms [45].

Nonlinear regression with CNNs Around 2007, research papers using CNNs emerged in the field of head pose estimation. The advantage of CNNs in contrast to MLPs is their high tolerance to shift and distortion variance. One of the first approaches for head pose estimation with CNNs was proposed by Osadchy et al. [48]. They describe a real-time method for simultaneous face detection and head pose estimation, as these two tasks are closely related. Their approach is based on energy-based models and CNNs. The CNN involved is similar to LeNet-5 [37], but has more feature maps. LeNet-5 was first published by LeCun [37], who is also one of the authors of the aforementioned paper [48]. Osadchy et al. [48] collected and annotated 30,000 images under laboratory conditions and trained the yaw angle and the in-plane rotation. Considering an error of 15°, they report over 80% accuracy in the yaw angle range and 95% accuracy for the in-plane rotation when testing on standard datasets of that time. Unfortunately, they do not compare their results to other approaches.
Ad- ditionally, they do not share the mean absolute error, which is a more accurate metric for comparison. In 2014, Ahn et al. [2] reported the best result until then on the Biwi Kinect head pose dataset (Sec. 2.2). They used a CNN of four convolutional layers and two fully connected layers, which did not consider the depth, but outperformed other approaches in speed and accuracy that used the depth information such as Fanelli et al. [14], who used random forest regression. 16

Other CNN approaches considered the depth information in RGB images. Mukherjee [44] and Venturelli et al. [75] use images from in-car scenarios. Mukherjee [44] trained the RGB and depth data separately on GoogLeNet [69] models. Venturelli et al. [75] trained with a shallower network of five convolutional layers and three fully connected layers and directly included the depth data. Both approaches used one network to train all three angles and tested it on the Biwi Kinect dataset for comparison. Venturelli et al. [75] outperformed Mukherjee's [44] approach.

There are three recent approaches which tested their CNNs on the AFLW and AFW datasets [55, 33, 49] (Table 3). The first approach is from Ruiz et al. [55]. Their architecture is based on the ResNet50 architecture, combined with three mean squared error and cross entropy losses, one for each head pose angle. For testing on the AFLW dataset, they use a ResNet with 50 layers (ResNet50), which they pre-trained on the ImageNet dataset12 and finetuned on the AFLW dataset. However, it is not clear what fraction of the AFLW dataset they used for testing. They do not mention cross-validation, which makes it less comparable to the approach of this thesis. They also test on the AFW dataset, with a ResNet50 model trained on the synthesized 300W-LP dataset [81]. According to Ruiz et al. [55], the pre-trained ResNet50 achieves very good results on the AFW dataset. But these results are not comparable to the approach of this thesis (Ch. 4), where the AFW dataset is tested on networks trained on the entire AFLW dataset.

Kumar et al. [33] present a novel architecture called Heatmap-CNN (H-CNN), which can learn local and global structural dependencies. They use it for learning key-points and also learn the head poses as a by-product. The H-CNN architecture contains Inception modules [69], which are based on a similar architectural idea as the residual building blocks of the ResNet (Sec. 3.4). Kumar et al. [33] mention finetuning on the AFLW dataset, but provide no further information on the pre-training. As in the approach of Ruiz et al. [55], it is not clear what fraction of the AFLW dataset they use for testing and whether they employed cross-validation. This makes it less comparable to the approach of this thesis on the AFW dataset. Kumar et al. [33] also test their trained network on the AFW dataset. They display the result in a cumulative error distribution, which shows very good results for their approach. Unfortunately, they do not provide the result as mean absolute error (MAE), so their result on the AFW dataset is not comparable to the following experiments (Ch. 4).

The last of the three recent approaches stems from Patacchiola and Cangelosi [49]. They evaluate different adaptive gradient methods in combination with four architectures of varied layer complexity. Patacchiola and Cangelosi [49] trained on the Prima dataset as well as on the AFLW dataset and used the AFW as a test dataset for a model trained on the whole AFLW dataset. Their best network trained on the AFLW dataset has three convolutional layers and two fully connected layers and is structurally based on the LeNet-5 architecture [37]. This work includes a reimplementation of their best network trained on the AFLW dataset (Sec. 4.3).

Table 3: Overview of recent CNN approaches on the AFLW dataset, results are the MAEs, sorted by the yaw angle

Approach                        Network                          Yaw     Pitch   Roll
Patacchiola and Cangelosi [49]  based on LeNet-5                 9.51°   6.8°    4.15°
Kumar et al. [33]               Pre-trained H-CNN                6.45°   5.85°   8.75°
Ruiz et al. [55]                Pre-trained multi-loss ResNet50  6.26°   5.89°   3.82°

12 http://www.image-net.org/, accessed 22.01.2018.

3 Convolutional Neural Networks (CNNs)

Focusing on object recognition in images, this chapter explains the architecture and training behavior of classic CNNs and introduces various pre-processing and regularization techniques. Furthermore, the residual network architecture and its differences from classic CNN architectures are characterized.

3.1 Training with CNNs

There are two information flows while training CNNs: one from the input to the output of the network, called forward propagation (Sec. 3.1.1), and one from the output to the input, called backpropagation (Sec. 3.1.2). Forward propagation and backpropagation are explained in the following with the help of a classic CNN architecture as shown in Figure 8. The architecture of a classic CNN includes an input layer, an output layer and several hidden layers in between (Fig. 8). The layers between the input layer and the output layer are called hidden layers, because they are not accessible from the outside of the network. The three basic types of hidden layers are the convolutional layer, the activation layer and the pooling layer [20, p. 336]. Normalization layers like batch normalization or local response normalization are also often used as hidden layers (Sec. 3.3). The classic CNN depicted in Figure 8 has a convolutional layer as input layer, sets consisting of convolutional, activation and pooling layers as hidden layers, and a fully connected layer as output layer. The output of the last layer is used for classification. The input of the CNN is an RGB image of a Samoyed dog. Because the image has three color channels (red, green and blue), the input image has a depth of three. The output class Samoyed has the highest probability, thus the image is classified correctly. The layers and their functionalities are explained in the following.
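How the convolutional and pooling layers just introduced successively shrink the feature maps can be traced with the standard output-size formula o = (i − f + 2p)/s + 1 (input size i, filter size f, zero padding p, stride s). The formula and the LeNet-5-like layer sizes below are standard assumptions used purely for illustration, not values taken from Figure 8:

```python
def conv_output_size(i, f, p=0, s=1):
    """Spatial output size of a convolution or pooling layer:
    input size i, filter size f, zero padding p, stride s."""
    return (i - f + 2 * p) // s + 1

# A classic stack: 32 x 32 input, two rounds of 5 x 5 convolution,
# each followed by 2 x 2 max pooling with a stride of two.
size = 32
size = conv_output_size(size, f=5)        # conv: 28
size = conv_output_size(size, f=2, s=2)   # pool: 14
size = conv_output_size(size, f=5)        # conv: 10
size = conv_output_size(size, f=2, s=2)   # pool: 5
print(size)  # 5
```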

Figure 8: Standard CNN architecture [36]

3.1.1 Forward Propagation

The images are fed into the network as 2D arrays, i. e. matrices, which contain the values of the image pixels. Most networks require that all matrices are of the same size, with the height equaling the width. The convolutional layers have a number of kernels defined, the so-called filters, which convolve the 2D input to extract features. The weights of the filters are initialized at the beginning of the training and adjusted during training through backpropagation. The filters need to be of smaller size than the input, but with the same depth. They slide like a window with a defined stride over the input. Hereby, the weight stays the same and is therefore also called a shared weight [36]. The stride defines the number of pixels the filter moves horizontally or vertically in one step. A neural network can also use a bias in addition to the weights [36]. Then, a single bias with a constant value is added to all nodes of each layer. This helps to shift the activation functions to the left or right during training. The bias is also updated in backpropagation [56].

Figure 9 shows the convolution of the input matrix X (size 3 x 3) with the filter W (size 2 x 2) to the output H (size 2 x 2). To simplify the representation, the bias is omitted. As required, the filter W is smaller than the input X. The filter W slides over X with a defined stride of one pixel, as indicated with the red and blue rectangles in the representation of the input X.

Figure 9: Representation of a convolutional layer during forward propagation (adapted from [20, p. 330])

Each pixel of the 2 x 2 pixel units of input X is multiplied with the corresponding weight of the filter W. The multiplied pixel values are added up as shown in the fields of output H. In this example, input X has a depth of one. If the input is deeper, the filter has the same depth [36]. Figure 9 depicts an example where the input size is not preserved during the convolution. The output size is smaller, because not all pixels can be covered with the sliding filter W of stride one. Both the size of the filter and the stride affect the output size. To preserve the input size throughout the convolution, it is common to use zero padding, which expands the matrix by adding zeros [20, p. 349]. One filter produces one output matrix, which is called a feature map [36]. Feature maps of lower convolutional layers detect mostly low-level features like edges. Feature maps of subsequent layers can detect more complex features like combinations of edges and even combinations of features that resemble object parts. The objects are then classified as a learned combination of these parts.

The layer following the convolutional layer contains an activation function. Most of these functions introduce non-linearity by applying a fixed mathematical function to each value. The graph in Figure 10 shows the hyperbolic tangent activation function tanh(x), which will also be used in the following experiments (Ch. 4). "Hyperbolic" refers to plane geometry, where the hyperbola is a special curve consisting of two symmetrical branches extending into infinity. tanh(x) is a continuous function that computes values in the range of [−1, 1]. The function is defined by the ratio of the hyperbolic sine and cosine functions (Eq. 1).

    tanh(x) = sinh(x) / cosh(x) = (e^x − e^(−x)) / (e^x + e^(−x))    (1)

Figure 10: Hyperbolic tangent function

The pooling layer is placed after the activation layer and reduces the size of the feature map. Similar features are hereby merged into one feature [36]. Figure 11 shows the max pooling method, which picks the highest value from each position of the window. The window of size 2 x 2 slides over input X with a stride of two. The red and the blue square depict the first and second position of the window. Hence, the max value of the red square is the first value in the output matrix H. The output H is half the size of the input X, because a stride of two is used. Another widely used pooling method is average pooling, where the average value of the pixels in the respective window is calculated.
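The forward pass through one convolutional, activation and pooling layer, as illustrated in Figures 9-11, can be sketched with small made-up matrices. This is a toy sketch of the mechanics, not the thesis implementation:

```python
import numpy as np

def conv2d(x, w, stride=1):
    """Valid convolution of input x with filter w, as in Figure 9:
    slide w over x, multiply element-wise and add up (bias omitted)."""
    out_h = (x.shape[0] - w.shape[0]) // stride + 1
    out_w = (x.shape[1] - w.shape[1]) // stride + 1
    h = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i * stride:i * stride + w.shape[0],
                      j * stride:j * stride + w.shape[1]]
            h[i, j] = np.sum(patch * w)
    return h

def max_pool(x, size=2, stride=2):
    """Max pooling as in Figure 11: keep the highest value per window."""
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    h = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            h[i, j] = np.max(x[i * stride:i * stride + size,
                               j * stride:j * stride + size])
    return h

x = np.arange(16, dtype=float).reshape(4, 4)  # toy 4 x 4 input image
w = np.array([[1.0, 0.0],
              [0.0, 1.0]])                    # toy 2 x 2 filter
feature_map = np.tanh(conv2d(x, w))           # convolution, then tanh activation
pooled = max_pool(feature_map)                # 3 x 3 feature map -> 1 x 1 output
print(feature_map.shape, pooled.shape)        # (3, 3) (1, 1)
```

Real implementations vectorize these loops and handle many filters and input channels at once, but the arithmetic per output value is exactly the one shown here.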

Figure 11: Representation of max pooling during forward propagation

As the pooling layer reduces the size of the feature map, the number of filters usually increases. This design principle was introduced in the pioneering architecture LeNet-5 [36]. The VGGNet [59] uses and extends this design principle in the following ways: (1) If the size of the feature map in the VGGNet stays the same, the number of filters also remains unchanged. (2) If the size of the feature maps is halved, the number of filters doubles. These design principles are also adapted by residual networks (Sec. 3.4). The fully connected layers are placed at the end of the network to identify the most important features for classifying the object. They connect every input neuron to every neuron of the next layer. This can be compared to the behavior of hidden layers in an MLP [53]. Fully connected layers need 1D arrays as input. Thus, the 2D outputs of the pooling layer are flattened into 1D arrays, which also removes the spatial relations. If more than one fully connected layer is employed, activation functions introduce non-linearity between them. The last fully connected layer is the output layer, whose size corresponds to the number of output classes.

3.1.2 Backpropagation

Backpropagation is used to minimize the computed loss, i. e. the error of the network, by updating the filter weights with the gradient information [56]. The goal is to find the global minimum with the lowest possible error. The gradient vector indicates for each weight by how much the error of the network would de- or increase if the weights were changed. In the following experiments (Ch. 4), the sum of squares of the differences between the predicted value ŷ and the target value y is used to calculate the loss E (Eq. 2). The parameter n refers to the number of training examples in one training iteration.

    E = Σ_{i=1}^{n} (y_i − ŷ_i)²    (2)

The gradients for updating the weights are computed with the chain rule. In Figure 12, the chain rule is shown exemplarily on a logic gate embedded in a circuit. The black arrows show the forward propagation and the red arrows show the backward propagation. The inputs x and y are propagated forward. The function f is applied on the inputs x and y and the output z is computed. The loss is computed at the end of the network and is used to calculate the gradients from back to front by the chain rule. In the backpropagation, the gate receives as input the gradient ∂E/∂z, which defines the loss with respect to z. This gradient is used to compute the gradients of x and y by the chain rule. To compute the gradient of x, the local gradient ∂z/∂x is multiplied with ∂E/∂z. The chain rule is repeatedly applied to each gate in the circuit from back to front.

Figure 12: Chain rule representation on a gate embedded in a circuit13

In a CNN, the chain rule is repeatedly applied to compute the gradients of the weight matrix and the input matrix in the convolutional layers [20, Ch. 9]. Figure 13 shows the calculation of gradients for the weight matrix. It is the same setting as in Figure 9, but with the information flowing from back to front. G contains the gradients coming from the end of the network, which are used to compute the gradients for the weights in W with respect to the values of input X. The gradients of the weights, displayed in this figure in filter W, are used to update the weights. The gradients for input X are computed with respect to W following the same rule and are used to propagate the error further back through the network.

Figure 13: Representation of a convolutional layer during backpropagation14

Updating the weights means minimizing the error of the network. To minimize the network error, the weights are shifted in the opposite direction of the gradient (Eq. 3), as the gradient points in the direction of the greatest increase. η describes the step size of the weight update and is called the learning rate. The network is updated repetitively during training until the global minimum is found, the end of the defined steps is reached or another stop criterion is met.

    Δw_ij = −η ∂E/∂w_ij    (3)

There are several gradient algorithms which improve this basic algorithm [54]. Two of them are used in the following experiments (Ch. 4): the SGDMomentum, which is a Stochastic Gradient Descent (SGD) with a momentum, and the RMSProp. The SGDMomentum method uses the momentum, because one problem possibly occurring while minimizing the error is that the weight movement towards the global minimum gets stuck in local minima on the error surface, which resemble ravines in a 3D landscape. The momentum overcomes this problem by adding a fraction of the previous weight updates to the current one [56]. This filters out high curves on the error surface of the weight space and influences the direction of the next weight movement, so ravines can be overstepped more easily. The SGD methods in general, and also the SGDMomentum method, are often used with a pre-defined schedule for decaying the learning rate [11, 52]. This means that the weight updates are decreased during the learning process.

The RMSProp is an adaptive gradient method, which was proposed by Geoffrey Hinton in his Coursera class.15 The group of adaptive gradient methods uses gradient information for decreasing the learning rate automatically. As shown in Equation 4, the RMSProp uses the decayed value of all previous gradients as second order information. The running average E[g²]_t is the sum of all previous gradients E[g²]_{t−1}, which are decayed by the parameter γ, and the current squared gradient g_t², weighted by (1 − γ). The learning rate is then divided by √(E[g²]_t) and thus decreases over time.

    E[g²]_t = γ E[g²]_{t−1} + (1 − γ) g_t²    (4)

Usually, all gradient methods use the mini-batch variant to update the trainable network layers after n training examples. The mini-batch update rule is a combination of the update rules of batch Gradient Descent (GD) and the SGD [54]. The GD computes the gradient based on the complete training dataset, while the SGD updates the network after each training example. The update rule of the GD leads to a very slow convergence towards the global minimum, as the weights are only updated after the entire training dataset has been processed. On the other hand, the update rule of the SGD can lead to a high variance during convergence, because it updates the weights after each training example.

13 Visualization adapted from: https://kratzert.github.io/2016/02/12/understanding-the-gradient-flow-through-the-batch-normalization-layer.html, accessed 04.03.2018.
14 Visualization adapted from: https://becominghuman.ai/back-propagation-in-convolutional-neural-networks-intuition-and-code-714ef1c38199, accessed 04.03.2018.
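The two update rules can be sketched as follows. The hyperparameter values (learning rate, momentum term, decay γ) are illustrative defaults, not the settings used in the experiments, and the RMSProp variant shown follows the common formulation in which the current squared gradient is weighted by (1 − γ):

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
    """SGD with momentum: Eq. 3 plus a fraction of the previous update."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

def rmsprop_step(w, grad, avg_sq, lr=0.001, gamma=0.9, eps=1e-8):
    """RMSProp (Eq. 4): divide the learning rate by the root of the
    decayed running average of squared gradients."""
    avg_sq = gamma * avg_sq + (1.0 - gamma) * grad ** 2
    return w - lr * grad / (np.sqrt(avg_sq) + eps), avg_sq

# Minimize the toy loss E(w) = w^2 (gradient 2w) for a few steps
w, v = 5.0, 0.0
for _ in range(50):
    w, v = sgd_momentum_step(w, 2.0 * w, v, lr=0.05)
print(round(w, 3))  # close to the minimum at w = 0
```

In a CNN, `w` and `grad` would be the full weight tensors and their backpropagated gradients rather than a single scalar; the update rule itself is unchanged.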
The mini-batch update rule provides a solution for both disadvantages by updating the network after n training examples, which results in a more stable and faster convergence with a reduced variance.

15 http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf, accessed 07.03.2018.
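The mini-batch scheme described above can be sketched as an epoch loop that shuffles the training data and yields n examples at a time (a hypothetical helper, not the thesis code):

```python
import numpy as np

def minibatches(X, y, batch_size):
    """Yield successive mini-batches of at most batch_size training examples,
    shuffled once per epoch."""
    indices = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = indices[start:start + batch_size]
        yield X[batch], y[batch]

X = np.arange(20, dtype=float).reshape(10, 2)  # 10 toy training examples
y = np.arange(10, dtype=float)                 # 10 toy labels
for X_batch, y_batch in minibatches(X, y, batch_size=4):
    # one forward pass, backpropagation and weight update per mini-batch
    print(len(X_batch))  # 4, 4, 2
```

Setting `batch_size=1` recovers the SGD update rule, and `batch_size=len(X)` the full-batch rule, which makes the mini-batch variant the middle ground between the two.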
