Neural Transfer Learning with Transformers for Social Science Text Analysis

 Sandra Wankmüller
 Ludwig-Maximilians-University Munich
 sandra.wankmueller@gsi.lmu.de
arXiv:2102.02111v1 [cs.CL] 3 Feb 2021

 Abstract. In recent years, there have been substantial increases in the prediction
 performances of natural language processing models on text-based supervised learning
 tasks. Especially deep learning models that are based on the Transformer architecture
 (Vaswani et al., 2017) and are used in a transfer learning setting have contributed to
 this development. As Transformer-based models for transfer learning have the potential
 to achieve higher prediction accuracies with relatively few training data instances, they
 are likely to benefit social scientists who seek text-based measures that are as accurate as
 possible but have only limited resources for annotating training data. To enable social
 scientists to leverage these potential benefits for their research, this paper explains how
 these methods work, why they might be advantageous, and what their limitations are.
 Additionally, three Transformer-based models for transfer learning, BERT (Devlin et al.,
 2019), RoBERTa (Liu et al., 2019), and the Longformer (Beltagy et al., 2020), are com-
 pared to conventional machine learning algorithms on three social science applications.
 Across all evaluated tasks, textual styles, and training data set sizes, the conventional
 models are consistently outperformed by transfer learning with Transformer-based mod-
 els, thereby demonstrating the potential benefits these models can bring to text-based
 social science research.
 Keywords. Natural Language Processing, Deep Learning, Neural Networks, Transfer
 Learning, Transformer, BERT

1 Introduction
Deep learning architectures are well suited to capture the contextual and sequential
nature of language and have enabled natural language processing (NLP) researchers
to more accurately perform a wide range of tasks such as text classification, machine
translation, or reading comprehension (Ruder, 2020). Despite the fact that deep learning
techniques tend to exhibit higher prediction accuracies in text-based supervised learning
tasks compared to traditional machine learning algorithms (Budhwar et al., 2018; Iyyer
et al., 2014; Ruder, 2020), they are far from being a standard tool for social science
researchers that use supervised learning for text analysis.
Although there are exceptions (e.g. Amsalem et al., 2020; Chang and Masterson, 2020;
Muchlinski et al., 2020; Rudkowsky et al., 2018; Wu, 2020; Zhang and Pan, 2019), for
the implementation of supervised learning tasks social scientists typically resort to bag-
of-words vector space representations of texts that serve as an input to conventional
machine learning models such as support vector machines (SVMs), naive Bayes, random
forests, boosting algorithms, or regression with regularization (e.g. via the least absolute
shrinkage and selection operator (LASSO)) (see e.g. Anastasopoulos and Bertelli, 2020;
Barberá et al., 2021; Ceron et al., 2015; Colleoni et al., 2014; Diermeier et al., 2011;
D’Orazio et al., 2014; Fowler et al., 2020; Greene et al., 2019; Katagiri and Min, 2019;
Kwon et al., 2018; Miller et al., 2020; Mitts, 2019; Park et al., 2020; Pilny et al., 2019;
Ramey et al., 2019; Sebők and Kacsuk, 2020; Theocharis et al., 2016; Welbers et al.,
2017).1
One among several likely reasons why deep learning methods so far have not been widely
used for supervised learning tasks by social scientists might be that training deep learning
models is resource intensive. To estimate the exceedingly high number of parameters
that deep learning models typically comprise, large amounts of training examples are
required. For research questions relating to domains in which it is difficult to access or
label large enough amounts of training data instances, deep learning becomes infeasible
or prohibitively costly.
Recent developments within NLP on transfer learning alleviate this problem. Transfer
learning is a learning procedure in which representations learned on a source task are
transmitted to improve learning on the target task (Ruder, 2019). In sequential transfer
learning, a language representation model, which comprises highly general, close to
universal representations of language, is commonly trained by conducting the source task
(Ruder, 2019). Using these learned general-purpose representations as inputs to target
 1
 This is not to say that social scientists have not started to leverage the foundations of deep
learning approaches in NLP: In recent years, the use of real-valued vector representations of terms,
known as word embeddings, enabled social scientists to explore new research questions or to study old
research questions by new means (Han et al., 2018; Kozlowski et al., 2019; Rheault et al., 2016; Rheault
and Cochrane, 2020; Rodman, 2020; Watanabe, 2020). Yet, the analysis of texts relevant to
social science by means of deep learning has up to now mostly been conducted by research teams
that are not primarily trained in the social sciences (e.g. Abercrombie and Batista-Navarro, 2018; Ansari et al., 2020; Budhwar et al., 2018;
Glavaš et al., 2017; Iyyer et al., 2014; Zarrella and Marsh, 2016).

tasks has been shown to improve the prediction performances on a wide range of these
NLP target tasks (Ruder, 2020). Moreover, adapting the language representations to a
target task requires fewer target training examples than when not using transfer learning
and directly training the model on the target task (Howard and Ruder, 2018). Thus,
research on transfer learning in NLP recently has made possible a new mode of learning
in which ready-to-use language representation models, that are already pretrained on a
source task, can be used off-the-shelf and be easily adapted with few resources to any
target task of interest (Alammar, 2018a).
In addition to the efficiency and performance gains from research on transfer learning,
the introduction of the attention mechanism (Bahdanau et al., 2015), which became
incorporated in the Transformer architecture presented by Vaswani et al. in 2017, has
significantly improved the ability of deep learning NLP models to capture contextual
information from texts. During the last years, several Transformer-based models that
are used in a transfer learning setting have been introduced (e.g. Devlin et al., 2019;
Liu et al., 2019; Yang et al., 2019). These models substantively outperform previous
state-of-the-art models across a large variety of NLP tasks (Ruder, 2020).
Due to the likely increases in prediction accuracy as well as the efficient and less
resource-intensive adaptation phase, transfer learning with deep (e.g. Transformer-based)
language representation models seems promising for social science researchers who seek
text-based measures that are as accurate as possible but may lack the resources to annotate large
amounts of data or are interested in specific domains in which only small corpora and
few training instances exist. In order to equip social scientists to use the potentials of
transfer learning with Transformer-based models for their research, this paper provides
a thorough but focused introduction to transfer learning and the Transformer.
The next two sections compare conventional to deep learning approaches and introduce
basic concepts and methods of deep learning. These two sections lay the groundwork
for the two following sections that present transfer learning and the Transformer ar-
chitecture. The section on transfer learning answers the question of what transfer
learning is and explains in more detail in what ways transfer learning might
be beneficial. The subsequent section then introduces the attention mechanism and
the Transformer and elaborates on how the Transformer has advanced the study of text.
Then, an overview of Transformer-based models for transfer learning is provided. Here
a special focus will be given to the seminal Transformer-based language representation
model BERT (standing for Bidirectional Encoder Representations from Transformers)
(Devlin et al., 2019). Afterward, guidance on deep learning and transfer learning in
practice is offered. Finally, three Transformer-based models for transfer learning, BERT
(Devlin et al., 2019), RoBERTa (Liu et al., 2019), and the Longformer (Beltagy et al.,
2020), are compared to traditional learning algorithms based on three classification tasks
using data from speeches in the UK parliament (Duthie and Budzynska, 2018), tweets
regarding the legalization of abortion (Mohammad et al., 2017), and comments from
Wikipedia Talk pages (Jigsaw/Conversation AI, 2018). The final section concludes with
a discussion on task-specific factors and research goals for which neural transfer learning

with Transformers is highly beneficial vs. rather limited.

2 Comparing Conventional Machine Learning to Deep Learning
Starting with core machine learning concepts, this section works out the differences that
conventional machine learning models vs. deep learning models imply for text analysis.
Together with the next section, this section serves as the basis for the subsequently
provided introductions to transfer learning and Transformer-based models.

2.1 Machine Learning
Given raw input data X = {d1 , . . . , dn , . . . , dN } (e.g. a corpus comprising N raw text
files) and a corresponding output variable y = {y1 , . . . , yn , . . . , yN }, the aim in supervised
machine learning is to search the space of possible mappings between X and y to find the
parameters θ of a function such that the estimated function’s predictions, ŷ = f (X , θ̂),
are maximally close to the true values y and thus—given new, yet unseen data Xtest —the
trained model will generate accurate predictions (Chollet, 2020). The distance between
y and f (X , θ̂) is measured by a loss function, L(y, f (X , θ̂)), that guides the search
process (Chollet, 2020).
Learning in supervised machine learning hence means finding f (X , θ̂) and essentially is
a two-step process (Goodfellow et al., 2016): The first step is to learn representations of
the data, fR (X , θ̂R ), and the second is to learn mappings from these representations of
the data to the output.

 ŷ = f (X , θ̂) = fO (fR (X , θ̂R ), θ̂O ) (1)

2.2 Conventional Machine Learning
Conventional machine learning algorithms cover the second step: They learn a func-
tion mapping data representations to the output. This in turn implies that the first
step falls to the researcher who has to (manually) generate representations of the data
herself.
When working with texts, the raw data X typically are a corpus of text documents. A
very common approach then is to transform the raw text files via multiple preprocessing
procedures into a document-feature matrix X = {x1 , . . . , xn , . . . , xN }⊤ (see Figure 1).
In a document-feature matrix each document is represented as a feature vector xn =
{xn1 , . . . , xnk , . . . , xnK } (Turney and Pantel, 2010). Element xnk in this vector gives
the value of document n on feature k—which typically is the (weighted) number of
times that the kth textual feature occurs in the nth document (Turney and Pantel,
2010). To conduct the second learning step, the researcher then commonly applies a
conventional machine learning algorithm on the document-feature matrix to learn the

[Figure 1: a toy corpus (‘I agree :-)’, ‘I confess this is a problem.’, ‘These people need help – now!’, ‘Come on! This is nonsense!’) transformed into a document-feature matrix, illustrating the pipeline from raw data (e.g. corpus) to representations of the data (e.g. document-feature matrix) to output.]
Figure 1: Learning as a Two-Step Process. Learning in machine learning essentially is
a two-step process. In text-based applications of conventional machine learning approaches, the
raw data X first are (manually) preprocessed such that each example is represented as a feature
vector in the document-feature matrix X. Second, these feature representations of the data are
fed as inputs to a traditional machine learning algorithm that learns a mapping between data
representations X and outputs y.

relation between the document-feature representation of the data X and the provided
response values y.
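To make this two-step workflow concrete, the following minimal sketch (in Python, using scikit-learn) turns a toy corpus into a document-feature matrix and then fits a linear SVM on it; the documents, labels, and preprocessing choices are hypothetical placeholders, not taken from the applications in this paper.

# Step 1: researcher-defined preprocessing into a document-feature matrix.
# Step 2: a conventional learning algorithm maps the representations to the labels.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

texts = ["I agree :-)", "I confess this is a problem.",
         "These people need help - now!", "Come on! This is nonsense!"]
labels = [1, 0, 0, 0]  # e.g. 1 = agreement, 0 = no agreement (toy labels)

vectorizer = CountVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(texts)          # document-feature matrix
clf = LinearSVC().fit(X, labels)             # learns the mapping to the output
print(clf.predict(vectorizer.transform(["I completely agree!"])))

Every modeling decision made before the call to fit_transform (lowercasing, stopword removal, the unigram feature set) is made by the researcher, which is exactly the first learning step discussed above.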
There are two difficulties with this approach. The first is that it may be hard for the
researcher to a priori know which features are useful for the task at hand (Goodfellow
et al., 2016). The performance of a supervised learning algorithm will depend on the
representation of the data in the document-feature matrix (Goodfellow et al., 2016).
In a classification task, features that capture observed linguistic variation that helps in
assigning the texts into the correct categories are more informative and will lead to a
better classification performance than features that capture variation that is not helpful
in distinguishing between the classes (Goodfellow et al., 2016). Yet, determining which
sets of possibly highly abstract and complex features are informative and which are not
is highly difficult (Goodfellow et al., 2016): A researcher can choose from a multitude of
possible preprocessing steps such as stemming, lowercasing, removing stopwords, adding
part-of-speech (POS) tags, or applying a sentiment lexicon.2 The set of selected pre-
processing steps as well as the order in which they are implemented define the way in
which the texts at hand are represented and thus affect the research findings (Denny and
Spirling, 2018; Goodfellow et al., 2016). Social scientists may be able to use some of their
domain knowledge in deciding upon a few specific preprocessing decisions (e.g. whether
it is likely that excluding a predefined list of stopwords will be beneficial because it reduces
dimensionality or will harm performance because the stopword list includes some terms that
are important). Domain knowledge, however, is most unlikely to guide researchers regarding
all possible permutations of preprocessing steps. Simply trying out each possible preprocessing
permutation in order to select the best performing one for a supervised task is not possible
given the massive number of permutations and limited researcher resources.

 2
 For a more detailed list of possible steps see Turney and Pantel (2010) and Denny and Spirling
(2018).
The second problem is that in a document-feature matrix each document is represented
as a single vector xn —which implies that each document is treated as a bag-of-words
(Turney and Pantel, 2010). Bag-of-words representations disregard word order and syn-
tactic or semantic dependencies between words in a sequence (Turney and Pantel, 2010).3
Yet, text is contextual and sequential by nature. Word order carries meaning. And the
context in which a word is embedded is essential in determining the meaning of a
word. When represented as a bag-of-words, the sentence ‘The opposition party leader
attacked the prime minister.’ cannot be distinguished from the sentence ‘The prime
minister attacked the opposition party leader.’. Moreover, the fact that the word ‘party’
here refers to a political party rather than a festive social gathering only becomes clear
from the context. Although bag-of-words models perform relatively well given the sim-
ple representations of text they build upon (Grimmer and Stewart, 2013), it has been
shown that capturing contextual and sequential information is likely to enhance predic-
tion performances (see for example Socher et al., 2013).
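The loss of word order can be illustrated with a short sketch (hypothetical two-sentence corpus): both example sentences from above receive exactly the same bag-of-words vector.

from sklearn.feature_extraction.text import CountVectorizer

sents = ["The opposition party leader attacked the prime minister.",
         "The prime minister attacked the opposition party leader."]
X = CountVectorizer().fit_transform(sents).toarray()
print((X[0] == X[1]).all())  # True: the two representations are indistinguishable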

2.3 Deep Learning
In contrast to conventional machine learning algorithms, deep learning models conduct
both learning steps: they learn representations of the data and a function mapping data
representations to the output.
Deep learning is one branch of representation learning in which features are not designed
and selected manually but are learned by a machine learning algorithm (Goodfellow
et al., 2016). In deep learning models, an abstract representation of the data is learned
by applying the data to a stack of several simple functions (Goodfellow et al., 2016).
Each function takes as an input the representation of the data created by (the sequence
of) previous functions and generates a new representation:

 f (X , θ̂) = fO (. . . fR3 (fR2 (fR1 (X , θ̂R1 ), θ̂R2 ), θ̂R3 ) . . . , θ̂O ) (2)

The first function fR1 takes as an input the data and provides as an output a first
representation of the data. The second function fR2 takes as an input the representation
learned by the first function and generates a new representation and so on. Deep learning
 3
 By counting the occurrence of word sequences of length N , N -gram models extend unigram-based
bag-of-words models and allow for capturing information from small contexts around words. Yet, by
including N -grams as features the dimensionality of the feature space increases, thereby increasing the
problem of high dimensionality and sparsity. Moreover, texts often exhibit dependencies between words
that are positioned much farther apart than what could be captured with N -gram models (Chang and
Masterson, 2020).

models thus are characterized by providing a layered representation of the data in which
each layer of representation is based on previous data representations (Goodfellow et al.,
2016). Hereby, complex and abstract representations are learned from less abstract,
more simple ones (Goodfellow et al., 2016).
As they are composed of nested layers of functions, deep learning models are called
deep (Goodfellow et al., 2016). The representation layers, {R1 , R2 , R3 , . . . }, are named
hidden layers (Goodfellow et al., 2016). The dimensionality of the vector-valued hidden
layer outputs is a model’s width and the number of hidden layers is the depth of a
model (Goodfellow et al., 2016). Deep learning models differ in their architecture. Most
types of models, however, are based on a chain of functions as specified in equation 2
(Goodfellow et al., 2016).
Deep learning models do learn representations of the data. Yet, when applying them
to text-based applications they do not take as an input the raw text documents. They
still have to be fed with a data format they can read. Rather than taking as an input
a raw text document dn = {a1 , . . . , at , . . . , aT }, which is a sequence of words (or more
precisely: a sequence of tokens), deep learning models take as an input a sequence of
word embeddings {z[a1 ] , . . . , z[at ] , . . . , z[aT ] }.
Word embeddings are real-valued vector representations of the unique terms in the vo-
cabulary (Mikolov et al., 2013c). The vocabulary Z consists of the U unique terms in
the corpus: Z = {z1 , . . . , zu , . . . , zU }. Each vocabulary term zu can be represented as
a word embedding—a K-dimensional real-valued vector zu ∈ RK . And as each token
at is an instance of one of the unique terms in the vocabulary, each token has a corre-
sponding embedding. A document {a1 , . . . , at , . . . , aT } consequently can be mapped to
{z[a1 ] , . . . , z[at ] , . . . , z[aT ] }.
A deep learning model thus is fed with a sequence of real-valued vectors instead of a
sequence of tokens. The values of the word embedding vectors are parameters that are
learned in the optimization process. Hence, the representation for each unique textual
token is learned when training the model. Except for the decision of how to separate
a continuous text sequence into tokens (and sometimes also whether to lowercase the
tokens and consequently to only learn embeddings for lowercased tokens), deep learning
models learn representations of the data rather than operating on manually tailored rep-
resentations. Also note that deep learning NLP models—instead of learning one vector
representation for each term—learn one vector per term in each hidden layer.
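A minimal sketch of this embedding lookup (in PyTorch, with a toy vocabulary and a hypothetical embedding dimensionality K = 8) looks as follows; in a real model the entries of the embedding matrix are updated during training.

import torch
import torch.nn as nn

vocab = {"<pad>": 0, "the": 1, "party": 2, "leader": 3, "attacked": 4}
K = 8                                               # embedding dimensionality (toy value)
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=K)

tokens = ["the", "party", "leader"]
ids = torch.tensor([[vocab[t] for t in tokens]])    # token indices, shape (1, T)
z = embedding(ids)                                  # embeddings, shape (1, T, K)
print(z.shape)                                      # each token is now a K-dimensional vector
# embedding.weight holds the learnable parameters that represent the vocabulary terms.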
As deep learning NLP models learn to represent a raw sequence of tokens as a sequence
of vectors instead of taking it as a bag-of-words, deep learning models can learn depen-
dencies between tokens and take into account contextual information. Deep learning
architectures such as recurrent neural networks (RNNs) (Elman, 1990) and the Transformer
(Vaswani et al., 2017) are especially suited to capture dependencies between sequen-
tial input data and are able to encode information from the textual context a token is
embedded in.

2.4 Well-Performing and Efficient Models for NLP Tasks
The no free lunch theorem (Wolpert and Macready, 1997) states that, averaged over all
possible classification tasks, every classification algorithm achieves the same generalization
performance (Goodfellow et al., 2016). This theorem implies that there is
no universally superior machine learning algorithm (Goodfellow et al., 2016). A specific
algorithm will perform well on one task but less well on another.
Hence, given a type of learning task, researchers ideally should develop and use machine
learning algorithms that are especially suited to conduct this type of task. That is,
researchers should employ algorithms that are good at approximating the functions
that map from feature inputs to provided outputs in the real-world applications they
encounter (Goodfellow et al., 2016).
In NLP tasks, models have to learn mappings between textual inputs and task-specific
outputs. Common NLP tasks are binary or multi-class classification tasks in which the
model’s task is to assign one out of two or one out of several class labels to each text
sequence. Other NLP tasks, for example, require the model to answer multiple choice
questions about provided input texts or to make predictions about the entailment of a
sentence pair (Wang et al., 2019b).
Due to their layered architecture and vector-valued representations, deep learning
models tend to have a high capacity. That is, they can approximate a large variety of
complex functions (Goodfellow et al., 2016). On less complex data structures, large deep
learning models may risk overfitting and conventional machine learning approaches with
lower expressivity may be more suitable. The ability to express complicated functions
as well as the ability to automatically learn representations and the ability to encode
information on connections between sequential inputs, however, seems essential when
working with textual data: Empirically, deep learning models outperform classic machine
learning algorithms on nearly all NLP tasks (Ruder, 2020).
Despite these advantages, deep learning methods so far are not widely used for super-
vised learning tasks within the quantitative text analysis community in social science.
Some studies do use deep learning models (e.g. Amsalem et al., 2020; Chang and Mas-
terson, 2020; Muchlinski et al., 2020; Rudkowsky et al., 2018; Wu, 2020; Zhang and Pan,
2019). Yet, until now social scientists typically approach supervised classification tasks
by generating bag-of-words based representations of texts that then are passed on to
conventional machine learning algorithms such as SVMs, naive Bayes, regularized regression,
or tree-based methods (see e.g. Anastasopoulos and Bertelli, 2020; Barberá et al., 2021;
Ceron et al., 2015; Colleoni et al., 2014; Diermeier et al., 2011; D’Orazio et al., 2014;
Fowler et al., 2020; Greene et al., 2019; Katagiri and Min, 2019; Kwon et al., 2018; Miller
et al., 2020; Mitts, 2019; Park et al., 2020; Pilny et al., 2019; Ramey et al., 2019; Sebők
and Kacsuk, 2020; Theocharis et al., 2016; Welbers et al., 2017).
There are several factors that might have contributed to the sporadic rather than
widespread use of deep learning models within text-based social science. One likely

reason is that deep learning models have considerably more parameters to be learned
in training than classic machine learning models. Consequently, deep learning models
are computationally highly intensive and require substantially larger amounts of train-
ing examples. How much more training data are needed depends on the width and
depth of the model, the task, and training data quality. Thus, precise numbers on the
amounts of parameters and required training examples cannot be specified. To nevertheless
put these sizes in relation, note that an SVM with a linear kernel that learns to
construct a hyperplane in a 3000-dimensional feature space which separates instances
into two categories based on 1000 support vectors has around 3 million parameters. The
Transformer-based models presented in this article, in contrast, have well above 100
million parameters.
Recent developments of deep language representation models for transfer learning, how-
ever, reduce the number of training instances needed to achieve the same level of per-
formance as when not using transfer learning (Howard and Ruder, 2018). The intro-
duction of the Transformer (Vaswani et al., 2017) additionally has improved the study
of text. Due to its self-attention mechanisms the Transformer is better able to encode
contextual information and dependencies between tokens than previous deep learning
architectures (Vaswani et al., 2017). To enable social science research to leverage these
potentials, this study presents and explains the workings and usage of Transformer mod-
els for transfer learning.

3 Introduction to Deep Learning
This section provides an introduction to the basics of deep learning and thereby lays
the foundation for the following sections. First, based on the example of feedforward
neural networks (FNNs) the core elements of neural network architectures are explicated.
Then, the optimization process via stochastic gradient descent with backpropagation
(Rumelhart et al., 1986) will be presented. Subsequently, the architecture of recurrent
neural networks (RNNs) (Elman, 1990) is outlined.

3.1 Feedforward Neural Network
The most elementary deep learning model is a feedforward neural network (FNN), also
named multilayer perceptron (Goodfellow et al., 2016). A feedforward neural network
with L hidden layers, vector input x and a scalar output y can be visualized as in Figure
2 and be described as follows (see also Ruder, 2019):

 h1 = σ1 (W1 x + b1 )                                    (3)

 h2 = σ2 (W2 h1 + b2 )                                   (4)
 ...
 hl = σl (Wl hl−1 + bl )                                 (5)
 ...
 y = σO (wO hL + bO )                                    (6)

The input to the neural network is the K0 -dimensional vector x (see equation 3). x
enters an affine function characterized by weight matrix W1 and bias vector b1 , whereby
W1 ∈ RK1 ×K0 , and b1 ∈ RK1 . σ1 is a nonlinear activation function and h1 ∈ RK1
is the K1 -dimensional representation of the data in the first hidden layer. That is, the
neural network takes the input data x and, by combining an affine function with a
nonlinear activation function, generates a new, transformed representation of the original
input: h1 . The hidden state h1 in turn serves as the input for the next layer that
produces representation h2 ∈ RK2 . This continues through the layers until the last
hidden representation, hL ∈ RKL , enters the output layer (see equation 6).

Figure 2: Feedforward Neural Network. Feedforward Neural Network with L hidden
layers, four units per hidden layer and scalar output y. The solid lines indicate the linear
transformations encoded in weight matrix W1 . The dotted lines indicate the connections between
several consecutive hidden layers.

The activation functions in neural networks typically are chosen to be nonlinear (Good-
fellow et al., 2016). The reason is that if the activation functions were set to be linear,
the output of the neural network would merely be a linear function of x (Goodfellow
et al., 2016). Hence, the use of nonlinear activation functions is essential for the ca-
pacity of neural networks to approximate a wide range of functions and highly complex
functions (Ruder, 2019).
In the hidden layers, the rectified linear unit (ReLU) (Nair and Hinton, 2010) is often
used as an activation function σl (Goodfellow et al., 2016). If q = {q1 , . . . , qk , . . . , qK }
is the K-dimensional vector resulting from the affine transformation in the lth hidden
layer, q = Wl hl−1 + bl (see equation 5), then ReLU is applied on each element qk :
 σl (q)k = max{0, qk } (7)

σl (q)k then is the kth element of hidden state vector hl .4
In the output layer, the activation function σO is selected so as to produce an output that
matches the task-specific type of output values. In binary classification tasks with yn ∈
{0, 1} the standard logistic function, often simply referred to as the sigmoid function,
is a common choice (Goodfellow et al., 2016). For a single observational unit n, the
sigmoid function’s scalar output value is interpreted as the probability that yn = 1.
If yn , however, can assume one out of C unordered response category values, yn ∈
{1, . . . , c, . . . , C}, the softmax function, a generalization of the sigmoid function that
takes as an input and produces as an output a vector of length C, is typically employed
(Goodfellow et al., 2016). For the nth example, the cth element of the softmax output
vector gives the probability predicted by the model that unit n falls into class c.
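The following sketch implements the feedforward network of equations 3 to 6 in PyTorch for a binary classification task; the input dimensionality and the layer widths are illustrative choices, not values used elsewhere in this paper.

import torch
import torch.nn as nn

class FNN(nn.Module):
    def __init__(self, k0=300, widths=(128, 64)):
        super().__init__()
        layers, k_prev = [], k0
        for k in widths:                            # hidden layers: affine map + ReLU
            layers += [nn.Linear(k_prev, k), nn.ReLU()]
            k_prev = k
        layers.append(nn.Linear(k_prev, 1))         # output layer producing a scalar
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return torch.sigmoid(self.net(x))           # sigmoid output: P(y = 1 | x)

model = FNN()
x = torch.randn(4, 300)                             # a mini-batch of four input vectors
print(model(x).shape)                               # torch.Size([4, 1])

For a multi-class task, the final layer would output C values and the sigmoid would be replaced by a softmax.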

3.2 Optimization: Gradient Descent with Backpropagation
In supervised learning tasks, a neural network is provided with input xn and correspond-
ing output yn for each training example. All the weights and bias terms are parameters
to be learned in the process of optimization (Goodfellow et al., 2016). The set of pa-
rameters hence is θ = {W1 , . . . , Wl , . . . , WL , WO , b1 , . . . , bl , . . . , bL , bO }.
For a single training example n, the loss function L(yn , f (xn , θ̂)) measures how well the
value predicted for n by model f (xn , θ̂), that is characterized by the estimated parameter
set θ̂, matches the true value yn . In the optimization process the aim is to find the set of
values for the weights and biases that minimizes the average of the observed losses over all
training set instances, also known as the empirical risk: R(θ̂) = (1/N) Σ_{n=1}^{N} L(yn , f (xn , θ̂))
(Goodfellow et al., 2016).
Neural networks commonly employ variants of gradient descent with backpropagation in
the optimization process (Goodfellow et al., 2016). To approach the local minimum of
the empirical risk function, the gradient descent algorithm makes use of the fact that the
direction of the negative gradient of function R at current point θ̂i gives the direction in
which R is decreasing fastest—the direction of the steepest descent (Goodfellow et al.,
2016). The gradient is a vector of partial derivatives. It is the derivative of R at point
θ̂i and is commonly denoted as ∇θ̂i R(θ̂i ).
So, in the ith iteration, the gradient descent algorithm computes the negative gradient
of R at current point θ̂i and then moves from θ̂i into the direction of the negative
gradient:
 θ̂i+1 = θ̂i − η∇θ̂i R(θ̂i ) (8)

whereby η ∈ R+ is the learning rate. If η is small enough, then R(θ̂i ) ≥ R(θ̂i+1 ) ≥
R(θ̂i+2 ) ≥ . . . . That is, repeatedly computing the gradient and updating in the
 4
 Activation functions that are similar to ReLU are the Exponential Linear Unit (ELU) (Clevert et al.,
2016), Leaky ReLU and the Gaussian Error Linear Unit (GELU) (Hendrycks and Gimpel, 2016). The
latter is used in BERT (Devlin et al., 2019).

direction of the negative gradient with a small enough learning rate η will generate a
sequence moving toward the local minimum (Li et al., 2020a).
In each iteration, the gradients for all parameters are computed via the backpropagation
algorithm (Rumelhart et al., 1986).5 A very frequently employed approach, known as
mini-batch gradient descent or stochastic gradient descent, is to compute the gradients
based on a small random sample, a mini-batch, of M training set observations:
 ∇θ̂i R(θ̂i ) = (1/M) Σ_{m=1}^{M} ∇θ̂i L(ym , f (xm , θ̂i ))                (9)

The gradient computed on a sample of training instances usually provides a good ap-
proximation of the loss function gradient evaluated on the entire training set (Goodfellow
et al., 2016). It requires fewer computational resources and thus leads to faster conver-
gence (Goodfellow et al., 2016). M typically assumes a value in the range from 2 to a
few hundred (Ruder, 2019).
The size of the mini-batch M and the learning rate η are hyperparameters in training
neural network models. Especially the learning rate is often carefully tuned (Li et al.,
2020a). Too high a learning rate leads to large oscillations in the loss function values,
whereas too low a learning rate implies slow convergence and risks getting stuck at a non-
optimal point with a high loss value (Goodfellow et al., 2016). Commonly, the learning
rate is not kept constant but is set to decrease over time (Goodfellow et al., 2016).
Furthermore, there are variants of stochastic gradient descent, such as AdaGrad (Duchi et al.,
2011), RMSProp (Hinton et al., 2012), and Adam (Kingma and Ba, 2015), that have a
varying learning rate for each individual parameter (Goodfellow et al., 2016).
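A bare-bones sketch of mini-batch stochastic gradient descent with backpropagation is given below (synthetic data; the learning rate, mini-batch size, and number of epochs are arbitrary illustrative values).

import torch
import torch.nn as nn

X = torch.randn(1000, 20)                           # synthetic inputs
y = (X[:, 0] > 0).float().unsqueeze(1)              # synthetic binary labels

model = nn.Sequential(nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, 1))
loss_fn = nn.BCEWithLogitsLoss()                    # the loss L(y, f(x, theta))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # learning rate eta = 0.1
M = 32                                              # mini-batch size

for epoch in range(5):
    perm = torch.randperm(len(X))                   # shuffle the training set
    for i in range(0, len(X), M):
        idx = perm[i:i + M]                         # draw one mini-batch
        loss = loss_fn(model(X[idx]), y[idx])
        optimizer.zero_grad()
        loss.backward()                             # backpropagation: compute gradients
        optimizer.step()                            # theta <- theta - eta * gradient
    print(epoch, loss.item())

Replacing torch.optim.SGD with torch.optim.Adam (or Adagrad, RMSprop) yields the adaptive variants mentioned above.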

3.3 Recurrent Neural Networks
The recurrent neural network (RNN) (Elman, 1990) is the most basic neural network
to process sequential input data of variable length such as texts (Goodfellow et al.,
2016). Given an input sequence of T token embeddings {z[a1 ] , . . . , z[at ] , . . . , z[aT ] }, RNNs
sequentially process each token. Here, one input token embedding z[at ] corresponds to
one time step t and the hidden state ht is updated at each time step. At each step t,
the hidden state ht is a function of the hidden state generated in the previous time step,
ht−1 , as well as new input data, z[at ] (see Figure 3) (Amidi and Amidi, 2019; Elman,
1990).
The hidden states ht , which are passed on and transformed through time, serve as the
model’s memory (Elman, 1990; Ruder, 2019). They capture the information of the se-
quence that entered until t (Goodfellow et al., 2016). Due to this sequential architecture,
RNNs theoretically can model dependencies over the entire range of an input sequence
 5
 The backpropagation algorithm makes use of the chain rule to compute the gradients. Helpful
introductions to the backpropagation algorithm can be found in Li et al. (2020a), Li et al. (2020b) and
Hansen (2020).


Figure 3: Recurrent Neural Network. Architecture of a basic recurrent neural network
unfolded through time. At time step t, the hidden state ht is a function of the previous hidden
state, ht−1 , and current input embedding z[at ] . yt is the output produced at t.

(Amidi and Amidi, 2019). In practice, however, recurrent models have problems learning
dependencies extending beyond sequences of 10 or 20 tokens (Goodfellow et al., 2016).
The reason is that when backpropagating the gradients through the time steps (this is
known as Backpropagation Through Time (BPTT)), the gradients may vanish and thus
fail to transmit a signal over long ranges (Goodfellow et al., 2016).
The Long Short-Term Memory (LSTM) model (Hochreiter and Schmidhuber, 1997) ex-
tends the RNN with input, output, and forget gates that enable the model to accumulate,
remember, and forget provided information (Goodfellow et al., 2016). This makes LSTMs
better suited than basic RNNs to model dependencies stretching over long time
spans as found in textual data (Ruder, 2019).
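The sketch below shows how an LSTM consumes a sequence of token embeddings and returns one hidden state per time step (all dimensions are toy values; the input here is random rather than the output of an embedding layer).

import torch
import torch.nn as nn

T, K, H = 12, 8, 16                     # sequence length, embedding dim, hidden dim
z = torch.randn(1, T, K)                # one embedded token sequence (batch size 1)
lstm = nn.LSTM(input_size=K, hidden_size=H, batch_first=True)

outputs, (h_T, c_T) = lstm(z)
print(outputs.shape)                    # (1, T, H): hidden state h_t for every step t
print(h_T.shape)                        # (1, 1, H): final hidden state, the model's 'memory'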

4 Transfer Learning
Transformer-based models for transfer learning fuse two lines of recent NLP research:
transfer learning and attention. Both concepts and respective developments will be
outlined in the following two sections such that after reading these sections the reader will
not only be prepared to understand the workings of transfer learning with Transformer-
based models like BERT but also will be informed about latest major developments
within NLP. This first section defines transfer learning and describes the benefits and
uses of transfer learning.

4.1 A Taxonomy of Transfer Learning
The classic approach in supervised learning is to have a training data set containing a
large number of annotated instances, {xn , yn }_{n=1}^{N} , that are provided to a model that
learns a function relating the xn to the yn (Ruder, 2019). If the train and test data
instances have been drawn from the same distribution over the feature space, the trained

model can be expected to make accurate predictions for the test data, i.e. to generalize
well (Ruder, 2019). Given another task (e.g. another set of labels to learn) or another
domain (e.g. another set of documents with a different thematic focus), the standard
supervised learning procedure would be to sample and create a new training data set for
this new task and domain (Pan and Yang, 2010; Ruder, 2019). That is, for each new task
and domain a new model is trained from the start (Pan and Yang, 2010; Ruder, 2019).
There is no transferal of already existing, potentially relevant and useful information
from related domains or tasks to the task at hand (Ruder, 2019).
Trained supervised learning models thus are not good at generalizing to data exhibit-
ing characteristics different from the data they have been trained on (Ruder, 2019).
Moreover, the (manual) labeling of thousands to millions of training instances for each
new task makes supervised learning highly resource intensive and prohibitively costly to
be applied for all potentially useful and interesting tasks (Pan and Yang, 2010; Ruder,
2019). In situations in which the number of annotated training examples is restricted or
the researcher lacks the resources to label a sufficiently large number of training instances,
classic supervised learning fails (Ruder, 2019).
This is where transfer learning comes in. Transfer learning refers to statistical learning
procedures in which knowledge learned by training a model on a related task and domain,
the source task and source domain, is transferred to the learning process of the target
task in the target domain (Pan and Yang, 2010; Ruder, 2019).
Ruder (2019) provides a taxonomy of transfer learning scenarios in NLP. In transduc-
tive transfer learning, the source and target tasks are the same but annotated training
examples only are available for the source domain (Ruder, 2019). Here, knowledge is
transferred across domains (domain adaptation) or languages (cross-lingual learning)
(Ruder, 2019). In inductive transfer learning source and target tasks differ but the
researcher has access to at least some labelled training samples in the target domain
(Ruder, 2019). In this setting, tasks can be learned simultaneously (multi-task learning)
or sequentially (sequential transfer learning) (Ruder, 2019).

4.2 Sequential Transfer Learning
In this article the focus is on sequential transfer learning, which at present is the most
frequently employed type of transfer learning (Ruder, 2019). In sequential transfer learn-
ing the source task differs from the target task and training is conducted in a sequential
manner (Ruder, 2019). Here, two stages are distinguished (see also Figure 4): First, a
model is trained on the source task (pretraining phase) (Ruder, 2019). Subsequently,
the knowledge gained in the pretraining phase is transmitted to the model trained on
the target task (adaptation phase) (Ruder, 2019). In NLP, the knowledge that is transferred
consists of the parameter values learned during training of the source model, and thus also
includes the values of the token representation vectors. The source model is thus also
referred to by some authors as a language representation model (see e.g. Devlin et al.,
2019).

[Figure 4: example pretraining sentences (e.g. ‘We laughed out loud [EOS]’, ‘The sun shone bright ##ly [EOS]’) enter the source model during pretraining; the learned parameters are then passed to the target model for adaptation.]

Figure 4: Pretraining and Adaptation Phase in Sequential Transfer Learning.
In the pretraining phase of sequential transfer learning a model is trained on the source task
{Xsource , ysource }. The representations (i.e. the parameters) learned in the pretraining phase
then are taken to the learning process of the target task, {Xtarget , ytarget }. A once pretrained
model can serve as an input for a large variety of target tasks.

The common procedure in sequential transfer learning in NLP is to select a source task
that will learn very general, close to universal representations of language in the pre-
training phase that then provide an effective, well-generalizing input for a large spectrum
of specific target tasks (Ruder, 2019). As many training instances are required to learn
such general language representations, training a source model in the sequential trans-
fer learning setting is highly expensive (Ruder, 2019). Yet, adapting a once pretrained
model to a target task is often fast and cheap as transfer learning procedures require
only a small proportion of the annotated target data required by standard supervised
learning procedures to achieve the same level of performance (Howard and Ruder, 2018).
In Howard and Ruder (2018), for example, training the deep learning model ULMFiT
(short for Universal Language Model Fine-Tuning) from scratch on the target task re-
quires 5 to 20 times more labeled training examples to reach the same error rate as
using ULMFiT for transfer learning.
Beside being more efficient, transfer learning also tends to increase prediction perfor-
mances: Starting with the work by Yosinski et al. (2014) and Donahue et al. (2014), the
computer vision community discovered that when the weights that had been learned by
training deep learning models on the ImageNet Dataset (a data set containing annota-
tions for more than 14 million images (Deng et al., 2009)) are used as pretrained inputs
to other computer vision target tasks, this substantially increases the prediction perfor-
mances on the target tasks—even if only few target training instances are used (Ruder,
2019). In computer vision, sequential transfer learning with pretraining on the ImageNet
Dataset has proven so effective that it has been established as a widely used learning
procedure for several years now (Ruder, 2019). Also in NLP, transfer learning based
on models such as ULMFiT (Howard and Ruder, 2018), OpenAI GPT (standing for Gener-
ative Pretrained Transformer) (Brown et al., 2020; Radford et al., 2018, 2019), BERT
(Devlin et al., 2019), and XLNet (Yang et al., 2019) has led to substantial increases in
prediction performances in recent years (Ruder, 2019).
The smaller the target task training data set size, the more salient the pretrained rep-
resentations become. When decreasing the number of target task training set instances,
the prediction performance of deep learning models that are trained from scratch on the
target task declines (Howard and Ruder, 2018). For deep neural networks that are used
in a transfer learning setting and receive well-generalizing representations from pretrain-
ing, in contrast, prediction performance levels on the target task decrease more slowly
and slightly with diminishing numbers of training set instances (Howard and Ruder,
2018). So, transfer learning is likely to enhance prediction performances on a target
task. These enhancements are likely to be especially strong for medium-sized and small
target task training data sets (Howard and Ruder, 2018).

4.3 Pretraining
The discoveries with the ImageNet Dataset suggest that in order to learn general, close
to universal representations, which are relevant for a wide range of tasks within an entire
discipline, two things are required: a pretraining data set that contains a large number
of training samples and is representative of the feature distribution studied across the
discipline, and a suitable pretraining task (Ruder, 2018, 2019).

4.3.1 Pretraining Tasks
The most fundamental pretraining approaches in NLP are self-supervised (Ruder, 2019).
Among these, a very common pretraining task is language modeling (Bengio et al., 2003).
A language model learns to assign a probability to a sequence of tokens (Bengio et al.,
2003). As the probability for a sequence of T tokens, P (a1 , . . . , at , . . . , aT ), can be
computed as
 P (a1 , . . . , at , . . . , aT ) = ∏_{t=1}^{T} P (at | a1 , . . . , at−1 )                 (10)
or as
 P (a1 , . . . , at , . . . , aT ) = ∏_{t=T}^{1} P (at | aT , . . . , at+1 )                 (11)

language modeling involves predicting the conditional probability of token at given all
its preceding tokens, P (at |a1 , . . . , at−1 ), or predicting the conditional prob-
ability of token at given all its succeeding tokens, P (at |aT , . . . , at+1 ) (Bengio et al.,
2003; Yang et al., 2020). A forward language model models the probability in equation
10, whereas a backward language model computes the probability in equation 11 (Peters et al.,
2018).
When being trained on a forward and/or backward language modeling task in pre-
training, a model is likely to learn general structures and aspects of language, such as
long-range dependencies, compositional structures, semantics and sentiment, that are
relevant for a wide range of possible target tasks (Howard and Ruder, 2018; Ruder,
2018). Hence, language modeling can be argued to be a well-suited pretraining task
(Howard and Ruder, 2018).
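To illustrate the forward factorization in equation 10, the sketch below scores a sentence with a pretrained forward (causal) language model from the Hugging Face transformers library; GPT-2 is used here merely as a convenient, publicly available example of such a model.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "The opposition party leader attacked the prime minister."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, labels=inputs["input_ids"])
# out.loss is the average negative log-probability of each token given its
# preceding tokens, i.e. the quantity a forward language model is trained to minimize.
print(out.loss.item())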

4.3.2 Pretraining Data Sets
In contrast to computer vision where pretraining on the ImageNet Dataset is ubiquitous,
there is no standard pretraining data set in NLP (Aßenmacher and Heumann, 2020;
Ruder, 2019). The text corpora that have been employed for pretraining vary widely
regarding the number of tokens they contain as well as their accessibility (Aßenmacher
and Heumann, 2020).6 Most models are trained on a combination of different corpora.
Several models (e.g. Devlin et al., 2019; Lan et al., 2020; Liu et al., 2019; Yang et al.,
2019) use the English Wikipedia and the BooksCorpus Dataset (Zhu et al., 2015). Many
models (e.g. Brown et al., 2020; Liu et al., 2019; Radford et al., 2019; Yang et al., 2019)
additionally also use pretraining corpora made up of web documents obtained from
crawling the web.
 6
 A detailed and systematic overview of these data sets is provided by Aßenmacher and Heumann
(2020).

Note that whereas pretraining is not standardized, there are common benchmark data
sets that are used when it comes to comparing the performance of different learning
models on NLP target tasks. Widely utilized benchmark data sets, which together cover a
large variety of NLP tasks, are the General Language Understanding Evaluation (GLUE)
(Wang et al., 2019b), SuperGLUE (Wang et al., 2019a), the Stanford Question Answer-
ing Dataset (SQUAD) (Rajpurkar et al., 2016, 2018) and the ReAding Comprehension
Dataset From Examinations (RACE) (Lai et al., 2017). For more information on GLUE,
SuperGLUE, SQUAD and RACE see the explanatory box below.
——————————————————————————————————

NLP Tasks. GLUE comprises nine English language understanding subtasks in which
binary or multi-class classification problems have to be solved (Wang et al., 2019b). The
tasks vary widely regarding training data set size, textual style, and difficulty (Wang
et al., 2019b). The Stanford Sentiment Treebank (SST-2) (Socher et al., 2013) sub-
task, for example, is a single-sentence binary classification task in which the sentiments
expressed in sentences from movie reviews are to be classified as being positive vs. neg-
ative (Wang et al., 2019b). In the subtask based on the Multi-Genre Natural Language
Inference Corpus (MNLI) (Williams et al., 2018), in contrast, the model is presented
with a sentence pair consisting of a premise and a hypothesis. The model has to make a
prediction whether the premise entails, contradicts, or neither entails nor contradicts the
hypothesis (Wang et al., 2019b). SuperGLUE was introduced when the performance of
deep neural networks used for transfer learning surpassed human performance on GLUE
(Wang et al., 2019a). SuperGLUE comprises eight subtasks considered to be more dif-
ficult than the GLUE tasks, e.g. natural language inference or reading comprehension
tasks (Wang et al., 2019a). Models that are evaluated based on SQUAD are presented
with passages from Wikipedia articles and corresponding questions (Rajpurkar et al.,
2016). The task is to return the text segment answering the question or discern that
the question is not answerable by the provided passage (Rajpurkar et al., 2016, 2018).
Finally, RACE is made up of 97,687 multiple choice questions on 27,933 text passages
taken from English exams for Chinese students testing reading comprehension ability
(Lai et al., 2017).
——————————————————————————————————
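For readers who want to inspect these benchmarks directly, one GLUE subtask can be loaded with the Hugging Face datasets library as in the sketch below (SST-2 is chosen arbitrarily; an internet connection and the datasets package are assumed).

from datasets import load_dataset

sst2 = load_dataset("glue", "sst2")
print(sst2)              # train / validation / test splits and their sizes
print(sst2["train"][0])  # one example: a sentence and its binary sentiment label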

4.3.3 Deep Pretraining
Another issue beside a suitable pretraining task and a large pretraining corpus is the
depth (i.e. the number of hidden layers) of the neural language representation model.
From the perspective of transfer learning, the early seminal word embedding models,
such as continuous bag-of-words (CBOW) (Mikolov et al., 2013a), skip-gram (Mikolov
et al., 2013a,b), and Global Vectors (GloVe) (Pennington et al., 2014), are self-supervised

pretraining approaches (Ruder, 2019). In these models, word embeddings are learned
based on neural network architectures that have no hidden layer (Ruder, 2019). This
implies that these models learn a single vector representation per term, which encodes only
one type of information (Peters et al., 2018). Representing each term with a single vector, however,
may become problematic if the meaning of a term changes with context (e.g. ‘party’)
(Peters et al., 2018). Moreover, for models to deduce meaning from sequences of words,
several different types of information (e.g. syntactic and semantic information) are likely
to be required (Peters et al., 2018).
In contrast to the early word embedding models, deep learning models learn multi-
layered contextualized representations (e.g. McCann et al., 2018; Howard and Ruder,
2018; Peters et al., 2018). In deep neural networks for NLP, each layer learns one vector
representation for a term (Peters et al., 2018). Hence, a single term is represented by
several vectors—one vector from each layer. Although it cannot be specified a priori
which information is encoded in which hidden layer in a specific model trained on a
specific task, one can expect that the information encoded in lower layers is less complex
and more general whereas information encoded in higher layers is more complex and
more task-specific (Yosinski et al., 2014). The representations learned by a deep neural
language model thus may, for example, encode syntactic aspects at lower layers and
semantic context-dependent information in higher layers (see for example Peters et al.,
2018). As comparisons across NLP benchmark data sets show, deep neural language
models provide contextualized representations of language that generalize better across
a wide range of specific target tasks compared to the one layer representations from early
word embedding architectures (see e.g. McCann et al., 2018).

4.4 Adaptation: Feature Extraction vs. Fine-Tuning
There are two basic ways of how to implement the adaptation phase in transfer learning:
feature extraction vs. fine-tuning (see Figure 5) (Ruder, 2019). In a feature extraction
approach the representations learned in the pretraining phase are frozen and not altered
during adaptation (Ruder, 2019). In fine-tuning, in contrast, the pretrained parameters
are not fixed but are updated in the adaptation phase (Ruder, 2019).
An example for a feature extraction approach is ELMo (Embeddings from Language
Models) (Peters et al., 2018). In pretraining, ELMo learns three layers of word vectors
(Peters et al., 2018). These learned representations then are frozen and extracted to
serve as the input for a target task-specific model that learns a linear combination of
the three layers of pretrained vectors (Peters et al., 2018). Here, only the weights of the
linear model but not the frozen pretrained vectors are trained (Peters et al., 2018).
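The core idea of feature extraction can be sketched as follows with a pretrained Transformer encoder (BERT is used here for convenience instead of ELMo): the pretrained parameters are frozen, and only a small task-specific classifier on top of the extracted representations is trained.

import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
for param in encoder.parameters():
    param.requires_grad = False                     # freeze the pretrained representations

classifier = nn.Linear(encoder.config.hidden_size, 2)      # only these weights are trained
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)

batch = tokenizer(["I agree :-)", "This is nonsense!"],
                  padding=True, return_tensors="pt")
with torch.no_grad():                               # pretrained model acts as a fixed feature extractor
    features = encoder(**batch).last_hidden_state[:, 0]    # one vector per document
logits = classifier(features)                       # gradients reach only the classifier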
In fine-tuning, typically the same model architecture used in pretraining is also used
for adaptation (Peters et al., 2019). A task-specific output layer is merely added to the
model (Peters et al., 2019). The parameters learned in the pretraining phase serve as
initializations for the model in the adaptation phase (Ruder, 2019). When training the
model on the target task, the gradients are allowed to backpropagate to the pretrained


Figure 5: Feature Extraction vs. Fine-Tuning in the Adaptation Phase. In a
feature extraction approach the parameters learned during pretraining on the source task are
not changed. Only the parameters in the separate target-task model architecture are trained. In
fine-tuning, the entire pretrained model architecture is trained on the target task such that
the pretrained parameters are also updated.

representations and thus to induce changes on these pretrained representations (Ruder,
2019). In contrast to the feature extraction approach, the pretrained parameters hence
are allowed to be fine-tuned to capture task-specific adjustments (Ruder, 2019).
When fine-tuning BERT on a target task, for example, a target task-specific output
layer is put on top of the pretraining architecture (Devlin et al., 2019). Then the entire
architecture is trained, meaning that all parameters (also those learned in pretraining)
are updated (Devlin et al., 2019).
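A condensed sketch of this fine-tuning setup with the Hugging Face transformers library is shown below; the two example documents, their labels, and the learning rate are placeholders, and in practice the update would be wrapped in a full training loop.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)              # adds a fresh task-specific output layer

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # all parameters are updated

batch = tokenizer(["I agree :-)", "This is nonsense!"],
                  padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

outputs = model(**batch, labels=labels)             # forward pass including the loss
outputs.loss.backward()                             # gradients also flow into the pretrained encoder
optimizer.step()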
The performance of feature extraction vs. fine-tuning is likely to be a function of the
dissimilarity between the source and the target task (Peters et al., 2019). At present, the
highest performances in text classification and sentiment analysis tasks are achieved by
pretrained language representation models that are fine-tuned to the target task (Ruder,
2020).
Fine-tuning, however, can be a tricky matter. The central parameter in fine-tuning is the
learning rate η with which the gradients are updated during training on the target task
(see equation 8) (Ruder, 2019). Too much fine-tuning (i.e. too high a learning rate) can
lead to catastrophic forgetting—a situation in which the general representations learned
during pretraining are overwritten and therefore forgotten when fine-tuning the model
(Kirkpatrick et al., 2017). Too careful a fine-tuning scheme (i.e. too low a learning rate),
in contrast, may lead to a very slow convergence process (Howard and Ruder, 2018). In
general it is recommended that the learning rate should be lower than the learning rate
used during pretraining such that the representations learned during pretraining are not
altered too much (Ruder, 2019).

To effectively fine-tune a pretrained model without catastrophic forgetting, Howard and
Ruder (2018) present a set of learning rate schedules that vary the learning rate over
the course of the adaptation phase. They also introduce discriminative fine-tuning—a
procedure in which each layer has a different learning rate during fine-tuning to account
for the fact that different layers encode different types of information (Howard and
Ruder, 2018; Yosinski et al., 2014).
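The idea of discriminative fine-tuning can be emulated with PyTorch optimizer parameter groups, as in the sketch below; the layer grouping and the geometric decay factor are illustrative choices and not the exact scheme of Howard and Ruder (2018).

import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

base_lr, decay = 2e-5, 0.9
param_groups = []
for layer_idx, layer in enumerate(model.bert.encoder.layer):
    depth_from_top = len(model.bert.encoder.layer) - 1 - layer_idx
    param_groups.append({
        "params": layer.parameters(),
        "lr": base_lr * (decay ** depth_from_top),  # lower layers get smaller learning rates
    })
param_groups.append({"params": model.classifier.parameters(), "lr": base_lr})
# (Embedding and pooler parameters are omitted here for brevity.)

optimizer = torch.optim.AdamW(param_groups)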

5 Attention-Based Natural Language Processing: The Transformer
As RNNs and derived architectures such as the LSTM are designed to process sequential
input data, they seem the natural model of choice for processing sequences of textual
tokens. The problem of recurrent architectures, however, is that they model dependen-
cies by sequentially propagating through the positions of the input sequence. Whether
or how well dependencies are learned by RNNs is a function of the distance between the
tokens that depend on each other (Goodfellow et al., 2016). The longer the distance
between the tokens, the less well the dependency tends to be learned (Goodfellow et al.,
2016).
A solution to this problem is provided by the attention mechanism, first introduced by
Bahdanau et al. (2015) for Neural Machine Translation (NMT), which allows modeling de-
pendencies between tokens irrespective of the distance between them. The Transformer is
a deep learning architecture that is solely based on attention mechanisms (Vaswani et al.,
2017). It overcomes the inefficiencies of recurrent models and introduces self-attention
(Vaswani et al., 2017). To provide a solid understanding of these methods, this section
first explains the attention mechanism and then introduces the Transformer.

5.1 The Attention Mechanism
The common task encountered in NMT is to translate a sequence of S tokens in language
G, {g1 , . . . , gs , . . . , gS }, to a sequence of T tokens in language A, {a1 , . . . , at , . . . , aT }
(Sutskever et al., 2014). The length of the input and output sequences may differ: ‘He
is giving a speech.’ translated to German is ‘Er hält eine Rede.’. The task thus is one of
sequence-to-sequence modeling whereby the sequences are of variable length (Sutskever
et al., 2014).
The classic architecture to solve this task is an encoder-decoder structure (see Figure
6) (Sutskever et al., 2014). The encoder maps the input tokens {g1 , . . . , gs , . . . , gS }
into a single vector of fixed dimensionality, context vector c, that is then provided to
the decoder that generates the sequence of output tokens {a1 , . . . , at , . . . , aT } from c
(Sutskever et al., 2014). In the original NMT articles, encoder and decoder are recurrent
models (Sutskever et al., 2014). Hence, the encoder sequentially processes each input
token embedding z[gs ] . The hidden state at time step s, hs , is a nonlinear function of the
