Neural Supervised Domain Adaptation by Augmenting Pre-trained Models with Random Units


Sara Meftah∗, Nasredine Semmar∗, Youssef Tamaazousti∗, Hassane Essafi∗, Fatiha Sadat+
∗ CEA-List, Université Paris-Saclay, F-91120, Palaiseau, France
+ UQÀM, Montréal, Canada
{firstname.lastname}@cea.fr, sadat.fatiha@uqam.ca

arXiv:2106.04935v1 [cs.CL] 9 Jun 2021

Abstract

Neural Transfer Learning (TL) is becoming ubiquitous in Natural Language Processing (NLP), thanks to its high performance on many tasks, especially in low-resourced scenarios. Notably, TL is widely used for neural domain adaptation to transfer valuable knowledge from high-resource to low-resource domains. In the standard fine-tuning scheme of TL, a model is initially pre-trained on a source domain and subsequently fine-tuned on a target domain and, therefore, source and target domains are trained using the same architecture. In this paper, we show through interpretation methods that such a scheme, despite its efficiency, suffers from a main limitation. Indeed, although capable of adapting to new domains, pre-trained neurons struggle with learning certain patterns that are specific to the target domain. Moreover, we shed light on the hidden negative transfer occurring despite the high relatedness between source and target domains, which may mitigate the final gain brought by transfer learning. To address these problems, we propose to augment the pre-trained model with normalised, weighted and randomly initialised units that foster a better adaptation while maintaining the valuable source knowledge. We show that our approach exhibits significant improvements over the standard fine-tuning scheme for neural domain adaptation from the news domain to the social media domain on four NLP tasks: part-of-speech tagging, chunking, named entity recognition and morphosyntactic tagging.1

1 Under review

1 Introduction

NLP aims to produce resources and tools to understand texts coming from standard languages and their linguistic varieties, such as dialects or user-generated content in social media platforms. This diversity is a challenge for developing high-level tools that are capable of understanding and generating all forms of human languages. Furthermore, in spite of the tremendous empirical results achieved by NLP models based on Neural Networks (NNs), these models are in most cases based on a supervised learning paradigm, i.e. trained from scratch on large amounts of labelled examples. Nevertheless, such a training scheme is not fully optimal. Indeed, NLP neural models with high performance often require huge volumes of manually annotated data to produce powerful results and prevent overfitting. However, manual data annotation is time-consuming. Besides, language changes over the years (Eisenstein, 2019). Thus, most language varieties are under-resourced (Baumann and Pierrehumbert, 2014; Duong, 2017).

Particularly, in spite of the valuable advantage of social media content analysis for a variety of applications (e.g. advertisement, health, or security), this large domain is still poor in terms of annotated data. Furthermore, it has been shown that models intended for news fail to work efficiently on Tweets (Owoputi et al., 2013). This is mainly due to the conversational nature of the text, the lack of conventional orthography, the noise, linguistic errors, spelling inconsistencies, informal abbreviations and the idiosyncratic style of these texts (Horsmann, 2018).

One of the best approaches to address this issue is Transfer Learning (TL), an approach that handles the problem of the lack of annotated data, whereby relevant knowledge previously learned in a source problem is leveraged to help in solving a new target problem (Pan et al., 2010). In the context of artificial NNs, TL relies on a model learned on a source task with sufficient data, further adapted to the target task of interest. TL has been shown to be powerful for NLP and outperforms the standard supervised learning-from-scratch paradigm, because it takes benefit from the pre-learned knowledge.
Particularly, the standard fine-tuning (SFT) scheme of sequential transfer learning has been shown to be efficient for supervised domain adaptation from the source news domain to the target social media domain (Gui et al., 2017; Meftah et al., 2018b,a; März et al., 2019; Zhao et al., 2017; Lin and Lu, 2018).

In this work we first propose a series of analyses to spot the limits of the standard fine-tuning adaptation scheme of sequential transfer learning. We start by taking a step towards identifying and analysing the hidden negative transfer when transferring from the news domain to the social media domain. Negative transfer (Rosenstein et al., 2005; Wang et al., 2019) occurs when the knowledge learnt in the source domain hampers the learning of new knowledge from the target domain. Particularly, when the source and target domains are dissimilar, transfer learning may fail and hurt the performance, leading to a worse performance compared to the standard supervised training from scratch. In this work, we rather perceive the gain brought by the standard fine-tuning scheme compared to random initialisation2 as a combination of a positive transfer and a hidden negative transfer. We define positive transfer as the percentage of predictions that were wrongly predicted by random initialisation, but that transfer learning changed to the correct ones. Negative transfer represents the percentage of predictions that were tagged correctly by random initialisation, but for which transfer learning gives incorrect predictions. Hence, the final gain brought by transfer learning is the difference between positive and negative transfer. We show that despite the final positive gain brought by transfer learning from the high-resource news domain to the low-resource social media domain, the hidden negative transfer may mitigate the final gain.

2 Random initialisation means training from scratch on target data (in-domain data).

Then we perform an interpretive analysis of individual pre-trained neurons' behaviours in different settings. We find that some pre-trained neurons are biased by what they have learnt in the source dataset. For instance, we observe a unit3 firing on proper nouns (e.g. "George" and "Washington") before fine-tuning and on words with a capitalised first letter, whether the word is a proper noun or not (e.g. "Man" and "Father"), during fine-tuning. Indeed, in news, only proper nouns start with an upper-case letter. Thus the pre-trained units fail to discard this pattern, which is not always respected in user-generated content in social media. As a consequence of this phenomenon, patterns specific to the target dataset (e.g. "wanna" or "gonna") are difficult to learn by pre-trained units. This phenomenon is undesirable, since such specific units are essential, especially for target-specific classes (Zhou et al., 2018b; Lakretz et al., 2019).

3 We use "unit" and "neuron" interchangeably.

Stemming from our analysis, we propose a new method to overcome the above-mentioned drawbacks of the standard fine-tuning scheme of transfer learning. Precisely, we propose a hybrid method that takes benefit from both worlds, random initialisation and transfer learning, without their drawbacks. It consists in augmenting the source network (the set of pre-trained units) with randomly initialised units (which are by design non-biased) and jointly learning them. We call our method PretRand (Pretrained and Random units). PretRand consists of three main ideas:

  1. Augmenting the source network (the set of pre-trained units) with a random branch composed of randomly initialised units, and jointly learning them.

  2. Normalising the outputs of both branches to balance their different behaviours and thus forcing the network to consider both.

  3. Applying learnable attention weights on both branches' predictors to let the network learn which of the random or pre-trained one is better for every class.

Our experiments on 4 NLP tasks: Part-of-Speech tagging (POS), Chunking (CK), Named Entity Recognition (NER) and Morphosyntactic Tagging (MST) show that PretRand enhances considerably the performance compared to the standard fine-tuning adaptation scheme.4

4 This paper is an extension of our previous work (Meftah et al., 2019).

The remainder of this paper is organised as follows. Section 2 presents the background related to our work: transfer learning and interpretation methods for NLP. Section 3 presents the base neural architecture used for sequence labelling in NLP. Section 4 describes our proposed methods to analyse the standard fine-tuning scheme of sequential transfer learning. Section 5 describes our proposed approach PretRand. Section 6 reports the datasets
and the experimental setup. Section 7 reports the experimental results of our proposed methods and is divided into two sub-sections: Sub-section 7.1 reports the empirical analysis of the standard fine-tuning scheme, highlighting its drawbacks. Sub-section 7.2 presents the experimental results of our proposed approach PretRand, showing the effectiveness of PretRand on different tasks and datasets and the impact of incorporating contextualised representations. Finally, Section 8 wraps up by discussing our findings and future research directions.

2 Background

Since our work involves two research topics, Sequential Transfer Learning (STL) and interpretation methods, we discuss in the following sub-sections the state of the art of each topic with a positioning of our contributions regarding each one.

2.1 Sequential Transfer Learning

In STL, training is performed in two stages, sequentially: pretraining on the source task, followed by an adaptation on the downstream target tasks (Ruder, 2019). The purpose behind using STL techniques for NLP can be divided into two main research areas: universal representations and domain adaptation.

Universal representations aim to build neural features (e.g. word embeddings and sentence embeddings) that are transferable and beneficial to a wide range of downstream NLP tasks and domains. Indeed, the probabilistic language model proposed by Bengio et al. (2003) was the genesis of what we call word embeddings in NLP, while Word2Vec (Mikolov et al., 2013) was its outbreak and a starting point for a surge of works on learning word embeddings: e.g. FastText (Bojanowski et al., 2017) enriches Word2Vec with subword information. Recently, universal representations re-emerged with contextualised representations, handling a major drawback of traditional word embeddings. Indeed, the latter learn a single context-independent representation for each word, thus ignoring word polysemy. Therefore, contextualised word representations aim to learn context-dependent word embeddings, i.e. considering the entire sequence as input to produce each word's embedding.

While universal representations seek to be propitious for any downstream task, domain adaptation is designed for particular target tasks. Domain adaptation consists in adapting NLP models designed for a specific high-resourced source setting (language, language variety, domain, task, etc.) to work in a target low-resourced setting. It includes two categories. First, unsupervised domain adaptation assumes that labelled examples in the source domain are sufficiently available, but that for the target domain only unlabelled examples are available. Second, in the supervised domain adaptation setting, a small number of labelled target examples are assumed to be available.

Pretraining

In the pretraining stage of STL, a crucial key for the success of transfer is the choice of the pretraining task and domain. For universal representations, the pre-trained task is expected to encode useful features for a wide number of target tasks and domains. In comparison, for domain adaptation, the pre-trained task is expected to be most suitable for the target task in mind. We classify pretraining methods into four main categories: unsupervised, supervised, multi-task and adversarial pretraining:
  • Unsupervised pretraining uses raw unlabelled data for pretraining. Particularly, it has been successfully used in a wide range of seminal works to learn universal representations. The language modelling task has been particularly used thanks to its ability to capture general-purpose features of language.5 For instance, TagLM (Peters et al., 2017) is a pre-trained model based on a bidirectional language model (biLM), also used to generate ELMo (Embeddings from Language Models) representations (Peters et al., 2018). With the recent emergence of the "Transformers" architectures (Vaswani et al., 2017), many works propose pretrained models based on these architectures (Devlin et al., 2019; Yang et al., 2019; Raffel et al., 2019). Unsupervised pretraining has also been used to improve sequence-to-sequence learning. We can cite the work of Ramachandran et al. (2017), who proposed to improve the performance of an encoder-decoder neural machine translation model by initialising both encoder and decoder parameters with pretrained weights from two language models.

5 Note that language modelling is also considered as a self-supervised task since, in fact, labels are automatically generated from raw data.

  • Supervised pretraining has been particularly used for cross-lingual transfer (e.g. machine translation (Zoph and Knight, 2016)), cross-task transfer from POS tagging to the word segmentation task (Yang et al., 2017) and cross-domain transfer for biomedical texts, for question answering by Wiese et al. (2017) and for NER by Giorgi and Bader (2018). Cross-domain transfer has also been used to transfer from news to social media texts for POS tagging (Meftah et al., 2017; März et al., 2019) and sentiment analysis (Zhao et al., 2017). Supervised pretraining has also been used effectively for universal representations learning, e.g. neural machine translation (McCann et al., 2017), language inference (Conneau et al., 2017) and discourse relations (Nie et al., 2017).

  • Multi-task pretraining has been successfully applied to learn general universal sentence representations by a simultaneous pretraining on a set of supervised and unsupervised tasks (Subramanian et al., 2018; Cer et al., 2018). Subramanian et al. (2018), for instance, proposed to learn universal sentence representations by a joint pretraining on skip-thoughts, machine translation, constituency parsing, and natural language inference. For domain adaptation, we have performed in (Meftah et al., 2020) a multi-task pretraining for supervised domain adaptation from the news domain to the social media domain.

  • Adversarial pretraining is particularly used for domain adaptation when some annotated examples from the target domain are available. Adversarial training (Ganin et al., 2016) is used as a pretraining step followed by an adaptation step on the target dataset. Adversarial pretraining demonstrated its effectiveness in several NLP tasks, e.g. cross-lingual sentiment analysis (Chen et al., 2018). Also, it has been used to learn cross-lingual word embeddings (Lample et al., 2018).

Adaptation

During the adaptation stage of STL, one or more layers from the pretrained model are transferred to the downstream task, and one or more randomly initialised layers are added on top of the pretrained ones. Three main adaptation schemes are used in sequential transfer learning: Feature Extraction, Fine-Tuning and the recent Residual Adapters.

In a Feature Extraction scheme, the pretrained layers' weights are frozen during adaptation, while in a Fine-Tuning scheme the weights are tuned (see the short sketch at the end of this sub-section). Accordingly, the former is computationally inexpensive while the latter allows a better adaptation to the target domain's peculiarities. In general, fine-tuning pretrained models begets better results, except in cases wherein the target domain's annotations are sparse or noisy (Dhingra et al., 2017; Mou et al., 2016). Peters et al. (2019) found that for contextualised representations both adaptation schemes are competitive, but that the appropriate adaptation scheme to pick depends on the similarity between the source and target problems. Recently, Residual Adapters were proposed by Houlsby et al. (2019) to adapt pretrained models based on the Transformers architecture, aiming to keep the Fine-Tuning scheme's advantages while reducing the number of parameters to update during the adaptation stage. This is achieved by adding adapters (intermediate layers with a small number of parameters) on top of each pretrained layer. Thus, pretrained layers are frozen, and only adapters are updated during training. Therefore, Residual Adapters' performance is near to Fine-Tuning while being computationally cheaper (Pfeiffer et al., 2020b,a,c).

Our work

Our work falls under the supervised domain adaptation research area, specifically cross-domain adaptation from the news domain to the social media domain. The fine-tuning adaptation scheme has been successfully applied to domain adaptation from the news domain to the social media domain (e.g. adversarial pretraining (Gui et al., 2017) and supervised pretraining (Meftah et al., 2018a)). In this research, we highlight the aforementioned drawbacks (biased pre-trained units and the hidden negative transfer) of the standard fine-tuning adaptation scheme. Then, we propose a new adaptation scheme (PretRand) to handle these problems. Furthermore, while the efficiency of ELMo contextualised word representations has been proven for different tasks and datasets (Peters et al., 2019; Fecht et al., 2019; Schumacher and Dredze, 2019), here we investigate their impact when used, simultaneously, with a sequential transfer learning scheme for supervised domain adaptation.
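To make the practical difference between the two classical adaptation schemes concrete, the following minimal sketch (our own illustration, not code from any of the cited works) shows how a pretrained encoder is frozen under Feature Extraction and left trainable under Fine-Tuning; it assumes PyTorch, and the module names and sizes are arbitrary.

```python
import torch.nn as nn

# Hypothetical pretrained encoder and freshly initialised task classifier.
encoder = nn.LSTM(input_size=100, hidden_size=200, batch_first=True)
classifier = nn.Linear(200, 17)  # e.g. 17 target tags

def set_adaptation_scheme(scheme: str) -> None:
    """Feature Extraction freezes the pretrained weights; Fine-Tuning updates them."""
    freeze = scheme == "feature_extraction"
    for param in encoder.parameters():
        param.requires_grad = not freeze
    for param in classifier.parameters():
        param.requires_grad = True  # the newly added layer is always trained on target data

set_adaptation_scheme("fine_tuning")
```

Residual Adapters would instead keep the pretrained layers frozen and train only small adapter modules inserted on top of them.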
2.2 Interpretation methods for NLP

Recently, a rising interest has been devoted to peeking inside black-box neural NLP models to interpret their internal representations and their functioning. A variety of methods were proposed in the literature; here we only discuss those that are most related to our research.

Probing tasks is a common approach for NLP model analysis used to investigate which linguistic properties are encoded in the latent representations of the neural model (Shi et al., 2016). Concretely, given a neural model M trained on a particular NLP task, whether it is unsupervised (e.g. language modelling (LM)) or supervised (e.g. Neural Machine Translation (NMT)), a shallow classifier is trained on top of the frozen M on a corpus annotated with the linguistic properties of interest. The aim is to examine whether M's hidden representations encode the property of interest. For instance, Shi et al. (2016) found that different levels of syntactic information are learned by the NMT encoder's layers. Adi et al. (2016) investigated what information (between sentence length, word order and word content) is captured by different sentence embedding learning methods. Conneau et al. (2018) proposed 10 probing tasks annotated with fine-grained linguistic properties and compared different approaches for sentence embeddings. Zhu et al. (2018) inspected which semantic properties (e.g. negation, synonymy, etc.) are encoded by different sentence embedding approaches. Furthermore, the emergence of contextualised word representations has triggered a surge of works on probing what these representations are learning (Liu et al., 2019a; Clark et al., 2019). This approach, however, suffers from two main flaws. First, probing tasks examine properties captured by the model at a coarse-grained level, i.e. layer representations, and thereby will not identify features captured by individual neurons. Second, probing tasks will not identify linguistic properties that do not appear in the annotated probing datasets (Zhou et al., 2018a).

Individual units stimulus: Inspired by works on receptive fields of biological neurons (Hubel and Wiesel, 1965), much work has been devoted to interpreting and visualising individual hidden units' stimulus features in neural networks, initially in computer vision (Coates and Ng, 2011; Girshick et al., 2014; Zhou et al., 2015), and more recently in NLP, wherein unit activations are visualised in heatmaps. For instance, Karpathy et al. (2016) visualised character-level Long Short-Term Memory (LSTM) cells learned in language modelling and found multiple interpretable units that track long-distance dependencies, such as line lengths and quotes; Radford et al. (2017) visualised a unit which performs sentiment analysis in a language model based on Recurrent Neural Networks (RNNs); Bau et al. (2019) visualised neurons specialised on tense, gender, number, etc. in NMT models; and Kádár et al. (2017) proposed a top-k-contexts approach to identify the sentences, and thus the linguistic patterns, sparking the highest activation values of each unit in an RNN-based model.

Neural representations correlation analysis: Cross-network and cross-layer correlation is a significant approach to gain insights on how internal representations may vary across networks, network depth and training time. Suitable approaches are based on Canonical Correlation Analysis (CCA) (Hotelling, 1992; Uurtio et al., 2018), such as Singular Vector Canonical Correlation Analysis (Raghu et al., 2017) and Projected Weighted Canonical Correlation Analysis (Morcos et al., 2018), which were successfully used in NLP neural model analysis. For instance, it was used by Bau et al. (2019) to calculate cross-network correlation for ranking important neurons in NMT and LM. Saphra and Lopez (2019) applied it to probe the evolution of syntactic, semantic, and topic representations across time and across layers. Raghu et al. (2019) compared the internal representations of models trained from scratch vs models initialised with pre-trained weights. CCA-based methods aim to calculate the similarity between neural representations at a coarse-grained level. In contrast, correlation analysis at the fine-grained level, i.e. between individual neurons, has also been explored in the literature. Initially, Li et al. (2015) used Pearson's correlation to examine to which extent each individual unit is correlated to another unit, either within the same network or between different networks. The same correlation metric was used by Bau et al. (2019) to determine important neurons in NMT and LM tasks.

Our Work:
In this work, we propose two approaches (§4.2) to highlight the bias effect in the standard fine-tuning scheme of transfer learning in NLP: the first method is based on individual units' stimulus and the second on neural representations correlation analysis. To the best of our knowledge, we are the first to harness these interpretation methods to analyse individual units' behaviour in a transfer learning scheme. Furthermore, the most analysed tasks in the literature are Natural Language Inference, NMT and LM (Belinkov and Glass, 2019); here we target tasks that are under-explored in visualisation works, such as POS, MST, CK and NER.

3 Base Neural Sequence Labelling Model

Given an input sentence S of n successive tokens S = [w_1, ..., w_n], the goal of sequence labelling is to predict the label c_t ∈ C of every w_t, with C being the tag-set. We use a commonly used end-to-end neural sequence labelling model (Ma and Hovy, 2016; Plank et al., 2016; Yang et al., 2018), which is composed of three components (illustrated in Figure 1). First, the Word Representation Extractor (WRE), denoted Υ, computes a vector representation x_t for each token w_t. Second, this representation is fed into a Feature Extractor (FE) based on a bidirectional Long Short-Term Memory (biLSTM) network (Graves et al., 2013), denoted Φ. It produces a hidden representation, h_t, that is fed into a Classifier (Cl): a fully-connected layer (FCL), denoted Ψ. Formally, given w_t, the logits are obtained using the following equation: ŷ_t = (Ψ ◦ Φ ◦ Υ)(w_t).6

Figure 1: Illustrative scheme of the base neural model for sequence labelling tasks.

6 For simplicity, we define ŷ_t only as a function of w_t. In reality, the prediction ŷ_t for the word w_t is also a function of the remaining words in the sentence and the model's parameters, in addition to w_t.

In the standard supervised training scheme, the three modules are jointly trained from scratch by minimising the Softmax Cross-Entropy (SCE) loss using the Stochastic Gradient Descent (SGD) algorithm.

Let us consider a training set of M annotated sentences, where each sentence i is composed of m_i tokens. Given a training word (w_{i,t}, y_{i,t}) from the training sentence i, where y_{i,t} is the gold standard label for the word w_{i,t}, the cross-entropy loss for this example is calculated as follows:

    L_{(i,t)} = - y_{i,t} \times \log(\hat{y}_{i,t}) .    (1)

Thus, during the training of the sequence labelling model on the M annotated sentences, the model's loss is defined as follows:

    L = \sum_{i=1}^{M} \sum_{t=1}^{m_i} L_{(i,t)} .    (2)
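For concreteness, here is a minimal PyTorch sketch of the Υ → Φ → Ψ pipeline and of the summed softmax cross-entropy loss of Eqs. (1)-(2). The embedding-based WRE, the layer sizes and the toy batch are illustrative assumptions, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

class SequenceLabeller(nn.Module):
    """WRE (Υ) -> biLSTM feature extractor (Φ) -> fully-connected classifier (Ψ)."""
    def __init__(self, vocab_size, emb_dim, hidden_dim, n_tags):
        super().__init__()
        self.wre = nn.Embedding(vocab_size, emb_dim)               # Υ (word representations)
        self.fe = nn.LSTM(emb_dim, hidden_dim, bidirectional=True,
                          batch_first=True)                        # Φ (biLSTM)
        self.cl = nn.Linear(2 * hidden_dim, n_tags)                # Ψ (classifier)

    def forward(self, words):                  # words: (batch, seq_len) token ids
        x = self.wre(words)                    # x_t = Υ(w_t)
        h, _ = self.fe(x)                      # h_t = Φ(x_t)
        return self.cl(h)                      # logits ŷ_t = Ψ(h_t)

model = SequenceLabeller(vocab_size=20000, emb_dim=100, hidden_dim=200, n_tags=17)
words = torch.randint(0, 20000, (2, 8))        # toy batch: 2 sentences of 8 tokens
gold = torch.randint(0, 17, (2, 8))
logits = model(words)
# Softmax cross-entropy summed over all tokens of all sentences, as in Eqs. (1)-(2).
loss = nn.functional.cross_entropy(logits.view(-1, 17), gold.view(-1), reduction="sum")
loss.backward()
```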
4 Analysis of the Standard Fine-Tuning Scheme

The standard fine-tuning scheme consists in transferring a part of the learned weights from a source model to initialise the target model, which is further fine-tuned on the target task with a small number of training examples from the target domain. Given a source neural network M_s with a set of parameters θ_s split into two sets, θ_s = (θ_s1, θ_s2), and a target network M_t with a set of parameters θ_t split into two sets, θ_t = (θ_t1, θ_t2), the standard fine-tuning scheme of transfer learning includes three simple yet effective steps:

  1. We train the source model on annotated data from the source domain on a source dataset.

  2. We transfer the first set of parameters from the source network M_s to the target network M_t: θ_t1 = θ_s1, whereas the second set θ_t2 of parameters is randomly initialised.

  3. Then, the target model is further fine-tuned on the small target dataset.

Source and target datasets may have different tag-sets, even within the same NLP task. Hence, transferring the parameters of the classifier (Ψ) may not be feasible in all cases. Therefore, in our experiments, the WRE's layers (Υ) and the FE's layers (Φ) are initialised with the source model's weights and Ψ is randomly initialised. Then, the three modules are further jointly trained on the target dataset by minimising an SCE loss using the SGD algorithm.
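The three steps can be sketched as follows, reusing the hypothetical SequenceLabeller from the previous sketch; the shared word vocabulary and the source/target tag-set sizes are assumptions made only for the illustration.

```python
import torch

# Step 1: the source model, assumed to have been trained on the annotated news data
# (training loop omitted), with its own tag-set size.
source_model = SequenceLabeller(vocab_size=20000, emb_dim=100, hidden_dim=200, n_tags=45)

# Step 2: transfer θ_t1 = θ_s1 (here the WRE Υ and the FE Φ); the classifier Ψ (θ_t2)
# keeps its random initialisation because source and target tag-sets may differ.
target_model = SequenceLabeller(vocab_size=20000, emb_dim=100, hidden_dim=200, n_tags=17)
transferred = {name: weight for name, weight in source_model.state_dict().items()
               if name.startswith(("wre.", "fe."))}
target_model.load_state_dict(transferred, strict=False)

# Step 3: further fine-tune the whole target model on the small target dataset
# (SGD + softmax cross-entropy, as in the standard supervised scheme).
optimizer = torch.optim.SGD(target_model.parameters(), lr=0.01)
```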

4.1 The Hidden Negative Transfer

It has been shown in many works in the literature (Rosenstein et al., 2005; Ge et al., 2014; Ruder, 2019; Gui et al., 2018; Cao et al., 2018; Chen et al., 2019; Wang et al., 2019; O'Neill, 2019) that, when the source and target domains are less related (e.g. languages from different families), sequential transfer learning may lead to a negative effect on the performance, instead of improving it. This phenomenon is referred to as negative transfer. Precisely, negative transfer is considered to occur when transfer learning is harmful to the target task/dataset, i.e. when the performance obtained with the transfer learning algorithm is lower than that of a solely supervised training on in-target data (Torrey and Shavlik, 2010).

In NLP, the negative transfer phenomenon has only seldom been studied. We can cite the recent work of Kocmi (2020), who evaluated negative transfer in transfer learning for neural machine translation when the transfer is performed between different language pairs. They found that: 1) The distribution mismatch between source and target language pairs does not beget a negative transfer. 2) The transfer may have a negative impact when the source language pair is less resourced compared to the target one, in terms of annotated examples.

Our experiments in (Meftah et al., 2018a,b) have shown that transfer learning techniques from the news domain to the social media domain using the standard fine-tuning scheme boost the tagging performance. Hence, following the above definition, transfer learning from news to social media does not beget a negative transfer. Contrariwise, in this work, we instead consider the hidden negative transfer, i.e. the percentage of predictions that were correctly tagged by random initialisation, but for which transfer learning gives wrong predictions.

Let us consider the gain G_i brought by the standard fine-tuning scheme (SFT) of transfer learning compared to the random initialisation for a dataset i. G_i is defined as the difference between positive transfer PT_i and negative transfer NT_i:

    G_i = PT_i - NT_i ,    (3)

where positive transfer PT_i represents the percentage of tokens that were wrongly predicted by random initialisation but that the SFT changed to the correct ones. Negative transfer NT_i represents the percentage of words that were tagged correctly by random initialisation but for which the SFT gives wrong predictions. PT_i and NT_i are defined as follows:

    PT_i = \frac{N_i^{corrected}}{N_i} ,    (4)

    NT_i = \frac{N_i^{falsified}}{N_i} ,    (5)

where N_i is the total number of tokens in the validation set, N_i^{corrected} is the number of tokens from the validation set that were wrongly tagged by the model trained from scratch but are correctly predicted by the SFT scheme, and N_i^{falsified} is the number of tokens from the validation set that were correctly tagged by the model trained from scratch but are wrongly predicted by the SFT scheme.
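A small NumPy sketch of Eqs. (3)-(5), counting the tokens corrected and falsified by the SFT scheme relative to random initialisation on a validation set; the toy tag ids are of course illustrative.

```python
import numpy as np

def transfer_gain(gold, pred_random, pred_sft):
    """Positive transfer PT, negative transfer NT and gain G = PT - NT (Eqs. 3-5)."""
    gold, pred_random, pred_sft = map(np.asarray, (gold, pred_random, pred_sft))
    n = len(gold)
    random_ok = pred_random == gold
    sft_ok = pred_sft == gold
    corrected = np.sum(~random_ok & sft_ok)   # wrong from scratch, fixed by SFT
    falsified = np.sum(random_ok & ~sft_ok)   # right from scratch, broken by SFT
    pt, nt = corrected / n, falsified / n
    return pt, nt, pt - nt

# Toy example with 6 validation tokens (tag ids).
pt, nt, gain = transfer_gain(gold=[1, 2, 3, 4, 5, 6],
                             pred_random=[1, 2, 0, 0, 5, 6],
                             pred_sft=[1, 2, 3, 4, 0, 6])
print(pt, nt, gain)   # 2/6 corrected, 1/6 falsified, gain = 1/6
```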
4.2 Interpretation of Pretrained Neurons

Here, we propose to perform a set of analysis techniques to gain some insights into how the inner pretrained representations are updated during fine-tuning on social media datasets when using the standard fine-tuning scheme of transfer learning. For this, we propose to analyse the feature extractor's (Φ) activations. Precisely, we attempt to visualise biased neurons, i.e. pre-trained neurons that do not change much from their initial state.

Let us consider a validation set of N words; the feature extractor Φ generates a matrix h ∈ M_{N,H}(R) of activations over all the words of the validation set, where M_{f,g}(R) is the space of f × g matrices over R and H is the size of the hidden representation (number of neurons). Each element h_{i,j} of the matrix represents the activation of the neuron j on the word w_i.

Given two models, the first before fine-tuning and the second after fine-tuning, we obtain two matrices h^{before} ∈ M_{N,H}(R) and h^{after} ∈ M_{N,H}(R), which give the activations of Φ over all the validation set's words before and after fine-tuning, respectively.

We aim to visualise and quantify the change of the representations generated by the model from the initial state, h^{before} (before fine-tuning), to the final state, h^{after} (after fine-tuning). For this purpose, we perform two experiments:

  1. Quantifying the change of pretrained individual neurons (§4.2.1);

  2. Visualising the evolution of pretrained neurons' stimulus during fine-tuning (§4.2.2).

4.2.1 Quantifying the change of individual pretrained neurons

In order to quantify the change of the knowledge encoded in pretrained neurons after fine-tuning, we propose to calculate the similarity (correlation) between neuron activations before and after fine-tuning, when using the SFT adaptation scheme. Precisely, we calculate the correlation coefficient between each neuron's activations on the target-domain validation set before starting fine-tuning and at the end of fine-tuning.

Following the above formulation and as illustrated in Figure 2, from the h^{before} and h^{after} matrices we extract two vectors h^{before}_{.j} ∈ R^N and h^{after}_{.j} ∈ R^N, representing respectively the activations of a unit j over all the validation set's words before and after fine-tuning. Next, we generate an asymmetric correlation matrix C ∈ M_{H,H}(R), where each element c_{jt} of the matrix represents the Pearson correlation between the activation vector of unit j after fine-tuning (h^{after}_{.j}) and the activation vector of unit t before fine-tuning (h^{before}_{.t}), computed as follows:

    c_{jt} = \frac{E[(h^{after}_{.j} - \mu^{after}_{j})(h^{before}_{.t} - \mu^{before}_{t})]}{\sigma^{after}_{j} \, \sigma^{before}_{t}} .    (6)

Here, μ and σ denote, respectively, the mean and the standard deviation of the corresponding unit's activations over the validation set. Clearly, we are interested in the matrix diagonal, where c_{jj} represents the charge of each unit j from Φ, i.e. the correlation between the unit's activations after fine-tuning and its activations before fine-tuning.

4.2.2 Visualising the Evolution of Pretrained Neurons' Stimulus during Fine-tuning

Here, we perform unit visualisation at the individual level to gain insights on how the patterns encoded by individual units progress during fine-tuning when using the SFT scheme. To do this, we generate the top-k activated words for each unit, i.e. the words in the validation set that fire the said unit the most, positively and negatively (since LSTMs generate positive and negative activations). In (Kádár et al., 2017), the top-k activated contexts were plotted at the end of training (for the best model), which shows what each unit is specialised in, but it does not give insights about how the said unit is evolving and changing during training. Thus, taking into account only the final state of training does not reveal the whole picture. Here, we instead propose to generate and plot the top-k words activating each unit throughout the adaptation stage. We follow two main steps (as illustrated in Figure 3):

  1. We represent each unit j from Φ with a matrix A^{(j)} ∈ M_{N,D}(R) of the said unit's activations on all the validation set at different training epochs, where D is the number of epochs and N is the number of words in the validation set. Thus, each element a^{(j)}_{y,z} represents the activation of the unit j on the word w_y at the epoch z.

  2. We carry out a sorting of each column of the matrix (each column represents an epoch) and pick the highest k words (for the top-k words firing the unit positively) and the lowest k words (for the top-k words firing the unit negatively), leading to two matrices, A^{(j)}_{best+} ∈ M_{D,k}(R) and A^{(j)}_{best-} ∈ M_{D,k}(R), the first for the top-k words activating the unit j positively at each training epoch, and the second for the top-k words activating the unit j negatively at each training epoch.
Figure 2: Illustrative scheme of the computation of the charge of unit j, i.e. the Pearson correlation between the unit's activation vector after fine-tuning and its activation vector before fine-tuning.
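The charge of §4.2.1 (the diagonal c_jj of C in Eq. (6)) can be computed directly from the two activation matrices. Below is a NumPy sketch, under the assumption that h_before and h_after are the N × H matrices defined above:

```python
import numpy as np

def unit_charges(h_before, h_after):
    """Pearson correlation between each unit's activations before and after
    fine-tuning, i.e. the diagonal c_jj of the correlation matrix C (Eq. 6)."""
    hb = (h_before - h_before.mean(axis=0)) / h_before.std(axis=0)
    ha = (h_after - h_after.mean(axis=0)) / h_after.std(axis=0)
    # Mean over the N words of the product of standardised activations.
    return (ha * hb).mean(axis=0)              # shape: (H,), one charge per unit

# Toy example: N = 1000 words, H = 4 units.
rng = np.random.default_rng(0)
h_before = rng.normal(size=(1000, 4))
h_after = 0.9 * h_before + 0.1 * rng.normal(size=(1000, 4))   # units barely change
print(unit_charges(h_before, h_after))         # values close to 1 flag "biased" units
```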

Figure 3: Illustrative scheme of the computation of the top-k words activating unit j, positively (A^{(j)}_{best+}) and negatively (A^{(j)}_{best-}), during the fine-tuning epochs. h^{epoch z} stands for Φ's outputs at epoch number z.
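A sketch of the two steps of §4.2.2 for a single unit, assuming A is its N × D activation matrix (rows: validation words, columns: epochs) and words is the list of the N word strings; the names and the toy data are illustrative.

```python
import numpy as np

def top_k_words_per_epoch(A, words, k=5):
    """For one unit, return the k most positively and k most negatively
    activating validation words at every fine-tuning epoch (A: N words x D epochs)."""
    order = np.argsort(A, axis=0)              # column-wise (per-epoch) sort of activations
    best_pos = [[words[i] for i in order[-k:, d][::-1]] for d in range(A.shape[1])]
    best_neg = [[words[i] for i in order[:k, d]] for d in range(A.shape[1])]
    return best_pos, best_neg                  # D lists of k words each (A_best+ / A_best-)

# Toy example: 6 words, 3 epochs.
words = ["George", "Man", "wanna", "the", "gonna", "Washington"]
A = np.random.default_rng(1).normal(size=(6, 3))
pos, neg = top_k_words_per_epoch(A, words, k=2)
print(pos[0], neg[0])                          # top words at the first epoch
```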
5 Joint Learning of Pretrained and Random Units: PretRand

We found from our analysis (in Section 7.1) of pre-trained neurons' behaviours that the standard fine-tuning scheme suffers from a main limitation: some pre-trained neurons are still biased by what they have learned from the source domain despite the fine-tuning on the target domain. We thus propose a new adaptation scheme, PretRand, to take benefit from both worlds: the pre-learned knowledge in the pretrained neurons and the target-specific features easily learnt by random neurons. PretRand, illustrated in Figure 4, consists of three steps:

  1. Augmenting the pre-trained branch with a random one to facilitate the learning of new target-specific patterns (§5.1);

  2. Normalising both branches to balance their behaviours during fine-tuning (§5.2);

  3. Applying learnable weights on both branches to let the network learn which of the random or pre-trained one is better for every class (§5.3).

Figure 4: Illustrative scheme of the three ideas composing our proposed adaptation method, PretRand. a) We augment the pre-trained branch (grey branch) with a randomly initialised one (green branch) and jointly adapt them. An element-wise sum is further applied to merge the two branches. b) Before merging, we balance the different behaviours of pre-trained and random units, using an independent normalisation (N). c) Finally, we let the network learn which of the pre-trained or random neurons are more suited for every class, by performing an element-wise product of the FC layers with learnable weighting vectors (u and v, initialised with 1-values).

5.1 Adding the Random Branch

We expect that augmenting the pretrained model with new randomly initialised neurons allows a better adaptation during fine-tuning. Thus, in the adaptation stage, we augment the pre-trained model with a random branch consisting of additional random units (as illustrated in scheme "a" of Figure 4). Several works have shown that deep (top) layers are more task-specific than shallow (low) ones (Peters et al., 2018; Mou et al., 2016); shallow layers learn generic features that are easily transferable between tasks. In addition, word embeddings (shallow layers) contain the majority of the parameters. Based on these factors, we choose to expand only the top layers, as a trade-off between performance and the number of parameters (model complexity). In terms of the expanded layers, we add an extra biLSTM layer of k units in the FE (Φ_r, r for random) and a new fully-connected layer of C units (called Ψ_r). With this choice, we increase the complexity of the model by only 1.02× compared to the base one (the standard fine-tuning scheme).

Concretely, for every w_i, two prediction vectors are computed: ŷ^p_i from the pre-trained branch and ŷ^r_i from the random one. Specifically, the pre-trained branch predicts class probabilities following:

    \hat{y}_i^p = (\Psi_p \circ \Phi_p)(x_i) ,    (7)

with x_i = Υ(w_i). Likewise, the additional random branch predicts class probabilities following:

    \hat{y}_i^r = (\Psi_r \circ \Phi_r)(x_i) .    (8)

To get the final predictions, we simply apply an element-wise sum between the outputs of the pre-trained branch and the random branch:

    \hat{y}_i = \hat{y}_i^p \oplus \hat{y}_i^r .    (9)

As in the classical scheme, the SCE loss is minimised, but here both branches are trained jointly.
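A minimal PyTorch sketch of the augmented architecture of §5.1 (Eqs. (7)-(9)): a shared WRE feeding a pre-trained biLSTM+FC branch and a randomly initialised one, whose class scores are summed. Feeding both branches directly from the word representations, as well as the layer sizes, are simplifying assumptions made for the illustration; in practice the pre-trained parameters would be loaded from the source model.

```python
import torch
import torch.nn as nn

class PretRandBranches(nn.Module):
    """Shared WRE Υ, pre-trained branch (Φ_p, Ψ_p) and random branch (Φ_r, Ψ_r)."""
    def __init__(self, vocab_size=20000, emb_dim=100, hidden_dim=200, n_tags=17, k=100):
        super().__init__()
        self.wre = nn.Embedding(vocab_size, emb_dim)                   # Υ (pre-trained)
        self.fe_p = nn.LSTM(emb_dim, hidden_dim, bidirectional=True,
                            batch_first=True)                          # Φ_p (pre-trained)
        self.cl_p = nn.Linear(2 * hidden_dim, n_tags)                  # Ψ_p
        self.fe_r = nn.LSTM(emb_dim, k, bidirectional=True,
                            batch_first=True)                          # Φ_r (random, k units)
        self.cl_r = nn.Linear(2 * k, n_tags)                           # Ψ_r (random)

    def forward(self, words):
        x = self.wre(words)                        # x_i = Υ(w_i)
        y_p = self.cl_p(self.fe_p(x)[0])           # ŷ_p = (Ψ_p ∘ Φ_p)(x)   (Eq. 7)
        y_r = self.cl_r(self.fe_r(x)[0])           # ŷ_r = (Ψ_r ∘ Φ_r)(x)   (Eq. 8)
        return y_p + y_r                           # element-wise sum        (Eq. 9)

model = PretRandBranches()
logits = model(torch.randint(0, 20000, (2, 8)))    # both branches trained jointly with SCE
```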
5.2 Independent Normalisation

Our first implementation of adding the random branch was less effective than expected. The main explanation is that the pre-trained units were dominating the random units: the weights, as well as the gradients and outputs, of the pre-trained units absorb those of the random units. As illustrated in the left plot of Figure 5, the absorption phenomenon stays true even at the end of the training process; we observe that the random units' weights are closer to zero. This absorption property handicaps the random units in firing on the words of the target dataset.7

7 The same problem was stated in some computer-vision works (Liu et al., 2015; Wang et al., 2017; Tamaazousti et al., 2017).

To alleviate this absorption phenomenon and push the random units to be more competitive, we normalise the outputs of both branches (ŷ^p_i and ŷ^r_i) using the ℓ2-norm, as illustrated in scheme "b" of Figure 4. The normalisation of a vector x is computed using the following formula:

    N_2(x) = \left[ \frac{x_i}{\lVert x \rVert_2} \right]_{i=1}^{i=|x|} .    (10)

Thanks to this normalisation, the absorption phenomenon was solved, and the random branch starts to be more effective (see the right distribution of Figure 5).

Figure 5: The distributions of the learnt weight values for the randomly initialised (green) and pre-trained (grey) fully-connected layers after their joint training. Left: without normalisation; right: with normalisation.

Furthermore, we have observed that, despite the normalisation, the performance of the pre-trained classifiers is still much better than that of the randomly initialised ones. Thus, to make the latter more competitive, we propose to start by optimising only the randomly initialised units while freezing the pre-trained ones, and then to launch the joint training. We call this technique random++.

5.3 Attention Learnable Weighting Vectors

Heretofore, the pre-trained and random branches participate equally in every class's predictions, i.e. we do not weight the dimensions of ŷ^p_i and ŷ^r_i before merging them with an element-wise summation. Nevertheless, random classifiers may be more efficient for specific classes compared to pre-trained ones, and vice-versa. In other terms, we do not know which of the two branches (random or pre-trained) is better for making a suitable decision for each class. For instance, if the random branch is more efficient for predicting a particular class c_j, it would be better to give more attention to its outputs concerning the class c_j compared to the pretrained branch.

Therefore, instead of simply performing an element-wise sum between the random and pre-trained predictions, we first weight ŷ^p_i with a learnable weighting vector u ∈ R^C and ŷ^r_i with a learnable weighting vector v ∈ R^C, where C is the tag-set size (number of classes). The element u_j of the vector u represents the pre-trained branch's attention weight for the class c_j, and the element v_j of the vector v represents the random branch's attention weight for the class c_j. Then, we compute a Hadamard product with their associated normalised predictions (see scheme "c" of Figure 4). Both vectors u and v are initialised with 1-values and are fine-tuned by back-propagation. Formally, the final predictions are computed as follows:

    \hat{y}_i = u \odot N_2(\hat{y}_i^p) \oplus v \odot N_2(\hat{y}_i^r) .    (11)
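The independent ℓ2 normalisation of Eq. (10) and the learnable class-wise weighting of Eq. (11) amount to the following merge module; again a hedged PyTorch sketch rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class WeightedNormalisedMerge(nn.Module):
    """ŷ = u ⊙ N2(ŷ_p) ⊕ v ⊙ N2(ŷ_r), with u and v initialised to 1 (Eqs. 10-11)."""
    def __init__(self, n_tags=17):
        super().__init__()
        self.u = nn.Parameter(torch.ones(n_tags))   # attention weights, pre-trained branch
        self.v = nn.Parameter(torch.ones(n_tags))   # attention weights, random branch

    def forward(self, y_p, y_r):
        # Independent l2 normalisation of each branch's prediction vector (Eq. 10),
        # which keeps the pre-trained branch from absorbing the random one.
        y_p = nn.functional.normalize(y_p, p=2, dim=-1)
        y_r = nn.functional.normalize(y_r, p=2, dim=-1)
        return self.u * y_p + self.v * y_r          # Hadamard products, element-wise sum

merge = WeightedNormalisedMerge()
final = merge(torch.randn(2, 8, 17), torch.randn(2, 8, 17))
```

Since u and v are learned per class by back-propagation, the network itself decides, for every tag, how much to trust each branch.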
With the normalisation, this phenomenon was solved, and the random branch starts to be more effective (see the right distribution of Figure 5). Furthermore, we have observed that, despite the normalisation, the performance of the pre-trained classifiers is still much better than that of the randomly initialised ones.[7] Thus, to make the latter more competitive, we propose to start by optimising only the randomly initialised units while freezing the pre-trained ones; then, we launch the joint training. We call this technique random++.

[7] The same problem was stated in some computer-vision works (Liu et al., 2015; Wang et al., 2017; Tamaazousti et al., 2017).

Figure 5: The distributions of the learnt weight values for the randomly initialised (green) and pre-trained (grey) fully-connected layers after their joint training. Left: without normalisation; right: with normalisation.

5.3 Attention Learnable Weighting Vectors

Heretofore, the pre-trained and random branches participate equally in every class's predictions, i.e. we do not weight the dimensions of ŷip and ŷir before merging them with an element-wise summation. Nevertheless, the random classifiers may be more efficient than the pre-trained ones for some specific classes, and vice-versa. In other terms, we do not know which of the two branches (random or pre-trained) is better suited to make the decision for each class. For instance, if the random branch is more efficient at predicting a particular class cj, it would be better to give more attention to its outputs for the class cj than to those of the pre-trained branch.

Therefore, instead of simply performing an element-wise sum between the random and pre-trained predictions, we first weight ŷip with a learnable weighting vector u ∈ R^C and ŷir with a learnable weighting vector v ∈ R^C, where C is the tagset size (number of classes): the element uj of the vector u represents the pre-trained branch's attention weight for the class cj, and the element vj of the vector v represents the random branch's attention weight for the class cj. Concretely, we compute the Hadamard product of each weighting vector with its associated normalised predictions (see scheme "c" of Figure 4). Both vectors u and v are initialised with 1-values and are fine-tuned by back-propagation. Formally, the final predictions are computed as follows:

      ŷi = u ⊙ Np(ŷip) ⊕ v ⊙ Np(ŷir),    (11)

where ⊙ denotes the element-wise (Hadamard) product and ⊕ the element-wise sum.
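As a minimal sketch of the merge in Eq. (11): the two prediction vectors are independently normalised, weighted by the 1-initialised learnable vectors u and v, and summed. The exact form of the independent normalisation Np is described earlier in the paper; taking it to be an L2 normalisation of each prediction vector here is an assumption of this sketch, as are the module and variable names.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class WeightedMerge(nn.Module):
        def __init__(self, num_classes):
            super().__init__()
            self.u = nn.Parameter(torch.ones(num_classes))  # class-wise weights for the pre-trained branch
            self.v = nn.Parameter(torch.ones(num_classes))  # class-wise weights for the random branch

        def forward(self, y_p, y_r):
            # y_p, y_r: (batch, seq_len, num_classes) predictions of the two branches.
            y_p = F.normalize(y_p, p=2, dim=-1)  # independent normalisation Np (assumed L2 here)
            y_r = F.normalize(y_r, p=2, dim=-1)
            return self.u * y_p + self.v * y_r   # Hadamard products followed by an element-wise sum

Both u and v are ordinary parameters, so they are updated by back-propagation together with the rest of the network, as described above.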
6 Experimental Settings

6.1 Datasets

We conduct experiments on supervised domain adaptation from the news domain (formal texts) to the social media domain (noisy texts) for English Part-Of-Speech tagging (POS), Chunking (CK) and Named Entity Recognition (NER). In addition, we experiment on Morpho-syntactic Tagging (MST) of three South-Slavic languages: Slovene, Croatian and Serbian. For the POS task, we use the WSJ part of the Penn-Tree-Bank (PTB) (Marcus et al., 1993) news dataset for the source news domain, and TPoS (Ritter et al., 2011), ArK (Owoputi et al., 2013) and TweeBank (Liu et al., 2018) for the target social media domain. For the CK task, we use the CONLL2000 dataset (Tjong Kim Sang and Buchholz, 2000) for the news source domain and TChunk (Ritter et al., 2011) for the target domain. For the NER task, we use the CONLL2003 dataset (Tjong Kim Sang and De Meulder, 2003) for the source news domain and the WNUT-17 dataset (Derczynski et al., 2017) for the social media target domain. For MST, we use the MTT shared-task benchmark (Zampieri et al., 2018), which contains two types of datasets, social media and news, for three South-Slavic languages: Slovene (sl), Croatian (hr) and Serbian (sr). Statistics of all the datasets are summarised in Table 1.

 Task                              #Classes   Source          Eval. metric           #Tokens (train - val - test)
 POS: POS Tagging                      36     WSJ             Top-1 Acc.             912,344 - 131,768 - 129,654
 CK: Chunking                          22     CONLL-2000      Top-1 Acc.             211,727 - n/a - 47,377
 NER: Named Entity Recognition          4     CONLL-2003      Top-1 Exact-match F1   203,621 - 51,362 - 46,435
 MST: Morpho-syntactic Tagging       1304     Slovene-news    Top-1 Acc.             439k - 58k - 88k
                                      772     Croatian-news   Top-1 Acc.             379k - 50k - 75k
                                      557     Serbian-news    Top-1 Acc.             59k - 11k - 16k

 POS: POS Tagging                      40     TPoS            Top-1 Acc.             10,500 - 2,300 - 2,900
                                       25     ArK             Top-1 Acc.             26,500 - n/a - 7,700
                                       17     TweeBank        Top-1 Acc.             24,753 - 11,742 - 19,112
 CK: Chunking                          18     TChunk          Top-1 Acc.             10,652 - 2,242 - 2,291
 NER: Named Entity Recognition          6     WNUT-17         Top-1 Exact-match F1   62,729 - 15,734 - 23,394
 MST: Morpho-syntactic Tagging       1102     Slovene-sm      Top-1 Acc.             37,756 - 7,056 - 19,296
                                      654     Croatian-sm     Top-1 Acc.             45,609 - 8,886 - 21,412
                                      589     Serbian-sm      Top-1 Acc.             45,708 - 9,581 - 23,327

Table 1: Statistics of the used datasets. Top: datasets of the source (news) domain. Bottom: datasets of the target (social media) domain.

6.2 Evaluation Metrics

We evaluate our models using metrics that are commonly used by the community: accuracy (Acc.) for POS, MST and CK, and entity-level F1 for NER.

Comparison criteria: A common approach to compare the performance of different approaches across different datasets and tasks is to take the average of each approach across all tasks and datasets. However, as has been discussed in many research papers (Subramanian et al., 2018; Rebuffi et al., 2017; Tamaazousti, 2018), when tasks are not evaluated using the same metrics, or when results across datasets are not of the same order of magnitude, the simple average does not allow a "coherent aggregation". For this, we use the average Normalized Relative Gain (aNRG) proposed by Tamaazousti et al. (2019), where a score aNRG_i for each approach i is calculated compared to a
reference approach (baseline) as follows:

      aNRG_i = (1/L) Σ_{j=1}^{L} (s_j^i − s_j^ref) / (s_j^max − s_j^ref),    (12)

with s_j^i being the score of approach i on dataset j, s_j^ref the score of the reference approach on dataset j, and s_j^max the best score achieved across all approaches on dataset j.
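For clarity, the following is a small illustrative helper computing Eq. (12) from per-dataset scores; the function and variable names are ours, and it assumes that on every dataset at least one approach beats the reference (so that the denominator is non-zero).

    def anrg(scores_i, scores_ref, scores_all):
        # scores_i / scores_ref: per-dataset scores of approach i and of the reference.
        # scores_all: per-dataset score lists of every compared approach (including i and the reference).
        gains = []
        for j, (s_i, s_ref) in enumerate(zip(scores_i, scores_ref)):
            s_max = max(scores[j] for scores in scores_all)   # best score on dataset j
            gains.append((s_i - s_ref) / (s_max - s_ref))
        return sum(gains) / len(gains)

    # Example with two datasets and three approaches (the reference is approaches[0]):
    approaches = [[90.0, 40.0], [92.0, 41.0], [91.0, 44.0]]
    print(anrg(approaches[1], approaches[0], approaches))   # 0.625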
6.3 Implementation Details

We use the following Hyper-Parameters (HP):

WRE's HP: In the standard word-level embeddings, tokens are lower-cased, while the character-level component still retains access to the capitalisation information. We set the randomly initialised character embedding dimension to 50 and the dimension of the hidden states of the character-level biLSTM to 100, and we use 300-dimensional word-level embeddings. The latter were pre-loaded from the publicly available GloVe vectors pre-trained on 42 billion words of web-crawled text and covering 1.9M words (Pennington et al., 2014) for the English experiments, and from the publicly available FastText vectors pre-trained on Common Crawl (Bojanowski et al., 2017) for the South-Slavic languages.[8] These embeddings are also updated during training. For the experiments with contextual word embeddings (§7.2.3), we used ELMo (Embeddings from Language Models) embeddings (Peters et al., 2018). For English, we use the small official ELMo model pre-trained on the 1 billion word benchmark (13.6M parameters).[9] Regarding the South-Slavic languages, ELMo pre-trained models are not available except for Croatian (Che et al., 2018).[10] Note that, in all experiments, contextual embeddings are frozen during training.

[8] https://github.com/facebookresearch/fastText/blob/master/docs/crawl-vectors.md
[9] https://allennlp.org/elmo
[10] https://github.com/HIT-SCIR/ELMoForManyLangs

FE's HP: We use a single biLSTM layer (token-level feature extractor) and set the number of units to 200.

PretRand's random branch HP: We run our approach with k = 200 added random units.

Global HP: In all experiments, training (pre-training and fine-tuning) is performed using SGD with momentum and early stopping, with mini-batches of 16 sentences and a learning rate of 1.5 × 10−2. All our models are implemented with the PyTorch library (Paszke et al., 2017).
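As an illustration of this training setup, a minimal PyTorch configuration is sketched below. The stand-in model is only there to have parameters to optimise, and the momentum value is not stated in the text, so 0.9 is purely an assumption of this sketch.

    import torch
    import torch.nn as nn

    # Stand-in tagger; the real model is the biLSTM tagger described above.
    model = nn.Sequential(nn.Linear(300, 200), nn.ReLU(), nn.Linear(200, 17))

    # Reported setup: SGD with momentum, learning rate 1.5e-2, mini-batches of 16 sentences,
    # early stopping on the validation score (momentum value assumed).
    optimizer = torch.optim.SGD(model.parameters(), lr=1.5e-2, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    batch_size = 16

    # One schematic optimisation step on dummy data:
    x = torch.randn(batch_size, 300)           # dummy token representations
    y = torch.randint(0, 17, (batch_size,))    # dummy gold tags
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()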
                               POS (Acc.)                     CK (Acc.)      NER (F1)
 Dataset                  TPoS           ArK    TweeBank       TChunk         WNUT
 Method                 dev    test     test    dev    test    dev    test     test
 From scratch          88.52  86.82    90.89   91.61  91.66   87.76  85.83    36.75
 Standard Fine-tuning  90.95  89.79    92.09   93.04  93.29   90.71  89.21    41.25

Table 2: Results of the standard fine-tuning scheme (transferring pre-trained models) compared to supervised training from scratch on the social media datasets (Acc. (%) for POS and CK, F1 (%) for NER). The best score for each dataset is highlighted in bold.

7 Experimental Results

This section reports all our experimental results and analyses. First, we analyse the standard fine-tuning scheme of transfer learning (§7.1). Then, we assess the performance of our proposed approach, PretRand (§7.2).

7.1 Analysis of the Standard Fine-tuning Scheme

We report in Table 2 the results of the reference supervised training scheme from scratch, followed by the results of the standard fine-tuning scheme, which outperforms the reference. Precisely, transfer learning exhibits an improvement of ∼+3% acc. for TPoS, ∼+1.2% acc. for ArK, ∼+1.6% acc. for TweeBank, ∼+3.4% acc. for TChunk and ∼+4.5% F1 for WNUT.

In the following, we provide the results of our analysis of the standard fine-tuning scheme:

   1. Analysis of the hidden negative transfer (§7.1.1).

   2. Quantifying the change of individual pre-trained neurons after fine-tuning (§7.1.2).

   3. Visualising the evolution of pre-trained neurons' stimulus during fine-tuning (§7.1.3).

7.1.1 Analysis of the Hidden Negative Transfer

To investigate the hidden negative transfer in the standard fine-tuning scheme of transfer learning, we propose the following experiments. First, we show that the final gain brought by the standard fine-tuning can be separated into two categories: positive transfer and negative transfer. Second, we provide some qualitative examples of negative transfer.

Quantifying Positive Transfer & Negative Transfer

Figure 6: The percentage of negative transfer and positive transfer brought by the standard fine-tuning adaptation scheme compared to the supervised training from scratch scheme.

We recall that we define positive transfer as the percentage of tokens that were wrongly predicted by random initialisation (supervised training from scratch) but that the standard fine-tuning changed to the correct ones, while negative transfer represents the percentage of words that were tagged correctly by random initialisation but to which standard fine-tuning gives wrong predictions. Figure 6 shows the results on the English social media datasets, first tagged with the classic supervised training scheme and then using the standard fine-tuning. Blue bars show the percentage of positive transfer and red bars give the percentage of negative transfer. We observe that, even though the standard fine-tuning approach is effective (the resulting positive transfer is higher than the negative transfer in all cases), the negative transfer mitigates the final gain brought by the standard fine-tuning. For instance, on the TChunk dataset, standard fine-tuning corrected ∼4.7% of the predictions but falsified ∼1.7%, which reduces the final gain to ∼3%.[11]

[11] Here we calculate positive and negative transfer at the token level. Thus, the gain shown in Figure 6 for the WNUT dataset does not correspond to the one in Table 2, since the F1 metric is calculated only on named entities.
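A minimal sketch of this token-level computation is given below; the function and variable names are ours, and predictions are assumed to be flat lists of tags aligned with the gold labels.

    def transfer_rates(gold, pred_scratch, pred_finetuned):
        # Positive transfer: tokens the from-scratch model got wrong but fine-tuning corrected.
        # Negative transfer: tokens the from-scratch model got right but fine-tuning falsified.
        n = len(gold)
        positive = sum(ps != g and pf == g for g, ps, pf in zip(gold, pred_scratch, pred_finetuned))
        negative = sum(ps == g and pf != g for g, ps, pf in zip(gold, pred_scratch, pred_finetuned))
        return 100.0 * positive / n, 100.0 * negative / n

    gold       = ["NOUN", "VERB", "PROPN", "ADJ"]
    scratch    = ["NOUN", "NOUN", "PROPN", "ADJ"]     # from-scratch predictions
    fine_tuned = ["NOUN", "VERB", "NOUN",  "ADJ"]     # fine-tuned predictions
    print(transfer_rates(gold, scratch, fine_tuned))  # (25.0, 25.0): one corrected, one falsified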
                                                                 target domains.
Qualitative Examples of Negative Transfer

We report in Table 3 concrete examples of words whose predictions were falsified when using the standard fine-tuning scheme compared to the standard supervised training scheme. Among the mistakes we have observed:

   • Tokens with an upper-cased first letter: In news (formal English), only proper nouns start with an upper-case letter inside sentences. Consequently, when using transfer learning, the pre-trained units fail to slough this pattern, which is not always respected in social media. Hence, we found that most of the tokens with an upper-cased first letter are mistakenly predicted as proper nouns (PROPN) in POS, e.g. Award, Charity, Night, etc., and as entities in NER, e.g. Father, Hey, etc., which is consistent with the findings of Seah et al. (2012): negative transfer is mainly due to conditional distribution differences between source and target domains.

   • Contractions are frequently used in social media to shorten a set of words. For instance, in the TPoS dataset, we found that "'s" is in most cases predicted as a "possessive ending (pos)" instead of a "Verb, 3rd person singular present (vbz)". Indeed, in formal English, "'s" is used in most cases to express the possessive form,
 TPoS       Award (nn → nnp), ’s (vbz → pos), its? (prp → prp$), Mum (nn → uh), wont? (MD → VBP), id? (prp → nn), Exactly (uh → rb)
 ArK        Charity (noun → pnoun), I’M? (L → E), 2pac× (pnoun → $), 2× (P → $), Titans? (Z → N), wth× (! → P), nvr× (R → V)
 TweeBank   amazin• (adj → noun), Night (noun → propn), Angry (adj → propn), stangs (propn → noun), #Trump (propn → X), awsome• (adj → intj), bout• (adp → verb)
 TChunk     luv× (b-vp → i-intj), **ROCKSTAR**THURSDAY (b-np → O), ONLY (i-np → b-np), Just (b-advp → b-np), wyd× (b-np → b-intj), id? (b-np → i-np)
 WNUT       Hey (O → b-person), Father (O → b-person), &× (O → i-group), IMO× (O → b-group), UN (O → b-group), Glasgow (b-location → b-group), Supreme (b-person → b-corporation)

nn=N=noun=common noun / nnp=pnoun=propn=proper noun / vbz=Verb, 3rd person singular present / pos=possessive ending / prp=personal pronoun / prp$=possessive pronoun / md=modal / VBP=Verb, non-3rd person singular present / uh=!=intj=interjection / rb=R=adverb / L=nominal + verbal or verbal + nominal / E=emoticon / $=numerical / P=pre- or postposition, or subordinating conjunction / Z=proper noun + possessive ending / V=verb / adj=adjective / adp=adposition

Table 3: Examples of predictions falsified by the standard fine-tuning (SFT) scheme when transferring from the news domain to the social media domain. Each entry shows a word from the validation set of the corresponding dataset, followed by the correct label predicted by the classic supervised setting (Random-200) and the wrong label predicted by the SFT setting (correct → wrong). Mistake types: ? for contractions, × for abbreviations, • for misspellings; tokens with an upper-cased first letter illustrate the capitalisation mistake.

Figure 7: Correlation results between Φ units' activations before fine-tuning (columns) and after fine-tuning (rows); one panel per dataset (ArK, TChunk, WNUT). Brighter colours indicate higher correlation.

     e.g. "company's decision", but rarely in the contractions that are frequently used in social media, e.g. "How's it going with you?". Similarly, "wont" is a frequent contraction of "will not", e.g. "i wont get bday money lool", predicted as a "verb" instead of a "modal (MD)"[12] by the SFT scheme. The same holds for "id", which stands for "I would".

   • Abbreviations are frequently used in social media to shorten the way a word is standardly written. We found that the standard fine-tuning scheme stumbles on abbreviation predictions, e.g. 2pac (Tupac), 2 (to), ur (your), wth (what the hell) and nvr (never) in the ArK dataset, and luv (love) and wyd (what you doing?) in the TChunk dataset.

   • Misspellings: Likewise, we found that the standard fine-tuning scheme often gives wrong predictions for misspelt words, e.g. awsome, bout, amazin.

[12] A modal is an auxiliary verb expressing ability (can), obligation (have to), etc.
7.1.2 Quantifying the Change of Individual Pre-trained Neurons

To visualise the bias phenomenon occurring with the standard fine-tuning scheme, we quantify the charge of individual neurons. Precisely, we plot the asymmetric correlation matrix C (following the method described in §4.2.1) between the Φ layer's units before and after fine-tuning, for each social media dataset (ArK for POS, TChunk for CK and WNUT-17 for NER). From the resulting correlation matrices, illustrated in Figure 7, we can observe the diagonal representing the charge of each unit, with most units having a high charge (light colour), indicating that every unit after fine-tuning is highly correlated with itself before fine-tuning. Hypothesising that a high correlation on the diagonal entails a high bias, the results of this experiment confirm our initial motivation: pre-trained units are highly biased towards what they have learnt on the source dataset, which limits their ability to learn patterns that are specific to the target dataset. Our observations were recently confirmed by Merchant et al. (2020), who also found that fine-tuning is a "conservative process".
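A rough sketch of such a unit-wise comparison is given below: each unit's activations after fine-tuning (rows) are correlated with every unit's activations before fine-tuning (columns) over the same tokens. This is a plain Pearson-correlation version written by us for illustration; the exact procedure of §4.2.1 may differ.

    import numpy as np

    def unit_correlation(acts_before, acts_after):
        # acts_before, acts_after: (num_tokens, num_units) activations of the Φ layer
        # on the same tokens, before and after fine-tuning.
        b = (acts_before - acts_before.mean(0)) / (acts_before.std(0) + 1e-8)
        a = (acts_after - acts_after.mean(0)) / (acts_after.std(0) + 1e-8)
        return (a.T @ b) / len(b)    # entry (i, j): correlation of unit i after with unit j before

A strongly dominant diagonal of this matrix corresponds to the behaviour described above: each fine-tuned unit remains highly correlated with its own pre-trained version.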
7.1.3 Visualising the Evolution of Pretrained Neurons' Stimulus during Fine-tuning

Here, we give concrete visualisations of the evolution of pre-trained neurons' stimulus during fine-tuning when transferring from the news domain to the social media domain. Following the method described in §4.2.2, we plot the matrices of the top-10 words activating each neuron j, positively (A^(j)_best+) or negatively (A^(j)_best−). The results are plotted in Figure 8 for the ArK (POS) dataset and in Figure 9 for the TweeBank (POS) dataset. Rows represent the top-10 words from the target dataset activating each unit, and columns represent fine-tuning epochs: before fine-tuning in column 0 (at this stage the model is only trained on the source dataset), and during fine-tuning (columns 5 to 20). Additionally, to give an idea of each unit's stimulus on the source dataset, we also show, in the first column (Final-WSJ), the top-10 words from the source dataset activating the same unit before fine-tuning. In the following, we describe the information encoded by each of the presented neurons.[13]

[13] Here we only select some interesting neurons; we also found many neurons that are not interpretable.

Figure 8: Individual units' activations before and during fine-tuning on the ArK POS dataset (panels: Unit-196 and Unit-64). For each unit we show the top-10 words activating it. First column: top-10 words from the source validation set (WSJ) before fine-tuning; column 0: top-10 words from the target validation set (ArK) before fine-tuning; columns 5 to 20: top-10 words from the target validation set during fine-tuning epochs.

   • ArK - POS (Figure 8):

        – Unit-196 is sensitive to contractions containing an apostrophe, regardless of the contraction's class. However, unlike in news, in social media and particularly in the ArK dataset, apostrophes are used in different cases. For instance, i'm, i'll and it's belong to the class "L", which stands for "nominal + verbal or verbal + nominal", while the contractions can't and don't belong to the class "Verb".

        – Unit-64 is sensitive to plural proper nouns in the news domain before fine-tuning, e.g. Koreans and Europeans, and also on ArK during fine-tuning, e.g. Titans and Patriots. However, in the ArK dataset, "Z" is a special class for "proper noun + possessive ending", e.g. Jay's mum, and in some cases the apostrophe is omitted, e.g. Fergusons house for Ferguson's house, which may bring ambiguity with plural proper nouns in formal English. Consequently, unit-64, initially sensitive to plural proper nouns, also fires on words from the class "Z", e.g. Timbers (Timber's).

   • TweeBank - POS (Figure 9):

        – Unit-37 is sensitive, before and during fine-tuning, to plural nouns such as gazers and feminists. However, it also fires on the word stangs because of its -s ending, even though it is in fact a proper noun. This might explain the wrong prediction for the word stangs (noun instead of proper noun) given by the standard fine-tuning scheme.
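The top-10 lists used in these visualisations can be extracted by simply ranking the validation-set tokens by a unit's activation; the short helper below is our own illustration, and the actual selection procedure of §4.2.2 may differ in its details.

    import numpy as np

    def top_k_words(activations, words, unit, k=10):
        # activations: (num_tokens, num_units) activations of the Φ layer on a validation set;
        # words: the corresponding tokens. Returns the k words activating `unit` the most
        # positively and the k words activating it the most negatively.
        values = activations[:, unit]
        order = np.argsort(values)
        best_pos = [words[i] for i in order[::-1][:k]]
        best_neg = [words[i] for i in order[:k]]
        return best_pos, best_neg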