BERTGEN: Multi-task Generation through BERT

Faidon Mitzalis¹, Ozan Caglayan¹, Pranava Madhyastha¹, Lucia Specia¹,²
¹ Department of Computing, Imperial College London, UK
² Department of Computer Science, University of Sheffield, UK
phaedonmit@gmail.com, {o.caglayan,pranava,lspecia}@ic.ac.uk

Abstract

We present BERTGEN, a novel generative, decoder-only model which extends BERT by fusing the multimodal and multilingual pre-trained models VL-BERT and M-BERT, respectively. BERTGEN is auto-regressively trained for language generation tasks, namely image captioning, machine translation and multimodal machine translation, under a multi-task setting. With a comprehensive set of evaluations, we show that BERTGEN outperforms many strong baselines across the tasks explored. We also show BERTGEN's ability for zero-shot language generation, where it exhibits competitive performance to supervised counterparts. Finally, we conduct ablation studies which demonstrate that BERTGEN substantially benefits from multi-tasking and effectively transfers relevant inductive biases from the pre-trained models.

1 Introduction

Recent work in unsupervised and self-supervised pre-training has revolutionised the field of natural language understanding (NLU), resulting in high performance ceilings across multiple tasks (Devlin et al., 2019; Yang et al., 2019; Dong et al., 2019). The recent success of language model pre-training with masked language modelling (MLM) such as BERT (Devlin et al., 2019) further paved the way for more complex approaches that combine language pre-training with images (Tan and Bansal, 2019; Su et al., 2020; Lu et al., 2020), video (Sun et al., 2019), and speech (Chuang et al., 2020). Most of these approaches follow a task-specific fine-tuning step after the model is pre-trained.

However, there has been little work on exploiting pre-trained MLMs for natural language generation (NLG) tasks. Previous work argues that the MLM objective is ill-suited for generation tasks such as machine translation (Yang et al., 2019; Rothe et al., 2020). Recent work in this direction has predominantly investigated the use of pre-trained models to either initialise Transformer-based encoder-decoder models (Imamura and Sumita, 2019; Clinchant et al., 2019; Yang et al., 2020; Rothe et al., 2020) or to distill knowledge for sequence generation tasks (Chen et al., 2020).

In this work, we present BERTGEN, which extends BERT in a generative setting (§ 2.1). This results in a single generator – without a separation between the encoder and the decoder – capable of consuming multiple input modalities and generating in multiple languages. The latter features are achieved by transferring knowledge from state-of-the-art pre-trained models, namely VL-BERT (Su et al., 2020) and multilingual BERT (M-BERT) (Devlin et al., 2019). We train BERTGEN on various tasks, including image captioning, machine translation and multimodal machine translation, on datasets covering four different languages (§ 2.2).

Based on a number of experiments, our findings (§ 3) show that BERTGEN (i) is surprisingly versatile, as it is capable of describing images and performing translation in unimodal and multimodal settings, across all languages; (ii) generalises well across zero-shot image captioning, multimodal machine translation, and out-of-domain news translation tasks; and (iii) is parameter efficient when compared to the state-of-the-art models for each of the tasks combined.

2 Method

In this section, we describe BERTGEN and the tasks we explore. We then detail the baselines and SoTA systems that we compare against.

2.1 Model

This section details the main aspects of BERTGEN that distinguish it from the existing work on vision & language pre-training.
[Figure 1: A view of the BERTGEN model during the training of an MMT sample. A single input stream – the target language specifier [DE], [CLS], source tokens x_1..x_7, [SEP], the target prefix y_1, y_2 followed by [MASK] and [SEP], then [IMG] positions carrying RoI features up to [END] – is encoded by the Transformer with an MLM head on top; segment embeddings distinguish the textual Segment A from Segment B, and all [IMG] positions share a single positional index. Solid and dashed borders around the images represent full-image features and regional features, respectively. At test time, the most likely token y_3 = argmax P(y_t | x, v, y_{<t}) is placed into the [MASK] position.]
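Figure 1 implies a simple mask-predict-extend decoding loop. The following is a minimal sketch of that loop under our reading of the figure; the model(token_ids, roi_feats) interface and the vocabulary lookup are illustrative assumptions, not the released implementation:

    import torch

    @torch.no_grad()
    def greedy_decode(model, vocab, src_ids, roi_feats, lang="[DE]", max_len=50):
        # Sketch of BERTGEN-style generation (Figures 1 and 3): the whole
        # stream is re-encoded at every step, the MLM head fills the current
        # [MASK], and a fresh [MASK] is appended for the next position.
        # `model(token_ids, roi_feats)` returning per-position vocabulary
        # logits is a hypothetical interface.
        out = []                                            # y_1 .. y_{t-1} so far
        for _ in range(max_len):
            # single stream: [DE] [CLS] x_1..x_m [SEP] y_1..y_{t-1} [MASK] [SEP]
            ids = [vocab[lang], vocab["[CLS]"]] + src_ids + [vocab["[SEP]"]] \
                  + out + [vocab["[MASK]"], vocab["[SEP]"]]
            logits = model(torch.tensor([ids]), roi_feats)  # (1, len(ids), |V|)
            y_t = logits[0, len(ids) - 2].argmax().item()   # argmax P(y_t | x, v, y_<t)
            if y_t == vocab["[STOP]"]:                      # stop symbol ends decoding
                break
            out.append(y_t)
        return out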
[Figure 3: A look at BERTGEN's self-attention: the connections denote that self-attentive representations are re-computed in every step. The generation ends when STOP is predicted. The smileys refer to RoI features.]

[...] bounding box coordinates). When encoding the non-visual positions, the same RoI feature vector for the full image is repeated (see Figure 1). We note that we do not fine-tune the object detector during training.

Sequence unrolling. An important aspect of BERTGEN is that it does not explicitly distinguish between the encoder and the decoder blocks usually seen in sequence-to-sequence models. This is accomplished by formalising both encoding and generation using the MLM framework. Formally, let us consider the MMT task and define the maximum log-likelihood objective for a given triplet {x^{(i)}, v^{(i)}, y^{(i)}} where the target y^{(i)} has n tokens:

    L^{(i)} = \sum_{t=1}^{n} \log P(y_t^{(i)} \mid x^{(i)}; v^{(i)}; y_{<t}^{(i)})

[...] self-attentive representations of early positions are naturally re-computed, in contrast to typical Transformer decoders. Finally, due to how inputs/outputs are represented in a single stream and encoded through shared self-attention, BERTGEN enforces an inductive bias towards a truly multi-modal and multi-lingual representation space.

Target language specifiers. Finally, to select the language during generation, input sequences begin with special target language specifiers (Ha et al., 2016; Johnson et al., 2017) (Figure 1). The specifier is task-agnostic, i.e. the same specifier [DE] is used both when captioning into German and when translating into German.

Training & hyper-parameters. We extend² the base configuration of VL-BERT, which is a Transformer with 12 self-attention layers and 12 heads. The model and feed-forward dimensions are 768 and 3072, respectively. On a single 32GB V100 GPU, one epoch (§ 3) takes approximately two days to complete, as we could only fit one example per task (i.e. batch size equal to 13) into the memory³. We use the AdamW optimiser (Loshchilov and Hutter, 2019) with the base learning rate set to 1.3×10⁻⁵. The learning rate is warmed up in the first 16K steps and then decays linearly. We set the weight decay to 10⁻⁴. During training, we let the model update [...]
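As an illustration of how one training triplet is expanded, here is a minimal sketch of sequence unrolling; the exact arrangement of special tokens follows our reading of Figure 1 and is an assumption:

    def unroll(src_tokens, tgt_tokens, lang="[DE]"):
        # Sketch of sequence unrolling: a target sentence of n tokens becomes
        # n + 1 MLM samples, each masking one position given the known prefix
        # (the extra sample teaches the stop symbol). The packing of special
        # tokens is an assumption based on Figure 1.
        samples = []
        for t in range(len(tgt_tokens) + 1):
            prefix = tgt_tokens[:t]                          # y_1 .. y_{t-1}
            label = tgt_tokens[t] if t < len(tgt_tokens) else "[STOP]"
            stream = [lang, "[CLS]"] + src_tokens + ["[SEP]"] \
                     + prefix + ["[MASK]", "[SEP]"]
            samples.append((stream, label))                  # MLM target at [MASK]
        return samples

This unrolling is also what the Augm. column of Table 1 counts: for instance, the 29,000 MULTI30K EN→DE sentences expand to 582K samples, i.e. roughly 20 masked positions per target sentence.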
Name        Type   Task     Sents     Augm.
FLICKR8K    IC     IM→TR    13,828    263K
MULTI30K    MMT    DE→EN    29,000    464K
                   FR→EN              464K
                   EN→FR              560K
MULTI30K    MMT    EN→DE    29,000    582K
FLICKR30K   IC     IM→EN    145,000   2.39M
                   IM→DE              2.48M
IWSLT       MT     DE→EN    158,388   3.85M
                   EN→DE              4.43M
IWSLT       MT     FR→EN    163,328   4.01M
                   EN→FR              4.78M
SETIMES     MT     TR→EN    185,318   6.01M
                   EN→TR              8.40M

Table 1: Training statistics of BERTGEN: the last column is the number of samples after sequence unrolling.

2.2.1 Image Captioning

Image captioning (IC) involves describing images in a specified natural language. We train BERTGEN for English, German and Turkish captioning tasks. Specifically, we use the FLICKR30K dataset (Young et al., 2014), which provides 29K training images, each with five English captions collected through crowd-sourcing. The validation and test sets contain approximately 1K images each. We use the MULTI30K dataset (Elliott et al., 2016), which annotates FLICKR30K images with five German captions. Finally, we use the TASVIRET dataset (Unal et al., 2016), which provides two Turkish captions for each of the 8,092 images in the FLICKR8K dataset (Rashtchian et al., 2010). Since FLICKR8K is a subset of FLICKR30K, we create a new split of TASVIRET to avoid data leakage between the training and test splits. The resulting training, validation and test splits contain 6,914, 543, and 543 images, respectively.

To evaluate BERTGEN's performance on IC, we compare it against previous work with strong performance on COCO (Chen et al., 2015) and FLICKR30K: ADAPTIVE ATTENTION (SENTINEL) (Lu et al., 2017), which uses a sentinel token to distinguish between visual and non-visual representations, and NEURAL BABY TALK (NBT) (Lu et al., 2018), which follows a slot-filling approach through explicit object region information.

2.2.2 Multimodal Machine Translation

Multimodal Machine Translation (MMT) attempts to improve MT quality by incorporating information from modalities other than language (Sulubacak et al., 2020). In our case, we train BERTGEN for EN↔DE and EN↔FR MMT tasks and use the MULTI30K dataset, the main dataset for image-informed translation, which provides caption translations for FLICKR30K images in German and French. To evaluate BERTGEN on MMT tasks, we use the original 2016 test set, which contains 1,000 examples.

For a comprehensive comparison with previous work, we train a SoTA recurrent MMT (Caglayan et al., 2020) solely on the MULTI30K dataset, which applies a secondary (visual) attention in the decoder over the RoI features, i.e. the same features that are also used by BERTGEN (§ 2.1). There are two GRU (Cho et al., 2014) layers in both the encoder and the decoder, and the embedding & hidden dimensions in the model are set to 200 and 320, respectively. Each model has ∼5.6M parameters excluding the word embeddings.

Besides the state-of-the-art constrained recurrent MMT model described above, we further compare BERTGEN – which is trained on various other MT and IC corpora – to an unconstrained Transformer-based MMT trained on ∼9M additional EN→DE sentences (Libovický, 2019)⁴ in addition to MULTI30K.

⁴ We obtained test set outputs from the author and pre-processed them with the M-BERT tokeniser to ensure comparability.

2.2.3 Text-only Machine Translation

We incorporate six text-only MT tasks into our training protocol. We use the EN↔DE and EN↔FR MT datasets from IWSLT'14 (Cettolo et al., 2012), which consist of TED Talks subtitles and their translations. We take the prepare-iwslt14 recipe from FAIRSEQ (Ott et al., 2019) to prepare the dev and test sets. This yields an EN↔DE test set of 6,750 sentences consisting of dev2010, dev2012.TEDX, tst2010, tst2011 and tst2012. Similarly, the EN↔FR test set consists of dev2010, tst2010, tst2011 and tst2012, which amounts to 4,493 sentences.

For the EN↔TR directions, we use the SETIMES2 (Tiedemann, 2012) news dataset for training. For the development and test sets, we take the official WMT test sets (Bojar et al., 2018): newstest2016 and newstest2017 as the development set (6,007 sentences), and newstest2018 (6,000 sentences) as the test set.
Both the IWSLT and SETIMES2 corpora are medium-scale resources often used in the MT research community, and they have much harder test sets than the MMT and IC tasks due to a significant domain shift.

Finally, for each translation direction, we train a Transformer NMT model (Vaswani et al., 2017) using the IWSLT-DE-EN recipe of the FAIRSEQ toolkit (Ott et al., 2019). This recipe has six encoders and six decoders, each equipped with 4-head self-attention layers. The model and feed-forward dimensions are set to 512 and 1024, respectively. Each model has ∼31.5M parameters excluding the word embeddings. Since BERTGEN is a general-purpose multilingual and multimodal generator, we expect it to perform in the same ballpark as these strong NMT baselines, but not necessarily be SoTA compared to novel & sophisticated NMT models, which also make use of a lot more training data.

3 Results and Findings

We train BERTGEN on lowercased sentences for 45 epochs, after which the overall performance on the tasks reached a plateau. We define one BERTGEN epoch as a single pass over all of the training data for the MULTI30K EN→DE MMT task and denote this task as the reference task. We use greedy search for all systems that we trained and merge back the word pieces before evaluation. We compute tokenised⁵ BLEU (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2014) and CIDEr (Vedantam et al., 2015) using coco-caption⁶. In what follows, we provide detailed quantitative and qualitative findings.

⁵ Since M-BERT is aggressive on splitting apostrophes and hyphens, our results may slightly differ from other work.
⁶ https://github.com/tylin/coco-caption

3.1 Image Captioning

Table 2 provides an overview of BERTGEN's image captioning performance on different test sets and languages. First of all, on English FLICKR30K, BERTGEN clearly outperforms the strong captioning models (§ 2.2.1) SENTINEL (Lu et al., 2017) and NBT (Lu et al., 2018), even though they use beam search for decoding. On COCO (Chen et al., 2015), an image captioning corpus much larger and more diverse than FLICKR30K, we evaluate BERTGEN on the Karpathy test split (Karpathy and Fei-Fei, 2015) and notice that the scores are reasonable given that BERTGEN is not trained on COCO: our model lags behind NBT (w/ beam search) by 6.7 METEOR.

TASK      APPROACH    BL     MT     CR
F30K EN   BERTGEN     27.0   23.2   0.587
          SENTINEL‡   25.1   20.4   0.531
          NBT‡        27.1   21.7   0.575
COCO EN   BERTGEN     15.9   20.4   0.487
          NBT‡        34.7   27.1   1.072
F30K FR   BERTGEN      5.2   18.1   0.397
F8K TR    BERTGEN      8.5   14.5   0.363
F30K DE   BERTGEN     17.8   34.2   0.500

Table 2: BLEU (BL), METEOR (MT) and CIDEr (CR) scores for image captioning: the F30K FR row is zero-shot generation (marked with a gray background in the original), whereas ‡ denotes systems decoded with beam search.

For zero-shot French captioning (F30K FR), we resort to the reference MMT translations from the MULTI30K EN→FR task, as there are no human references for French. Although this is problematic, since the metrics will penalise captions that are not translations of the English captions, we provide the scores to show that the zero-shot outputs are valid descriptions. We note that the low range of scores reported here is also due to having one reference caption instead of five references⁷ as in FLICKR30K. Finally, we report results for our custom Turkish split (§ 2.2.1) (F8K TR) and for German (F30K DE). Even though there are no comparable results in the literature for these three tasks, we demonstrate through some qualitative examples that BERTGEN produces sensible outputs.

⁷ As a reference, evaluating English captions using one reference at a time yields 7.9 BLEU on average, compared to 27.0 BLEU in Table 2.

Qualitative examples. We now focus on a few examples to examine the multilingual image captioning ability of BERTGEN in action (Table 3). For the first image, all captions are almost the same, as the image has few salient points. For the second image, however, we observe much more variation across captions, in line with the complexity of the scene. We are particularly surprised by the zero-shot French captioning performance, a task that BERTGEN is not trained for at all. Upon manual inspection, we noticed that the captions are often short, objective gists of the images. These observations also hold for the captions generated for the COCO test set, as we can see in the third example. A set of additional examples in the Appendix shows that BERTGEN does not simply retrieve caption translations learned from the EN→FR task. Overall, both the quantitative and the qualitative results provide evidence of the utility of multimodal and multilingual initialisation, as well as the efficacy of knowledge transfer across different tasks, for image captioning.
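For reference, the metric computation described in § 3 can be scripted in a few lines; a minimal sketch, assuming the pycocoevalcap packaging of the coco-caption code linked above:

    from pycocoevalcap.bleu.bleu import Bleu
    from pycocoevalcap.cider.cider import Cider
    from pycocoevalcap.meteor.meteor import Meteor   # shells out to Java

    def score_captions(refs, hyps):
        # refs: dict image_id -> list of tokenised reference strings (five for
        # FLICKR30K, a single MMT reference in the zero-shot French setup,
        # cf. footnote 7); hyps: dict image_id -> one-element list with the
        # system output. The data layout mirrors the coco-caption convention.
        scores = {}
        bleu, _ = Bleu(4).compute_score(refs, hyps)
        scores["BLEU"] = bleu[3]                     # BLEU-4, as reported
        scores["METEOR"], _ = Meteor().compute_score(refs, hyps)
        scores["CIDEr"], _ = Cider().compute_score(refs, hyps)
        return scores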
[Image: a man with a hat and glasses]
EN  a man wearing a hat and glasses.
DE  ein mann mit hut und brille.
    (a man with hat and glasses)
TR  şapkalı ve gözlüklü bir adam.
    (a man with a hat and glasses.)
FR  un homme avec un chapeau et des lunettes.
    (a man with a hat and glasses.)

[Image: men working on a rooftop]
EN  two men are on a rooftop working on something.
DE  zwei männer arbeiten auf einem dach.
    (two men working on a roof)
TR  iki binanın inşasında oturmuş, yanyana yerde duran iki kişi.
    (two people seated in the construction of two buildings, standing next to each other on the ground.)
FR  trois ouvriers du bâtiment construisent un toit.
    (three construction workers build a roof.)

[Image: a man riding a motorbike on a dirt road]
EN  a man in a red shirt and helmet is riding a motorbike on a dirt road.
DE  ein mann fährt mit einem motorrad auf einem weg an einem fluß entlang.
    (a man rides a motorcycle on a path along a river.)
TR  çamurlu bir yolda motoruyla ilerlemekte olan kırmızı üstlü bir adam ve arkasındaki dağ manzarası.
    (a man in a red top riding his bike down a muddy road with a mountain landscape behind him.)
FR  un homme avec un casque fait du motocross.
    (a man with a helmet rides motocross.)

Table 3: Multilingual image captioning examples: the indented sentences in parentheses are Google Translate translations of the DE, TR and FR outputs into English. The FR captions are zero-shot outputs (marked with a gray background in the original). The last example is from COCO while the others are from FLICKR30K.

3.2 Multimodal Machine Translation

Table 4 summarises BERTGEN's performance on MMT. First of all, BERTGEN consistently outperforms the Transformer-based FAIRSEQ NMT models and the recurrent MMT (Caglayan et al., 2020) models on both the EN→DE and the EN→FR language pairs. Furthermore, BERTGEN is also substantially better than a state-of-the-art unconstrained MMT (Libovický, 2019) model trained on a ∼6x larger parallel corpus.

MMT      APPROACH                  BL     MT
EN→DE    BERTGEN ⋆                 42.2   61.6
         Libovický (2019) ⋆‡       40.8   59.2
         Caglayan et al. (2020)    37.8   56.9
         FAIRSEQ NMT               37.5   56.1
EN→FR    BERTGEN ⋆                 68.0   81.2
         Libovický (2019) ⋆‡       63.4   77.3
         FAIRSEQ NMT               61.5   75.5
         Caglayan et al. (2020)    61.0   75.3
DE→FR    BERTGEN ⋆                 44.8   64.1
         Caglayan et al. (2020)    43.8   62.1
         FAIRSEQ NMT               41.7   60.7
FR→DE    BERTGEN ⋆                 35.1   56.9
         FAIRSEQ NMT               35.1   53.6
         Caglayan et al. (2020)    33.5   53.1

Table 4: BLEU (BL) and METEOR (MT) scores for MMT: the DE→FR and FR→DE BERTGEN results are zero-shot (highlighted with gray in the original). Systems marked with ⋆ and ‡ denote the use of auxiliary resources (i.e. unconstrained) and beam-search decoding, respectively.

Adversarial evaluation. Following Elliott (2018), we probe BERTGEN's ability to integrate multiple modalities effectively. Specifically, we decode translations by shuffling the {image, source caption} mappings so that the images do not correspond to the sentences being translated. The EN→DE results showed that the incongruence leads to drops of 1.1 and 0.9 points in BLEU and METEOR, respectively. For EN→FR, the drops are much more prominent, with 3.1 and 2.3 points, again for BLEU and METEOR. This indicates that the visual features are not ignored, unlike in Caglayan et al. (2019), who showed that sequence-to-sequence MMT models can learn to ignore the images when the linguistic signal is sufficient to perform the task.
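The incongruent decoding used for this probe amounts to breaking the image-caption alignment before translating; a minimal sketch, with model.translate as a hypothetical interface:

    import random

    def incongruent_decode(model, test_set, seed=0):
        # Sketch of the incongruence probe (Elliott, 2018): translate every
        # source caption with a randomly shuffled image, so that visual and
        # textual inputs no longer match; a BLEU/METEOR drop relative to
        # congruent decoding indicates that the visual features are used.
        # `model.translate(src, image)` is a hypothetical interface.
        images = [img for img, _ in test_set]       # test_set: [(image, src), ...]
        random.Random(seed).shuffle(images)         # break the image-caption mapping
        return [model.translate(src, img)
                for (_, src), img in zip(test_set, images)]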
Zero-shot performance. The results in Table 4 show the surprising ability of BERTGEN to perform MMT on directions unseen during training. Moreover, the zero-shot performance surpasses strong MMT and NMT systems by up to 2 and 3.3 METEOR for DE→FR and FR→DE, respectively. Similar to the image captioning results, this demonstrates the potential of BERTGEN to generalise over a variety of language pairs and tasks.

3.3 Machine Translation

First, we compare BERTGEN's performance to each task-specific FAIRSEQ system. According to Table 5, the translation quality of BERTGEN is generally superior to that of the strong FAIRSEQ systems, especially in METEOR, where BERTGEN leads in all pairs.

                    FAIRSEQ       BERTGEN
TASK                BL     MT     BL     MT
IWSLT EN→DE         27.4   47.1   27.8   48.4
IWSLT DE→EN         33.6   33.8   35.6   34.7
IWSLT EN→FR         41.0   59.8   40.2   60.5
IWSLT FR→EN         39.1   36.4   40.0   36.8
SETIMES EN→TR       14.1   18.9   13.5   19.1
SETIMES TR→EN       17.3   25.8   19.0   26.9

Table 5: Comparison of the text-only MT performance of BERTGEN to each dedicated FAIRSEQ NMT system: BERTGEN outperforms the single models in most cases.

Second, we look at the learning efficiency by comparing the training curves of BERTGEN and each task-specific FAIRSEQ system (Figure 4). Here, the x axis represents how many times the specific task's training set has been seen by the models. BERTGEN is trained for 45 reference epochs (§ 3), and this corresponds to only a few complete passes over the training sets of the NMT tasks⁸. This is in contrast to the single-task systems, which usually require a large number of epochs for convergence. We notice a general trend and observe that BERTGEN tends to outperform the single-task systems usually after only a few passes over the corresponding training set. Many factors could be contributing to this observation, such as sequence unrolling, multi-tasking, the shared input space, or relevant inductive biases transferred from M-BERT. We partly address these in the ablation studies (§ 3.4) and leave further investigation to future work.

⁸ For example, only ∼3 passes over SETIMES EN→TR.

[Figure 4: BERTGEN's learning efficiency on MT, with six panels (IWSLT DE→EN, EN→DE, EN→FR, FR→EN and SETIMES EN→TR, TR→EN): validation scores are plotted against the number of full passes completed by BERTGEN and each FAIRSEQ model over the corresponding task's training set. Best checkpoints' test set performances are given in Table 5.]
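Footnote 8 can be sanity-checked from Table 1, under our reading of § 2.1 that every batch holds one sample per task, so one reference epoch corresponds to 582K multi-task steps:

    # One "reference epoch" is a full pass over the unrolled MULTI30K EN->DE
    # data (582K samples, Table 1); with one sample per task in every batch,
    # 45 reference epochs expose each task to 45 * 582K of its own samples.
    # Dividing by a task's unrolled size gives the number of full passes over
    # that task. (This batching interpretation is our assumption.)
    steps = 45 * 582_000
    for task, size in [("SETIMES EN->TR", 8_400_000), ("IWSLT DE->EN", 3_850_000)]:
        print(f"{task}: ~{steps / size:.1f} passes")
    # SETIMES EN->TR: ~3.1 passes, matching footnote 8; IWSLT DE->EN: ~6.8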
Zero-shot performance. We use the DE↔FR test set from the WMT'19 shared task on news translation (Barrault et al., 2019) to assess the zero-shot translation capability of BERTGEN. This test set includes 1,701 sentences from news data regarding the European Elections. We compare our results to two shared task systems, namely TARTU (baseline) and MSRA (state-of-the-art) (Barrault et al., 2019), after re-tokenising them accordingly with M-BERT⁹. Although BERTGEN is expected to obtain lower scores than the dedicated WMT systems due to the domain mismatch of the test set, we consider both the quantitative (Table 6) and the qualitative results (Table 7) extremely encouraging.

⁹ TARTU is the baseline and MSRA is the best performing system for the shared task.

           DE→FR          FR→DE
           BL     MT      BL     MT
BERTGEN    19.6   40.5    13.1   36.7
TARTU‡     39.5   59.0    26.3   47.3
MSRA‡      46.5   64.2    38.2   56.4

Table 6: Zero-shot BERTGEN performance on the WMT'19 test set: the TARTU and MSRA systems are not zero-shot, as they are trained on DE↔FR corpora. Systems marked with ‡ are beam search outputs.

BERTGEN   la décision est tombée au 70ème anniversaire de ma femme.
          (the decision fell on my wife's 70th birthday.)
WMT REF   la décision est tombée le jour du 70ème anniversaire de ma femme.
          (the decision fell on my wife's 70th birthday.)

BERTGEN   en espagne, on s'est malheureusement habitué à une rôle double et passive.
          (in spain, we unfortunately got used to a double and passive role.)
WMT REF   en espagne, on s'est malheureusement habitué à un rôle secondaire, passif.
          (in spain, we unfortunately got used to a secondary, passive role.)

BERTGEN   pas parce que le président du fdp a dit quelque chose qu' ils ont défaillant leur vote.
          (not because the fdp president said something that they missed their vote.)
WMT REF   ce n' est pas parce que le président fédéral du fdp a dit quelque chose qu' ils ont refusé d' approuver.
          (it is not because the federal president of the fdp said something that they refused to approve.)

Table 7: Zero-shot DE→FR translations on the WMT'19 test set. The indented sentences in parentheses are Google Translate translations of the French sentences into English. In the original, bold marks important differences between BERTGEN and the references.

3.4 Ablation Studies

3.4.1 Impact of initialisation

We train single-task MMT systems on the MULTI30K EN→DE language pair. Specifically, we begin with a baseline system which is initialised with random weights. We then train a second baseline where only the visual processing layers are transferred from VL-BERT. Finally, we train a third baseline that is initialised similarly to BERTGEN, i.e. using the hybrid initialisation (§ 2.1). Figure 5 compares the validation BLEU scores of these three systems. We observe that the benefits of knowledge transfer from pre-trained models are incrementally positive; however, BERTGEN's hybrid initialisation outperforms the other two ablations.

[Figure 5: Validation scores on MULTI30K EN→DE MMT for the initialisation ablation (Random vs. Visual Only vs. Hybrid): hybrid initialisation is the most beneficial strategy for BERTGEN.]

3.4.2 Impact of multi-task training

We now remove the multi-tasking aspect from BERTGEN to investigate the extent to which the performance improvements are related to the other tasks. Similar to § 3.4.1, we focus on the MULTI30K EN→DE MMT task and train a single-task, hybrid-initialised BERTGEN. Figure 6 compares the validation BLEU scores obtained by the default BERTGEN and the single-task variant. We observe that BERTGEN benefits from multi-task training and, more importantly, does not seem to exhibit patterns of catastrophic forgetting (French, 1999). Based on these observations, we expect similar model behaviour to hold for other tasks.

[Figure 6: Validation scores on MULTI30K EN→DE MMT for the multi-tasking ablation: the default multi-task BERTGEN outperforms the single-task one.]

4 Related Work

4.1 Multimodal multilingual pre-training

Research in NLP and related fields has been increasingly focusing on transfer learning approaches where a model is first pre-trained on a data-rich task and then transferred to downstream tasks (McCann et al., 2017; Peters et al., 2018; Devlin et al., 2019). This framework presumably allows the model to capture useful inductive biases that generalise to a variety of NLP tasks, often after performing a task-specific fine-tuning (Raffel et al., 2020). Of these, the most relevant studies to our work are BERT (Devlin et al., 2019) and its multilingual version M-BERT, which pre-train a Transformer (Vaswani et al., 2017) on large monolingual corpora using the masked language modelling (MLM) objective.
Recent research has also attempted to combine linguistic inputs with other modalities such as vision and speech, to achieve a grounded understanding of meaning. Successful approaches including LXMERT (Tan and Bansal, 2019), VL-BERT (Su et al., 2020) and others (Lu et al., 2019; Li et al., 2020a,b) achieve this by combining BERT's MLM objective with auxiliary tasks such as masked region classification and image-sentence matching, and by pre-training their models on large-scale image captioning corpora (Chen et al., 2015; Sharma et al., 2018). Similarly, SpeechBERT extends BERT by jointly training on speech and text data (Chuang et al., 2020). Although SoTA results are reported by these approaches, they focus on unimodal and multimodal natural language understanding (NLU) tasks, with a strong emphasis on English. The backbone of BERTGEN combines VL-BERT (Su et al., 2020) with M-BERT (Devlin et al., 2019) to realise a multilingual and multimodal generator that can be used for a diverse set of generative tasks and languages rather than NLU tasks.

4.2 Pre-training for generative tasks

Previous work has studied how to benefit from pre-trained BERT models in generative tasks such as NMT (Imamura and Sumita, 2019; Clinchant et al., 2019; Zhu et al., 2020). BERTGEN differs from these as it is not fine-tuned for a particular MT corpus, and it exhibits multi-lingual and multi-modal properties for general-purpose generation.

Another related branch of work explores pre-training strategies specific to sequence-to-sequence tasks. This includes MASS (Song et al., 2019), which exploits an encoder-decoder framework with the MLM objective for task-specific generative pre-training, and UniLM (Dong et al., 2019), which introduces uni-directional, bi-directional and sequence-to-sequence LM objectives by carefully adjusting the self-attention masks during training. Zhou et al. (2020) extend UniLM to vision & language pre-training using Conceptual Captions (Sharma et al., 2018) as the pre-training dataset. However, these models require a further fine-tuning step for generative tasks, unlike BERTGEN, which is trained only once.

4.3 Multi-task learning for generation

Several approaches exist for multi-task learning & generation (Dong et al., 2015; Luong et al., 2016) in NLP, especially in multilingual NMT, where tasks denote different language pairs (Zoph and Knight, 2016; Firat et al., 2016). The multi-task (and zero-shot) generation ability of BERTGEN is mostly inspired by Ha et al. (2016) and Johnson et al. (2017). Both of these introduced target language specifiers to select the output language when decoding translations from their model.

Our multilingual & multimodal take on multi-task generation is most similar to Kaiser et al. (2017), where a single Transformer model is trained on different tasks including image captioning, object classification, machine translation, speech recognition and parsing. However, their architecture depends on particular structures such as encoders, decoders, modality-specific networks and I/O mixers, unlike BERTGEN, which does not require task-specific modules.

5 Conclusions

In this paper, we presented BERTGEN, a novel generative, decoder-only model which extends BERT by combining multimodal and multilingual pre-trained models. Our findings show that BERTGEN obtains strong performance on a variety of generative tasks and further generalises over unseen tasks. Importantly, our model demonstrates the potential for general-purpose (instead of task-specific) generation that goes above and beyond the traditional pre-training and fine-tuning practices. BERTGEN is also parameter efficient: it has 89.3M total parameters and is trained on thirteen tasks encompassing MT, multimodal MT and image captioning, whereas each of the single-task FAIRSEQ NMT baselines alone has 31.5M parameters.

Our ablation studies show that BERTGEN is able to efficiently transfer relevant inductive biases from the pre-trained models and benefits from multi-task learning without suffering from catastrophic forgetting. We hope that these findings will motivate future research in exploiting more sophisticated pre-trained models in place of M-BERT, VL-BERT and others.

Acknowledgments

This paper is a follow-up to the MSc thesis of Faidon Mitzalis, co-supervised by Prof. Lucia Specia and Dr. Ozan Caglayan. Lucia Specia, Pranava Madhyastha and Ozan Caglayan received support from the MultiMT project (H2020 ERC Starting Grant No. 678017). Lucia Specia also received support from the Air Force Office of Scientific Research (under award number FA8655-20-1-7006).
References                                                Kyunghyun Cho, Bart van Merriënboer, Caglar Gul-
                                                            cehre, Dzmitry Bahdanau, Fethi Bougares, Holger
Peter Anderson, Xiaodong He, Chris Buehler, Damien          Schwenk, and Yoshua Bengio. 2014. Learning
  Teney, Mark Johnson, Stephen Gould, and Lei               phrase representations using RNN encoder–decoder
  Zhang. 2018. Bottom-up and top-down attention for         for statistical machine translation. In Proceedings of
  image captioning and visual question answering. In        the 2014 Conference on Empirical Methods in Nat-
  Proceedings of the IEEE conference on computer vi-        ural Language Processing (EMNLP), pages 1724–
  sion and pattern recognition, pages 6077–6086.            1734, Doha, Qatar. Association for Computational
                                                            Linguistics.
Loı̈c Barrault, Ondřej Bojar, Marta R. Costa-jussà,
  Christian Federmann, Mark Fishel, Yvette Gra-
  ham, Barry Haddow, Matthias Huck, Philipp Koehn,        Yung-Sung Chuang, Chi-Liang Liu, Hung yi Lee, and
  Shervin Malmasi, Christof Monz, Mathias Müller,          Lin shan Lee. 2020. SpeechBERT: An Audio-and-
  Santanu Pal, Matt Post, and Marcos Zampieri. 2019.        Text Jointly Learned Language Model for End-to-
  Findings of the 2019 conference on machine transla-       End Spoken Question Answering. In Proc. Inter-
  tion (WMT19). In Proceedings of the Fourth Con-           speech 2020, pages 4168–4172.
  ference on Machine Translation (Volume 2: Shared
  Task Papers, Day 1), pages 1–61, Florence, Italy. As-   Stephane Clinchant, Kweon Woo Jung, and Vassilina
  sociation for Computational Linguistics.                   Nikoulina. 2019. On the use of BERT for neu-
                                                             ral machine translation. In Proceedings of the 3rd
Ondřej Bojar, Christian Federmann, Mark Fishel,            Workshop on Neural Generation and Translation,
  Yvette Graham, Barry Haddow, Philipp Koehn, and            pages 108–117, Hong Kong. Association for Com-
  Christof Monz. 2018. Findings of the 2018 con-             putational Linguistics.
  ference on machine translation (WMT18). In Pro-
  ceedings of the Third Conference on Machine Trans-      Michael Denkowski and Alon Lavie. 2014. Meteor uni-
  lation: Shared Task Papers, pages 272–303, Bel-           versal: Language specific translation evaluation for
  gium, Brussels. Association for Computational Lin-        any target language. In Proceedings of the Ninth
  guistics.                                                Workshop on Statistical Machine Translation, pages
                                                            376–380, Baltimore, Maryland, USA. Association
Ozan Caglayan, Julia Ive, Veneta Haralampieva,              for Computational Linguistics.
  Pranava Madhyastha, Loı̈c Barrault, and Lucia Spe-
  cia. 2020. Simultaneous machine translation with vi-    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
  sual context. In Proceedings of the 2020 Conference        Kristina Toutanova. 2019. BERT: Pre-training of
  on Empirical Methods in Natural Language Process-          deep bidirectional transformers for language under-
  ing (EMNLP), pages 2350–2361, Online. Associa-             standing. In Proceedings of the 2019 Conference
  tion for Computational Linguistics.                        of the North American Chapter of the Association
                                                             for Computational Linguistics: Human Language
Ozan Caglayan, Pranava Madhyastha, Lucia Specia,            Technologies, Volume 1 (Long and Short Papers),
  and Loı̈c Barrault. 2019. Probing the need for visual      pages 4171–4186, Minneapolis, Minnesota. Associ-
  context in multimodal machine translation. In Pro-         ation for Computational Linguistics.
  ceedings of the 2019 Conference of the North Amer-
  ican Chapter of the Association for Computational       Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, and
  Linguistics: Human Language Technologies, Vol-            Haifeng Wang. 2015. Multi-task learning for mul-
  ume 1 (Long and Short Papers), pages 4159–4170,           tiple language translation. In Proceedings of the
  Minneapolis, Minnesota. Association for Computa-          53rd Annual Meeting of the Association for Compu-
  tional Linguistics.                                       tational Linguistics and the 7th International Joint
                                                            Conference on Natural Language Processing (Vol-
Mauro Cettolo, Christian Girardi, and Marcello Fed-
                                                            ume 1: Long Papers), pages 1723–1732, Beijing,
 erico. 2012. WIT3: Web inventory of transcribed
                                                            China. Association for Computational Linguistics.
 and translated talks. In Proceedings of the 16th An-
 nual conference of the European Association for Ma-
 chine Translation, pages 261–268, Trento, Italy. Eu-     Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xi-
 ropean Association for Machine Translation.                aodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou,
                                                            and Hsiao-Wuen Hon. 2019. Unified language
Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakr-                model pre-training for natural language understand-
  ishna Vedantam, Saurabh Gupta, Piotr Dollar, and          ing and generation. In Advances in Neural Informa-
  C. Lawrence Zitnick. 2015. Microsoft COCO Cap-            tion Processing Systems, volume 32, pages 13063–
  tions: Data Collection and Evaluation Server.             13075. Curran Associates, Inc.

Yen-Chun Chen, Zhe Gan, Yu Cheng, Jingzhou Liu,           Desmond Elliott. 2018. Adversarial evaluation of mul-
  and Jingjing Liu. 2020.       Distilling knowledge        timodal machine translation. In Proceedings of the
  learned in BERT for text generation. In Proceedings       2018 Conference on Empirical Methods in Natu-
  of the 58th Annual Meeting of the Association for         ral Language Processing, pages 2974–2978, Brus-
  Computational Linguistics, pages 7893–7905, On-           sels, Belgium. Association for Computational Lin-
  line. Association for Computational Linguistics.          guistics.
Desmond Elliott, Stella Frank, Khalil Sima’an, and Lucia Specia. 2016. Multi30K: Multilingual English-German image descriptions. In Proceedings of the 5th Workshop on Vision and Language, pages 70–74, Berlin, Germany. Association for Computational Linguistics.

Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. 2016. Multi-way, multilingual neural machine translation with a shared attention mechanism. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 866–875, San Diego, California. Association for Computational Linguistics.

Robert M. French. 1999. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135.

Thanh-Le Ha, Jan Niehues, and Alex Waibel. 2016. Toward multilingual neural machine translation with universal encoder and decoder. In Proceedings of the 13th International Conference on Spoken Language Translation.

Kenji Imamura and Eiichiro Sumita. 2019. Recycling a pre-trained BERT encoder for neural machine translation. In Proceedings of the 3rd Workshop on Neural Generation and Translation, pages 23–31, Hong Kong. Association for Computational Linguistics.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.

Lukasz Kaiser, Aidan N. Gomez, Noam Shazeer, Ashish Vaswani, Niki Parmar, Llion Jones, and Jakob Uszkoreit. 2017. One model to learn them all.

Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137.

Gen Li, Nan Duan, Yuejian Fang, Ming Gong, and Daxin Jiang. 2020a. Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 11336–11344. AAAI Press.

Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, and Jianfeng Gao. 2020b. Oscar: Object-semantics aligned pre-training for vision-language tasks. In Computer Vision – ECCV 2020, pages 121–137, Cham. Springer International Publishing.

Jindřich Libovický. 2019. Multimodality in Machine Translation. Ph.D. thesis, Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics, Prague, Czechia.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In International Conference on Learning Representations.

Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. 2017. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3242–3250.

Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, pages 13–23.

Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, and Stefan Lee. 2020. 12-in-1: Multi-task vision and language representation learning. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. 2018. Neural baby talk. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7219–7228.

Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2016. Multi-task sequence to sequence learning. In International Conference on Learning Representations.

Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, pages 6297–6308, Red Hook, NY, USA. Curran Associates Inc.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, Minneapolis, Minnesota. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Cyrus Rashtchian, Peter Young, Micah Hodosh, and Julia Hockenmaier. 2010. Collecting image annotations using Amazon’s Mechanical Turk. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, pages 139–147, Los Angeles. Association for Computational Linguistics.

Sascha Rothe, Shashi Narayan, and Aliaksei Severyn. 2020. Leveraging pre-trained checkpoints for sequence generation tasks. Transactions of the Association for Computational Linguistics, 8:264–280.

Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, Melbourne, Australia. Association for Computational Linguistics.

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked sequence to sequence pre-training for language generation. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5926–5936. PMLR.

Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. 2020. VL-BERT: Pre-training of generic visual-linguistic representations. In International Conference on Learning Representations.

Umut Sulubacak, Ozan Caglayan, Stig-Arne Grönroos, Aku Rouhe, Desmond Elliott, Lucia Specia, and Jörg Tiedemann. 2020. Multimodal machine translation through visuals and speech. Machine Translation, pages 1–51.

Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. 2019. VideoBERT: A joint model for video and language representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7464–7473.

Hao Tan and Mohit Bansal. 2019. LXMERT: Learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5100–5111, Hong Kong, China. Association for Computational Linguistics.

Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), pages 2214–2218, Istanbul, Turkey. European Language Resources Association (ELRA).

Mesut Erhan Unal, Begum Citamak, Semih Yagcioglu, Aykut Erdem, Erkut Erdem, Nazli Ikizler Cinbis, and Ruket Cakici. 2016. TasvirEt: A benchmark dataset for automatic Turkish description generation from images. In 2016 24th Signal Processing and Communication Application Conference (SIU), pages 1977–1980. IEEE.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Jiacheng Yang, Mingxuan Wang, Hao Zhou, Chengqi Zhao, Weinan Zhang, Yong Yu, and Lei Li. 2020. Towards making the most of BERT in neural machine translation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 9378–9385.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78.

Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason Corso, and Jianfeng Gao. 2020. Unified vision-language pre-training for image captioning and VQA. Proceedings of the AAAI Conference on Artificial Intelligence, 34(07):13041–13049.

Jinhua Zhu, Yingce Xia, Lijun Wu, Di He, Tao Qin, Wengang Zhou, Houqiang Li, and Tie-Yan Liu. 2020. Incorporating BERT into neural machine translation. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.

Barret Zoph and Kevin Knight. 2016. Multi-source neural translation. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 30–34, San Diego, California. Association for Computational Linguistics.
A   Qualitative Examples

BERTGEN: un groupe de jeunes sont réunis, et ils sont debout à l’extérieur d’un bâtiment.
         (a group of young people are gathered, and they are standing outside a building.)
MT REF:  des gens debout devant un bâtiment.
         (people standing outside of a building.)

BERTGEN: une petite fille lit un livre.
         (a little girl reads a book.)
MT REF:  un jeune enfant dormant dans son lit avec un livre ouvert sur sa poitrine.
         (a young child sleeping in her bed with an open book on her chest.)

BERTGEN: des manifestants avec des pancartes.
         (demonstrators with placards.)
MT REF:  des familles de militaires défilent dans new york un jour de pluie.
         (military families are marching through new york on a rainy day.)

Table 8: Zero-shot French image captioning examples for the Flickr30K test set.
EN: a group of people are riding on elephants through a river.
FR: un groupe de personnes sur des chevaux sur un bateau dans un ruisseau.
    (a group of people on horses on a boat in a stream.)
DE: eine gruppe reiter fährt auf einem fluss.
    (a group of riders is riding on a river.)
TR: bir grup insan bir derede duran dört tane at ile ilerliyorlar.
    (a group of people are moving with four horses standing in a stream.)

EN: a black dog is playing with a yellow toy in the grass.
FR: un chien avec une frisbee dans la pelouse.
    (a dog with a frisbee in the lawn.)
DE: ein schwarzer hund mit rotem halsband spielt mit einem gelben ball auf einer wiese.
    (a black dog with a red collar is playing with a yellow ball in a meadow.)
TR: yeşil bir topu ısırmaya çalışan siyah bir köpek.
    (a black dog trying to bite a green ball.)

EN: a boy in a red shirt and white shorts is playing tennis.
FR: un tennisteur frappe une balle.
    (a tennisteur hits a ball.)
DE: ein junge spielt tennis.
    (a boy is playing tennis.)
TR: tenis raketi ile topa vuran çocuk.
    (boy hitting the ball with a tennis racket.)

Table 9: COCO captioning examples. Captions in parentheses are Google’s translations into English for the non-English examples. Bold phrases highlight lexical variations or errors related to the salient visual concepts in the images. The last example shows a morphological error that BERTGEN makes when trying to generate “tennis player” in French.
BERTGEN: mehrere ngos, darunter die mozilla - und greenpeace - stiftung, schätzen, dass diese neuen werkzeuge unfähig sind und zu spät kommen.
         (several ngos, including the mozilla and greenpeace foundations, estimate that these new tools are incapable and come too late.)
WMT REF: mehrere ngos, unter denen die mozilla - stiftung und greenpeace, schätzen, dass diese neuen tools unzureichend sind und zu spät kommen.
         (several ngos, including the mozilla foundation and greenpeace, estimate that these new tools are inadequate and come too late.)

BERTGEN: immigration wird als ein großes problem für die ue betrachtet, für 45 prozent der deutschen und 40 prozent aller europäischen.
         (immigration is seen as a big problem for the ue, for 45 percent of germans and 40 percent of all european ones.)
WMT REF: die einwanderung halten 45 prozent der deutschen und 40 prozent aller europäer für das größte problem der eu.
         (45 percent of germans and 40 percent of all europeans consider immigration to be the biggest problem in the eu.)

BERTGEN: das ist der grund, warum er in seinem buch die frage erforscht, ob es alternativen zu wahlen gibt.
         (that is the reason why he explores the question of whether there are alternatives to choose from in his book.)
WMT REF: deshalb geht er in seinem buch der frage nach, ob es alternativen zu wahlen gibt.
         (therefore, in his book, he investigates the question of whether there are alternatives to choose from.)

Table 10: Zero-shot FR→DE translations from the WMT’19 test set. Sentences in parentheses are Google Translate translations of the German outputs into English.