Synthetic Data Augmentation for Zero-Shot Cross-Lingual Question Answering

Page created by Joseph Jacobs
 
CONTINUE READING
Synthetic Data Augmentation
                                                              for Zero-Shot Cross-Lingual Question Answering
                                                                Arij Riabi‡∗ Thomas Scialom?∗ Rachel Keraron?
                                                                 Benoı̂t Sagot‡ Djamé Seddah‡ Jacopo Staiano?
                                                                                     ‡
                                                                                       INRIA Paris
                                                              
                                                                Sorbonne Université, CNRS, LIP6, F-75005 Paris, France
                                                                               ?
                                                                                 reciTAL, Paris, France
                                                                   {thomas,rachel,jacopo}@recital.ai
                                                          {arij.riabi,benoit.sagot,djame.seddah}@inria.fr

                                                                 Abstract                                 et al., 2020), covering respectively 10 and 7 lan-
                                                                                                          guages. Due to the cost of annotation, both are
                                             Coupled with the availability of large scale
                                                                                                          limited only to an evaluation set. They are compa-
                                             datasets, deep learning architectures have en-
arXiv:2010.12643v1 [cs.CL] 23 Oct 2020

                                             abled rapid progress on the Question Answer-
                                                                                                          rable to the validation set of the original SQuAD
                                             ing task. However, most of those datasets                    (see more details in Section 3.3). In both datasets,
                                             are in English, and the performances of state-               each paragraphs is paired with questions in various
                                             of-the-art multilingual models are significantly             languages. It enables the evaluation of models in
                                             lower when evaluated on non-English data.                    an experimental scenario for which the input con-
                                             Due to high data collection costs, it is not real-           text and question can be in two different languages.
                                             istic to obtain annotated data for each language             This scenario has practical applications, e.g. query-
                                             one desires to support.
                                                                                                          ing a set of documents in various languages.
                                             We propose a method to improve the Cross-                        Performing this cross-lingual task is complex
                                             lingual Question Answering performance                       and remains challenging for current models, as-
                                             without requiring additional annotated data,
                                                                                                          suming only English training data. Transfer results
                                             leveraging Question Generation models to pro-
                                             duce synthetic samples in a cross-lingual fash-              are shown to rank behind training-language perfor-
                                             ion. We show that the proposed method al-                    mance (Artetxe et al., 2020; Lewis et al., 2020). In
                                             lows to significantly outperform the baselines               other words, multilingual models fine-tuned only
                                             trained on English data only. We report a new                on English data are found to perform better for
                                             state-of-the-art on four multilingual datasets:              English than for other languages.
                                             MLQA, XQuAD, SQuAD-it and PIAF (fr).                             To the best of our knowledge, no alternative
                                                                                                          methods to this simple zero-shot transfer have been
                                         1    Introduction
                                                                                                          proposed so far for this task. In this paper, we pro-
                                         Question Answering is a fast growing research                    pose to generate synthetic data in a cross-lingual
                                         field, aiming to improve the capabilities of ma-                 fashion, borrowing the idea from monolingual QA
                                         chines to read and understand documents. Signifi-                research efforts (Duan et al., 2017). On English
                                         cant progress has recently been enabled by the use               corpora, generating synthetic questions has been
                                         of large pre-trained language models (Devlin et al.,             shown to significantly improve the performance of
                                         2019; Raffel et al., 2020), which reach human-level              QA models (Golub et al., 2017; Alberti et al., 2019).
                                         performances on several publicly available bench-                However, the adaptation of this technique to cross-
                                         marks, such as SQuAD (Rajpurkar et al., 2016) and                lingual QA is not straightforward: cross-lingual
                                         NewsQA (Trischler et al., 2017).                                 text generation is a challenging task per se which
                                            Given that the majority of large scale Ques-                  has not been yet extensively explored, in particular
                                         tion Answering (QA) datasets are in English (Her-                when no multilingual training data is available.
                                         mann et al., 2015; Rajpurkar et al., 2016; Choi                      We explore two Question Generation scenarios:
                                         et al., 2018), the importance to develop QA sys-                 i) requiring only SQuAD data; and ii) using a trans-
                                         tems targeting other languages, is currently been                lator tool to obtain translated versions of SQuAD.
                                         addressed with two cross-lingual QA datasets:                    As expected, the method leveraging on a transla-
                                         XQuAD (Artetxe et al., 2020) and MLQA (Lewis                     tor has shown to perform the best. Leveraging on
                                              ∗
                                                : equal contribution. The work of Arij Riabi was partly   such synthetic data, our best model obtains signif-
                                         carried out while she was working at reciTAL.                    icant improvements on XQuAD and MLQA over
the state-of-the-art for both Exact Match and F1         2019). Similar to QA, significant performance im-
scores. In addition, we evaluate the QA models           provements have been obtained using pre-trained
on languages not seen during training (even in the       language models (Dong et al., 2019). Still, due
synthetic data). On SQuAD-it and PIAF (fr), two          to the lack of multilingual datasets, most previous
Italian and French evaluation sets, we report a new      works have been limited to monolingual text gen-
state-of-the-art. This indicates that the proposed       eration. We note the exceptions of (Kumar et al.,
method allows to capture better multilingual rep-        2019) and (Chi et al., 2020), who resorted to multi-
resentations beyond the training languages. Our          lingual pre-training before fine-tuning on monolin-
method paves the way toward multilingual QA do-          gual downstream NLG tasks. However, the quality
main adaptation, especially for under-resourced          of the generated questions is reported to be below
languages.                                               the corresponding English ones.

2   Related Work                                         Question Generation for Question Answering
Question Answering (QA) QA is the task for               Data augmentation via synthetic data generation
which given a context and a question, a model has        is a well known technique to improve models’ ac-
to find the answer. The interest for Question An-        curacy and generalisation. It has found success-
swering goes back a long way: in a 1965 survey,          ful application in several areas, such as time series
Simmons (1965) reported fifteen implemented En-          analysis (Forestier et al., 2017) and computer vision
glish language question-answering systems. More          (Buslaev et al., 2020). In the context of QA, gener-
recently, with the rise of large scale datasets (Her-    ating synthetic questions to complete a dataset has
mann et al., 2015), and large pre-trained models         shown to improve QA performances (Duan et al.,
(Devlin et al., 2019), the performance drastically in-   2017; Alberti et al., 2019). So far, all these works
creased, approaching human-level performance on          have focused on English QA given the difficulty
standard benchmarks (see for instance the SQuAD          to generate questions in other languages without
leader board 1 . More challenging evaluation bench-      available data. This lack of data and the difficulty
marks have recently been proposed: Dua et al.            to obtain some constitutes the main motivation of
(2019) released the DROP dataset, for which the          our work and justify exploring cost-effective ap-
annotators are encouraged to annotate adversarial        proaches such as data augmentation via the genera-
questions; Burchell et al. (2020) released the MSQ       tion of questions.
dataset, consisting of multi-sentence questions.
   However, all these works are focused on English.      3       Data
Another popular research direction focuses on the
development of multilingual QA models. For this          3.1      English Training Data
purpose, the first step has been to provide the com-
                                                         SQuADen The original SQuAD (Rajpurkar et al.,
munity with multilingual evaluation sets: Artetxe
                                                         2016), which we refer as SQuADen for clarity in
et al. (2020) and Lewis et al. (2020) proposed con-
                                                         this paper. It is one of the first, and among the
currently two different evaluation sets which are
                                                         most popular, large scale QA datasets. It contains
comparable to the SQuAD development set. Both
                                                         about 100K question/paragraph/answer triplets in
reach the same conclusion: due to the lack of non-
                                                         English, annotated via Mechanical Turk.2
English training data, models don’t achieve the
same performance in Non-English languages than
they do in English. To the best of our knowledge,        QG datasets Any QA dataset can be reversed
no method has been proposed to fill this gap.            into a QG dataset, by switching the generation tar-
                                                         gets from the answers to the questions. In this
Question Generation QG can be seen as the                paper, we use the qg subscript to specify when the
dual task of QA: the input is composed of the an-        dataset is used for QG (e.g. SQuADen;qg indicates
swer and paragraph containing it, and the model is       the English SQuAD data in QG format).
trained to generate the question. Proposed by (Rus
et al., 2010), it has leveraged on the development of        2
                                                              Two versions of SQuAD have been released: v1.1, used
new QA datasets (Zhou et al., 2017; Scialom et al.,      in this work, and v2.0. The latter contains “unanswerable ques-
                                                         tions” in addition to those from v1.1. We use the former, since
  1
    https://rajpurkar.github.io/                         the multi-lingual evaluation datasets, MLQA and XQUAD, do
SQuAD-explorer/                                          not include unanswerable questions.
3.2   Synthetic Training Sets                           PIAF Keraron et al. (2020) provided an evalua-
SQuADtrans is a machine translated version of           tion set in French following the SQuAD protocol,
the SQuAD train set in the seven languages of           and containing 3835 examples.
MLQA, released by the authors together with their
                                                        KorQuAD 1.0 the Korean Question Answering
paper.
                                                        Dataset (Lim et al., 2019), a Korean dataset also
WikiScrap We collected 500 Wikipedia articles,          built following the SQuAD protocol.
following the SQuADen protocol, for all the lan-
guages present in MLQA. They are not paired with        SQuAD-it Derived from SQuADen , it was ob-
any question or answer. We use them as contexts         tained via semi-automatic translation to Italian
to generate synthetic multilingual questions, as de-    (Croce et al., 2018).
tailed in Section 4.2. We use project Nayuki’s
code3 to parse the top 10K Wikipedia pages ac-
                                                        4       Models
cording to the PageRank algorithm (Page et al.,         Recent works (Raffel et al., 2020; Lewis et al.,
1999). We then filter out the the paragraphs with       2019) have shown that classification tasks can be
character length outside of a [500, 1500] interval.     framed as a text-to-text problem, achieving state-
Articles with less than 5 paragraphs are discarded,     of-the-art results on established benchmark, such
since they tend to be less developed, in a lower        as GLUE (Wang et al., 2018). Accordingly, we
quality or being only redirection pages. Out of         employ the same architecture for both Question
the filtered articles, we randomly selected 500 per     Answering and Generation tasks. This also allows
language.                                               fairer comparisons for our purposes, by removing
3.3   Multilingual Evaluation Sets                      differences between QA and QG architectures and
                                                        their potential impact on the results obtained. In
XQuAD (Artetxe et al., 2020) is a human trans-          particular, we use a distilled version of XLM-R
lation of the SQuADen development set in 10 lan-        (Conneau et al., 2020): MiniLM-M (Wang et al.,
guages (Arabic, Chinese, German, Greek, Hindi,          2020) (see Section 4.3 for further details).
Russian, Spanish, Thai, Turkish, and Vietnamese),
with 1k QA pairs for each language.                     4.1      Baselines
MLQA (Lewis et al., 2020) is an evaluation              QANo-synth Following previous works, we fine-
dataset in 7 languages (English, Arabic, Chinese,       tuned the multilingual models on SQuADen , and
German, Hindi, and Spanish). The dataset is built       consider them as our baselines.
from aligned Wikipedia sentences across at least
two languages (full alignment between all lan-          English as Pivot Leveraging on translation mod-
guages being impossible), with the goal of provid-      els, we consider a second baseline method, which
ing natural rather than translated paragraphs. The      uses English as a pivot. First, both the question in
QA pairs are manually annotated on the English          language Lq and the paragraph in language Lp are
sentences and then human translated on the aligned      translated into English. We then invoke the base-
sentences. The dataset contains about 46K aligned       line model described above, QANo-synth , to predict
QA pairs in total.                                      the answer. Finally, the predicted answer is trans-
                                                        lated back into the target language Lp . We used the
Language-specific benchmarks In addition to             google translate API.4
the two aforementioned multilingual evaluation
data, we benchmark our models on three language-        4.2      Synthetic Data Augmentation
specific datasets for French, Italian and Korean,
                                                        In this work we consider data augmentation via
as detailed below. We choose these datasets since
                                                        generating synthetic questions, to improve the QA
none of these languages are present in XQuAD
                                                        performance. Different training schemes for the
or MLQA. Hence, they allow us to evaluate our
                                                        question generator are possible, resulting in dif-
models in a scenario where the target language is
                                                        ferent quality of synthetic data. Before this work,
not available during training, even for the synthetic
                                                        its impact on the final QA system remained unex-
questions.
                                                        plored, in the multilingual context.
  3
    https://www.nayuki.io/page/
                                                            4
computing-wikipedias-internal-pageranks                         https://translate.google.com
For all the following experiments, only the syn-     our models, we used the official MLQA evaluation
thetic data changes. Given a specific set of syn-       scripts.6
thetic data, we always follow the same two-stages
protocol, similar to (Alberti et al., 2019): we first   5     Results
train the QA model on the synthetic QA data, then       5.1    Question Generation
on SQuADen . We also tried to train the QA model
in one stage, with all the synthetic and human data     We report examples of generated questions in Ta-
shuffled together, but observed no improvements         ble 1.
over the baseline.                                      Controlling the Target Language In the con-
   We explored two different synthetic generation       text of multilingual text generation, controlling the
modes:                                                  target language is not trivial.
Synth the QG model is trained on SQuADen,qg                 When a QA model is trained only on English
(i.e., English data only) and the synthetic data are    data, at inference, given a non-English paragraph,
generated on WikiScrap. Under this setup, the only      it predicts the answer in the input language, as one
annotated samples this model has access to are          would expect, since it is an extractive process. Ide-
those from SQuAD-en.                                    ally, we would like to observe the same behavior for
                                                        a Question Generation model trained only on En-
Synth+trans the QG model is trained on                  glish data (such as Synth), leveraging on the mul-
SQuADtrans,qg in addition to SQuADen,qg . The           tilingual pre-training. Conversely to QA, QG is a
questions can be in a different languages than          language generation task. Multilingual generation
the context. Hence, the model needs an indica-          is much more challenging, as the model’s decoding
tion about the language it is expected to gener-        ability plays a major role. When a QG model is
ate the question. To control the target language,       fine-tuned only on English data (i.e SQuAD-en), its
we use a specific prompt per language, defin-           controllability of the target language suffers from
ing a special token , which corresponds           catastrophic forgetting: the input language do not
to the desired target language Y . Thus, the in-        propagate to the generated text. The questions are
put is structured as   Answer                generated in English, while still being related to the
 Context where  is a special to-              context. For instance, in Table 1, we observe that
ken separator, and  indicates to the model        the QGsynth model outputs English questions for
in what language the question should be generated.      the paragraphs in Chinese and Spanish. The same
These attributes offer flexibility on the target lan-   phenomenon was reported by Chi et al. (2020).
guage. Similar techniques are used in the literature
to control the style of the output (Keskar et al.,      Cross-language training To tackle the afore-
2019; Scialom et al., 2020; Chi et al., 2020).          mentioned limitation on target language control-
                                                        lability (i.e. to enable the generation in other lan-
4.3   Implementation details                            guages than English), multilingual data is needed.
For all our experiments we use Multilingual             We can leverage on the translated version of the
MiniLM v1 (MiniLM-m) (Wang et al., 2020), a             dataset to add the needed non-English examples.
12-layer with 384 hidden size architecture distilled    As detailed in Section 4.2, we simply use a spe-
from XLM-R Base multilingual (Conneau et al.,           cific prompt that corresponds to the target language
2020). With only 66M parameters, it is an order         (with N different prompts corresponding to the N
of magnitude smaller than state-of-the-art archi-       languages present in the dataset). In Table 1, we
tectures such as BERT-large or XLM-large. We            show how QGsynth+trans can generate questions in
used the official Microsoft implementation.5 For        the same language as the input. Further, the ques-
all the experiments—both QG and QA—we trained           tions seem much more relevant, coherent and fluent,
the model for 5 epochs, using the default hyper-        if compared to those produced by QGsynth : for the
parameters. We used a single nVidia gtx2080ti           Spanish paragraph, the question is well formed and
with 11G RAM, and the training times to circa 4         focused on the input answer; for the Chinese (see
and 2 hours amount respectively for Question Gen-       last row of Table 1 for QGsynth+trans ) is perfectly
eration and for Question Answering. To evaluate         written.
  5                                                       6
    Publicly available at https://github.com/               https://github.com/facebookresearch/
microsoft/unilm/tree/master/minilm.                     MLQA/blob/master/mlqa_evaluation_v1.py
Paragraph (EN) Peyton Manning became the first quarterback ever to lead two different teams to multiple Super Bowls.
  He is also the oldest quarterback ever to play in a Super Bowl at age 39. The past record was held by John Elway, who led
  the Broncos to victory in Super Bowl XXXIII at age 38 and is currently Denver’s Executive Vice President of Football
  Operations and General Manager.
  Answer Broncos
  QGsynth What team did John Elway lead to victory at age 38?
  QGsynth+trans (target language = en) What team did John Elway lead to win in the Super Bowl?
  Paragraph (ES) Peyton Manning se convirtió en el primer mariscal de campo de la historia en llevar a dos equipos
  diferentes a participar en múltiples Super Bowls. Ademas, es con 39 años, el mariscal de campo más longevo de la
  historia en jugar ese partido. El récord anterior estaba en manos de John Elway —mánager general y actual vicepresidente
  ejecutivo para operaciones futbolı́sticas de Denver— que condujo a los Broncos a la victoria en la Super Bowl XXXIII a
  los 38 años de edad.
  Answer Broncos
  QGsynth Where did Peyton Manning condujo?
  QGsynth+trans (target language = es) Qué equipo ganó el récord anterior? (Which team won the previous record?)
  QGsynth+trans (target language = en) What team did Menning win in the Super Bowl?
  Paragraph (ZH) 培顿·曼宁成为史上首位带领两支不同球队多次进入超级碗的四分卫。他也以39 岁高龄参加超
  级碗而成为史上年龄最大的四分卫。过去的记录是由约翰·埃尔维保持的,他在38岁时带领野马队赢得第33 届
  超级碗,目前担任丹佛的橄榄球运营执行副总裁兼总经理
  Answer 野马队
  QGsynth What is the name for the name that the name is used?
  QGsynth+trans (target language = zh) 约翰·埃尔维在13岁时带领哪支球队赢得第33届超级碗? (Which team did John
  Elvey lead to win the 33rd Super Bowl at the age of 13?)
  QGsynth+trans (target language = en) What team won the 33th Super Bowl?

Table 1: Example of questions generated by the different models on an XQuAD’s paragraph in different languages.
For QGsynth+trans , we report the outputs given two target languages, the one of the context and English.

5.2   Question Answering                                         consistent. Hence, the model was never exposed
We report the main results of our experiments on                 to different languages across the same input text.
XQuAD and MLQA in Table 2. The scores corre-                     The synthetic inputs are composed of questions
spond to the average over all the different possible             mostly in English (see examples in Table 1) while
combination of languages (de-de, de-ar, etc.).                   the contexts can be in any languages. Therefore,
                                                                 the QA model is exposed for the first time to a
English as Pivot Using English as pivot does not                 cross-lingual scenario. We hypothesise that such
lead to good results. This may be due to the evalu-              a cross-lingual ability is not innate for a default
ation metrics based on n-grams similarity. For ex-               multilingual model: exposing a model to this sce-
tractive QA, F1 and EM metrics measure the over-                 nario allows to develop this ability and contributes
lap between the predicted answer and the ground                  to improve its performance.
truth. Therefore, meaningful answers worded dif-
                                                                 Synthetic with translation (+synth+trans) :
ferently are penalized, a situation that is likely to
                                                                 For MiniLM+synth+trans , we obtain a much larger im-
occur because of the back-translation mechanism.
                                                                 provement over its baselines, MiniLM, compared
Therefore, automatic evaluation is challenging for
                                                                 to MiniLM+synth on both MLQA and XQuAD. This
this setup, and suffer from similar difficulties than
                                                                 confirms the intuition developed in the paragraph
in text generation (Sulem et al., 2018). As an ad-
                                                                 above, that independently of the multilingual ca-
ditional downside, it requires multiple translations
                                                                 pacity of the model, a cross-lingual ability is de-
at inference time. For these reasons, we decided to
                                                                 veloped when the two inputs components are not
not explore further this approach.
                                                                 exclusively written in the same language. In the
Synthetic without translation (+synth) Com-                      Section 5.3, we discuss more in depth this phe-
pared to the MiniLM baseline, we observe a small                 nomenon.
increase in term of performance for MiniLM+synth
(the Exact Match increases from 29.5 to 33.1 on                  5.3    Discussion
XQuAD and from 26.0 to 27.5 on MLQA).                            Cross Lingual Generalisation To explore the
   During the self-supervised pre-training stage, the            models’ effectiveness in dealing with cross lingual
model was exposed to multilingual inputs. Yet,                   inputs, we report in Figure 1 the performance for
for a given input, the target language was always                our MiniLM+synth+trans setup, varying the number of
#Params                      Trans.                  XQuAD                                                      MLQA
                                            MiniLM (Wang et al., 2020)                                                                     66M                     No                42.2 / 29.5                                                    38.4 / 26.0
                                            XLM (Hu et al., 2020)7                                                                        340M                     No                68.4/53.0                                                      65.4 / 47.9
                                            English as Pivot                                                                               66M                     Yes                                                                              36.1 / 23.0
                                            M iniLM+synth                                                                                 66M                      No                44.8 / 33.1                                                    39.8 / 27.5
                                            M iniLM+synth+trans                                                                           66M                      Yes               63.3 / 49.1                                                    56.1 / 41.4
                                            XLM+synth+trans                                                                               340M                     Yes               74.3 / 59.2                                                    65.3 / 49.2

                                               Table 2: Results (F1 / EM) of the different QA models on XQuAD and MLQA.

                                                                                                                                                                                                                                                                                     Not All Languages (f1 Cross)
             55.0                                                                                                                                                                                                               40.0%                                                All Languages (f1 Cross)
                                                                                                                60

                                                                                                                                                                                                    Relative Improvement (f1)
             52.5
                                                                                                                                                                                                                                30.0%
             50.0
                                                                                                                55
                                                                                                   Score (f1)
Score (f1)

             47.5                                                                                                                                                                                                               20.0%
             45.0                                                                                               50
                                                                                                                                                                                                                                10.0%
             42.5
                                                           Not all languages (f1 Cross)                         45                                          Not all languages (f1 Cross)
             40.0                                                                                                                                                                                                                0.0%
                                                           All languages (f1 Cross)                                                                         All languages (f1 Cross)

                                                                                                                                                                                                                                        0 to 125K

                                                                                                                                                                                                                                                                8K

                                                                                                                                                                                                                                                                                7K

                                                                                                                                                                                                                                                                                                     7K

                                                                                                                                                                                                                                                                                                           (Synth Cr 6K
                                                                                                                                                                                                                                                                                                                    oss )
                                                                                                                                   125K

                                                                                                                                                208K

                                                                                                                                                           277K

                                                                                                                                                                       347K
                                  125K

                                               208K

                                                          277K

                                                                      347K

                                                                                                                     (Bas0eline)
                    (Bas0eline)

                                                                                                                                                                                            oss )
                                                                                           oss )

                                                                                                                                                                                                                                                     125K to 20

                                                                                                                                                                                                                                                                     208K to 27

                                                                                                                                                                                                                                                                                          277K to 34

                                                                                                                                                                                                                                                                                                          347K to 41
                                                                                                                                                                                  (Synth6KCr
                                                                                 (Synth6KCr

                                                                                                                                                                                     41
                                                                                    41

                                         Number of synthetic questions                                                                    Number of synthetic questions                                                                                     Number of synthetic questions

Figure 1: Left: F1 score on MLQA, for models with different number of synthetic data in two setups: for All Lan-
guages, the synthetic questions are sampled among all the 5 different MLQA’s languages; for Not All Languages,
the synthetic questions are sampled progressively from only one language, two, . . . , to all 5 for the last point, which
corresponds to All Languages. Middle: same as on the left but evaluated on XQuAD. Right: The relative variation
in performance for these models.

samples and the languages present in the synthetic                                                                                                          However, it seems that even with only one lan-
data. The abscissa x corresponds to the progres-                                                                                                         guage pair present, the model is able to develop a
sively increasing number of synthetic samples used;                                                                                                      cross lingual ability that benefits –in part– on other
at x = 0 corresponds to the baseline MiniLM+trans ,                                                                                                      languages: at the right-hand side of Figure 1, we
where the model has access only to the original                                                                                                          can see that most of the improvement is happen-
English data from SQuADen . We explore two sam-                                                                                                          ing given only one cross-lingual language pair (i.e.
pling strategies for the synthetic examples:                                                                                                             English and Spanish).

             1. All Languages corresponds to sampling the                                                                                                Unseen Languages To measure the benefit of
                examples from any of the different languages.                                                                                            our approach on unseen languages (i.e. lan-
                                                                                                                                                         guage not present in the synthetic data from
             2. Conversely, for Not All Languages, we pro-                                                                                               MLQA/XQuAD), we test our models on three QA
                gressively added the different languages: for                                                                                            evaluation sets: PIAF (fr), KorQuAD and SQuAD-
                x = 125K, all the 125K synthetic data are on                                                                                             it (see Section 3.3. The results are consistent with
                a Spanish input. Then German in addition of                                                                                              the previous experiments on MLQA and XQuAD.
                Spanish; Hindu; finally, Arab, all the MLQA                                                                                              Our MiniLM+synth+trans model improved by more
                languages are present.                                                                                                                   than 4 Exact Match points over its baseline, while
                                                                                                                                                         XLM+synth+trans obtains a new state-of-the-art. No-
   In Figure 1, we observe that the performance for                                                                                                      tably, our multilingual XLM+synth+trans outperforms
All Language Cross increases largely at the begin-                                                                                                       CamemBERT on PIAF, while the later is a pure
ning, then remains mostly stable. Conversely, we                                                                                                         monolingual, in domain language model.
can observe a gradually improvement for Not All
Language Cross, as more languages are present in                                                                                                        Hidden distillation effect The relative im-
the training. It indicates that when all the languages                                                                                                  provement for our best synthetic configuration
are present in the synthetic data, the model obtains                                                                                                    +synth+trans over the baseline, is superior to 60%
immediately a cross-lingual ability.                                                                                                                    EM for MiniLM (from 29.5 to 49.5 on XQuAD
#Params          PIAF (fr)    KorQuAD        SQuAD-it
         MiniLM                                     66M         58.9 / 34.3   53.3 / 40.5   72.0 / 57.7
         BERT-m (Croce et al., 2018)               110M                   -             -   74.1 / 62.5
         CamemBERT (Martin et al., 2020)           340M            68.9 / -             -             -
         MiniLM+synth                               66M         58.6 / 34.5   52.1 / 39.0   71.3 / 58.0
         MiniLM+synth+trans                         66M         63.9 / 40.6   60.0 / 48.8   74.5 / 62.0
         XLM+synth+trans                           340M         72.1 / 47.1   63.0 / 52.8   80.4 / 67.6

Table 3: Zero-shot results (F1 / EM) on PIAF, KorQuAD and SQuAD-it for our different QA models, compared to
various baselines. Note that CamemBERT actually corresponds to a French version of RoBERTa, an architecture
widely outperforming BERT.

and from 26.0 to 41.4 on MLQA). It is way more          that the model might adopt the correct language at
important than for XLM, where the EM improved           inference, even for a target language unseen during
by +11.7% on XQuAD and +2.71% on MLQA.                  training.8 A similar intuition has been explored in
It indicates that XLM contains more cross-lingual       GPT-2 (Radford et al., 2019): the authors report
transfer abilities than MiniLM. Since the latter is a   an improvement for summarization when the input
distilled version of the former, we hypothesize that    text is followed by “TL;DR” (i.e. Too Long Didn’t
such abilities may have been lost during the distil-    Read).
lation process. Such loss of generalisation can be         At inference time, we evaluated this approach on
difficult to identify, and opens questions for future   French with the prompt language=Français.
work.                                                   Unfortunately, the model did not succeed to gener-
                                                        ate text in French.
On target language control for text generation             Controlling the target language in the context
When relying on the translated versions of SQuAD,       of multilingual text generation remains under ex-
the target language for generating synthetic can        plored, and progress in this direction could have
easily be controlled, and results in fluent and rel-    direct applications to improve this work, and be-
evant questions in the various languages (see Ta-       yond.
ble 1). However, one limitation of this approach
is that synthetic questions can only be generated       6       Conclusion
in the languages that were available during train-
ing: the  prompts are special tokens that         In this work, we presented a method to generate
are randomly initialised when fine-tuning QG on         synthetic QA dataset in a multilingual fashion and
SQuADtrans;qg . They are not semantically related,      showed how QA models can benefit from it, and re-
before fine-tuning, to the name of corresponding        porting large improvements over the baseline. Our
language (“English”, “Español” etc.) and, thus, the    method contributes to fill the gap between English
learned representation for the  tokens is         and other languages. Nonetheless, our method has
limited to the languages present in the training set.   shown to generalize even for languages not present
   To the best of our knowledge, no method allows       in the synthetic corpus (e.g. French, Italian, Ko-
so far to generalize this target control to unseen      rean).
language. It would be valuable, for instance, to be        In future work, we plan to investigate if the pro-
able to generate synthetic data in Korean, French       posed data augmentation method could be applied
and Italian, without having to translate the entire     to other multilingual tasks, such as classification.
SQuAD−en dataset in these three languages to            We will also experiment more in depth with dif-
then fine-tune the QG model.                            ferent strategies to control the target language of a
   To this purpose, we report –alas, as a negative      model, and extrapolate on unseen ones. We believe
result– the following attempt: instead of control-      that this difficulty to extrapolate might highlight an
ling the target language with a special, randomly       important limitation of current language models.
initialised, token, we used a token semantically re-        8
                                                             With unseen during training, we mean not present in the
lated to the language-word: “English”, “Español”       QG dataset; obviously, the language should have been present
for Spanish, or “中文” for Chinese. The intuition is      in the first self-supervised stage.
Acknowledgments                                           ficial Intelligence, pages 389–402, Cham. Springer
                                                          International Publishing.
Djamé Seddah was partly funded by the ANR
project PARSITI (ANR-16-CE33-0021), Arij Ri-            Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
abi was partly funded by Benoı̂t Sagot’s chair in          Kristina Toutanova. 2019. BERT: Pre-training of
                                                           deep bidirectional transformers for language under-
the PRAIRIE institute as part of the French na-            standing. In Proceedings of the 2019 Conference
tional agency ANR “Investissements d’avenir” pro-          of the North American Chapter of the Association
gramme (ANR-19-P3IA-0001).                                 for Computational Linguistics: Human Language
                                                          Technologies, Volume 1 (Long and Short Papers),
                                                           pages 4171–4186, Minneapolis, Minnesota. Associ-
References                                                 ation for Computational Linguistics.

Chris Alberti, Daniel Andor, Emily Pitler, Jacob De-    Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xi-
  vlin, and Michael Collins. 2019. Synthetic QA cor-      aodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou,
  pora generation with roundtrip consistency. In Pro-     and Hsiao-Wuen Hon. 2019. Unified language
  ceedings of the 57th Annual Meeting of the Asso-        model pre-training for natural language understand-
  ciation for Computational Linguistics, pages 6168–      ing and generation. In Advances in Neural Informa-
  6173, Florence, Italy. Association for Computa-         tion Processing Systems, pages 13063–13075.
  tional Linguistics.
                                                        Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel
Mikel Artetxe, Sebastian Ruder, and Dani Yogatama.        Stanovsky, Sameer Singh, and Matt Gardner. 2019.
  2020. On the cross-lingual transferability of mono-     DROP: A reading comprehension benchmark requir-
  lingual representations. In Proceedings of the 58th     ing discrete reasoning over paragraphs. In Proceed-
  Annual Meeting of the Association for Computa-          ings of the 2019 Conference of the North American
  tional Linguistics, pages 4623–4637, Online. Asso-      Chapter of the Association for Computational Lin-
  ciation for Computational Linguistics.                  guistics: Human Language Technologies, Volume 1
Laurie Burchell, Jie Chi, Tom Hosking, Nina               (Long and Short Papers), pages 2368–2378, Min-
  Markl, and Bonnie Webber. 2020. Querent in-             neapolis, Minnesota. Association for Computational
  tent in multi-sentence questions. arXiv preprint        Linguistics.
  arXiv:2010.08980.
                                                        Nan Duan, Duyu Tang, Peng Chen, and Ming Zhou.
Alexander Buslaev, Vladimir I Iglovikov, Eugene           2017. Question generation for question answering.
  Khvedchenya, Alex Parinov, Mikhail Druzhinin,           In Proceedings of the 2017 Conference on Empiri-
  and Alexandr A Kalinin. 2020. Albumentations:           cal Methods in Natural Language Processing, pages
  fast and flexible image augmentations. Information,     866–874, Copenhagen, Denmark. Association for
  11(2):125.                                              Computational Linguistics.

Zewen Chi, Li Dong, Furu Wei, Wenhui Wang, Xian-        Germain Forestier, François Petitjean, Hoang Anh
  Ling Mao, and Heyan Huang. 2020. Cross-lingual          Dau, Geoffrey I Webb, and Eamonn Keogh. 2017.
  natural language generation via pre-training. In        Generating synthetic time series to augment sparse
  AAAI, pages 7570–7577.                                  datasets. In 2017 IEEE international conference on
                                                          data mining (ICDM), pages 865–870. IEEE.
Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-
  tau Yih, Yejin Choi, Percy Liang, and Luke Zettle-    David Golub, Po-Sen Huang, Xiaodong He, and
  moyer. 2018. QuAC: Question answering in con-           Li Deng. 2017. Two-stage synthesis networks for
  text. In Proceedings of the 2018 Conference on          transfer learning in machine comprehension. In Pro-
  Empirical Methods in Natural Language Processing,       ceedings of the 2017 Conference on Empirical Meth-
  pages 2174–2184, Brussels, Belgium. Association         ods in Natural Language Processing, pages 835–
  for Computational Linguistics.                          844, Copenhagen, Denmark. Association for Com-
Alexis Conneau, Kartikay Khandelwal, Naman Goyal,         putational Linguistics.
  Vishrav Chaudhary, Guillaume Wenzek, Francisco
  Guzmán, Edouard Grave, Myle Ott, Luke Zettle-        Karl Moritz Hermann, Tomas Kocisky, Edward Grefen-
  moyer, and Veselin Stoyanov. 2020. Unsupervised         stette, Lasse Espeholt, Will Kay, Mustafa Suleyman,
  cross-lingual representation learning at scale. In      and Phil Blunsom. 2015. Teaching machines to read
  Proceedings of the 58th Annual Meeting of the Asso-     and comprehend. In Advances in neural information
  ciation for Computational Linguistics, pages 8440–      processing systems, pages 1693–1701.
  8451, Online. Association for Computational Lin-
  guistics.                                             Junjie Hu, Sebastian Ruder, Aditya Siddhant, Gra-
                                                          ham Neubig, Orhan Firat, and Melvin Johnson.
Danilo Croce, Alexandra Zelenanska, and Roberto           2020. Xtreme: A massively multilingual multi-task
  Basili. 2018. Neural learning for question answer-      benchmark for evaluating cross-lingual generaliza-
  ing in italian. In AI*IA 2018 – Advances in Arti-       tion. arXiv preprint arXiv:2003.11080.
Rachel Keraron, Guillaume Lancrenon, Mathilde Bras,         transformer. Journal of Machine Learning Research,
  Frédéric Allary, Gilles Moyse, Thomas Scialom,          21(140):1–67.
  Edmundo-Pavel Soriano-Morales, and Jacopo Sta-
  iano. 2020. Project PIAF: Building a native French      Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and
  question-answering dataset. In Proceedings of the         Percy Liang. 2016. SQuAD: 100,000+ questions for
  12th Language Resources and Evaluation Confer-            machine comprehension of text. In Proceedings of
  ence, pages 5481–5490, Marseille, France. Euro-           the 2016 Conference on Empirical Methods in Natu-
  pean Language Resources Association.                      ral Language Processing, pages 2383–2392, Austin,
                                                            Texas. Association for Computational Linguistics.
Nitish Shirish Keskar, Bryan McCann, Lav R Varshney,
  Caiming Xiong, and Richard Socher. 2019. Ctrl: A        Vasile Rus, Brendan Wyse, Paul Piwek, Mihai Lintean,
  conditional transformer language model for control-       Svetlana Stoyanchev, and Christian Moldovan. 2010.
  lable generation. arXiv preprint arXiv:1909.05858.        The first question generation shared task evaluation
                                                            challenge. In Proceedings of the 6th International
Vishwajeet Kumar, Nitish Joshi, Arijit Mukherjee,           Natural Language Generation Conference.
  Ganesh Ramakrishnan, and Preethi Jyothi. 2019.
  Cross-lingual training for automatic question gen-      Thomas Scialom, Benjamin Piwowarski, and Jacopo
  eration. In Proceedings of the 57th Annual Meet-          Staiano. 2019.      Self-attention architectures for
  ing of the Association for Computational Linguis-         answer-agnostic neural question generation. In Pro-
  tics, pages 4863–4872, Florence, Italy. Association       ceedings of the 57th Annual Meeting of the Asso-
  for Computational Linguistics.                            ciation for Computational Linguistics, pages 6027–
                                                            6032, Florence, Italy. Association for Computa-
Mike Lewis, Yinhan Liu, Naman Goyal, Mar-                   tional Linguistics.
  jan Ghazvininejad, Abdelrahman Mohamed, Omer
  Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019.         Thomas Scialom, Serra Sinem Tekiroglu, Jacopo Sta-
  Bart: Denoising sequence-to-sequence pre-training         iano, and Marco Guerini. 2020. Toward stance-
  for natural language generation, translation, and         based personas for opinionated dialogues. arXiv
  comprehension. arXiv preprint arXiv:1910.13461.           preprint arXiv:2010.03369.
                                                          Robert F Simmons. 1965. Answering english ques-
Patrick Lewis, Barlas Oguz, Ruty Rinott, Sebastian
                                                            tions by computer: a survey. Communications of
  Riedel, and Holger Schwenk. 2020. MLQA: Evalu-
                                                            the ACM, 8(1):53–70.
  ating cross-lingual extractive question answering. In
  Proceedings of the 58th Annual Meeting of the Asso-     Elior Sulem, Omri Abend, and Ari Rappoport. 2018.
  ciation for Computational Linguistics, pages 7315–         BLEU is not suitable for the evaluation of text sim-
  7330, Online. Association for Computational Lin-           plification. In Proceedings of the 2018 Conference
  guistics.                                                  on Empirical Methods in Natural Language Process-
                                                             ing, pages 738–744, Brussels, Belgium. Association
Seungyoung Lim, Myungji Kim, and Jooyoul Lee.                for Computational Linguistics.
  2019. Korquad1. 0: Korean qa dataset for ma-
  chine reading comprehension. arXiv preprint             Adam Trischler, Tong Wang, Xingdi Yuan, Justin Har-
  arXiv:1909.07005.                                         ris, Alessandro Sordoni, Philip Bachman, and Ka-
                                                            heer Suleman. 2017. NewsQA: A machine compre-
Louis Martin, Benjamin Muller, Pedro Javier Or-             hension dataset. In Proceedings of the 2nd Work-
  tiz Suárez, Yoann Dupont, Laurent Romary, Éric          shop on Representation Learning for NLP, pages
  de la Clergerie, Djamé Seddah, and Benoı̂t Sagot.        191–200, Vancouver, Canada. Association for Com-
  2020. CamemBERT: a tasty French language model.           putational Linguistics.
  In Proceedings of the 58th Annual Meeting of the
  Association for Computational Linguistics, pages        Alex Wang, Amanpreet Singh, Julian Michael, Fe-
  7203–7219, Online. Association for Computational          lix Hill, Omer Levy, and Samuel Bowman. 2018.
  Linguistics.                                              GLUE: A multi-task benchmark and analysis plat-
                                                            form for natural language understanding. In Pro-
Lawrence Page, Sergey Brin, Rajeev Motwani, and             ceedings of the 2018 EMNLP Workshop Black-
  Terry Winograd. 1999. The pagerank citation rank-         boxNLP: Analyzing and Interpreting Neural Net-
  ing: Bringing order to the web. Technical report,         works for NLP, pages 353–355, Brussels, Belgium.
  Stanford InfoLab.                                         Association for Computational Linguistics.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan,        Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao,
  Dario Amodei, and Ilya Sutskever. 2019. Language          Nan Yang, and Ming Zhou. 2020. Minilm: Deep
  models are unsupervised multitask learners. OpenAI        self-attention distillation for task-agnostic compres-
  blog, 1(8):9.                                             sion of pre-trained transformers. arXiv preprint
                                                            arXiv:2002.10957.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine
  Lee, Sharan Narang, Michael Matena, Yanqi Zhou,         Qingyu Zhou, Nan Yang, Furu Wei, Chuanqi Tan,
  Wei Li, and Peter J Liu. 2020. Exploring the lim-         Hangbo Bao, and Ming Zhou. 2017. Neural ques-
  its of transfer learning with a unified text-to-text      tion generation from text: A preliminary study. In
National CCF Conference on Natural Language
Processing and Chinese Computing, pages 662–671.
Springer.
You can also read