Synthetic Data Augmentation for Zero-Shot Cross-Lingual Question Answering

Arij Riabi‡∗  Thomas Scialom†∗  Rachel Keraron†
Benoît Sagot‡  Djamé Seddah‡  Jacopo Staiano†

‡ INRIA Paris
  Sorbonne Université, CNRS, LIP6, F-75005 Paris, France
† reciTAL, Paris, France

{thomas,rachel,jacopo}@recital.ai
{arij.riabi,benoit.sagot,djame.seddah}@inria.fr

∗: equal contribution. The work of Arij Riabi was partly carried out while she was working at reciTAL.

Abstract

Coupled with the availability of large scale datasets, deep learning architectures have enabled rapid progress on the Question Answering task. However, most of those datasets are in English, and the performances of state-of-the-art multilingual models are significantly lower when evaluated on non-English data. Due to high data collection costs, it is not realistic to obtain annotated data for each language one desires to support.

We propose a method to improve Cross-lingual Question Answering performance without requiring additional annotated data, leveraging Question Generation models to produce synthetic samples in a cross-lingual fashion. We show that the proposed method significantly outperforms the baselines trained on English data only. We report a new state-of-the-art on four multilingual datasets: MLQA, XQuAD, SQuAD-it and PIAF (fr).

1 Introduction

Question Answering is a fast growing research field, aiming to improve the capabilities of machines to read and understand documents. Significant progress has recently been enabled by the use of large pre-trained language models (Devlin et al., 2019; Raffel et al., 2020), which reach human-level performances on several publicly available benchmarks, such as SQuAD (Rajpurkar et al., 2016) and NewsQA (Trischler et al., 2017).

Given that the majority of large scale Question Answering (QA) datasets are in English (Hermann et al., 2015; Rajpurkar et al., 2016; Choi et al., 2018), the need to develop QA systems targeting other languages is currently being addressed with two cross-lingual QA datasets, XQuAD (Artetxe et al., 2020) and MLQA (Lewis et al., 2020), covering respectively 10 and 7 languages. Due to the cost of annotation, both are limited to an evaluation set only. They are comparable to the validation set of the original SQuAD (see more details in Section 3.3). In both datasets, each paragraph is paired with questions in various languages. This enables the evaluation of models in an experimental scenario in which the input context and question can be in two different languages. This scenario has practical applications, e.g. querying a set of documents in various languages.

Performing this cross-lingual task is complex and remains challenging for current models, assuming only English training data. Transfer results are shown to rank behind training-language performance (Artetxe et al., 2020; Lewis et al., 2020). In other words, multilingual models fine-tuned only on English data are found to perform better for English than for other languages.

To the best of our knowledge, no alternative methods to this simple zero-shot transfer have been proposed so far for this task. In this paper, we propose to generate synthetic data in a cross-lingual fashion, borrowing the idea from monolingual QA research efforts (Duan et al., 2017). On English corpora, generating synthetic questions has been shown to significantly improve the performance of QA models (Golub et al., 2017; Alberti et al., 2019). However, the adaptation of this technique to cross-lingual QA is not straightforward: cross-lingual text generation is a challenging task per se which has not yet been extensively explored, in particular when no multilingual training data is available.

We explore two Question Generation scenarios: i) requiring only SQuAD data; and ii) using a translation tool to obtain translated versions of SQuAD. As expected, the method leveraging a translator performs best. Leveraging such synthetic data, our best model obtains significant improvements on XQuAD and MLQA over
the state-of-the-art for both Exact Match and F1 scores. In addition, we evaluate the QA models on languages not seen during training (not even in the synthetic data). On SQuAD-it and PIAF (fr), two Italian and French evaluation sets, we report a new state-of-the-art. This indicates that the proposed method captures better multilingual representations beyond the training languages. Our method paves the way toward multilingual QA domain adaptation, especially for under-resourced languages.

2 Related Work

Question Answering (QA) QA is the task in which, given a context and a question, a model has to find the answer. The interest in Question Answering goes back a long way: in a 1965 survey, Simmons (1965) reported fifteen implemented English language question-answering systems. More recently, with the rise of large scale datasets (Hermann et al., 2015) and large pre-trained models (Devlin et al., 2019), performance drastically increased, approaching human level on standard benchmarks (see for instance the SQuAD leaderboard1). More challenging evaluation benchmarks have recently been proposed: Dua et al. (2019) released the DROP dataset, for which the annotators were encouraged to annotate adversarial questions; Burchell et al. (2020) released the MSQ dataset, consisting of multi-sentence questions.

However, all these works are focused on English. Another popular research direction focuses on the development of multilingual QA models. For this purpose, the first step has been to provide the community with multilingual evaluation sets: Artetxe et al. (2020) and Lewis et al. (2020) concurrently proposed two different evaluation sets which are comparable to the SQuAD development set. Both reach the same conclusion: due to the lack of non-English training data, models do not achieve the same performance in non-English languages as they do in English. To the best of our knowledge, no method has been proposed to fill this gap.

Question Generation QG can be seen as the dual task of QA: the input is composed of the answer and the paragraph containing it, and the model is trained to generate the question. Proposed by Rus et al. (2010), it has leveraged the development of new QA datasets (Zhou et al., 2017; Scialom et al., 2019). Similarly to QA, significant performance improvements have been obtained using pre-trained language models (Dong et al., 2019). Still, due to the lack of multilingual datasets, most previous works have been limited to monolingual text generation. We note the exceptions of Kumar et al. (2019) and Chi et al. (2020), who resorted to multilingual pre-training before fine-tuning on monolingual downstream NLG tasks. However, the quality of the generated questions is reported to be below that of the corresponding English ones.

Question Generation for Question Answering Data augmentation via synthetic data generation is a well known technique to improve models' accuracy and generalisation. It has found successful application in several areas, such as time series analysis (Forestier et al., 2017) and computer vision (Buslaev et al., 2020). In the context of QA, generating synthetic questions to complete a dataset has been shown to improve QA performance (Duan et al., 2017; Alberti et al., 2019). So far, all these works have focused on English QA, given the difficulty of generating questions in other languages without available data. This lack of data, and the difficulty of obtaining it, constitutes the main motivation of our work and justifies exploring cost-effective approaches such as data augmentation via the generation of questions.

[1] https://rajpurkar.github.io/SQuAD-explorer/
3 Data

3.1 English Training Data

SQuADen The original SQuAD (Rajpurkar et al., 2016), which we refer to as SQuADen for clarity in this paper, is one of the first, and among the most popular, large scale QA datasets. It contains about 100K question/paragraph/answer triplets in English, annotated via Mechanical Turk.2

[2] Two versions of SQuAD have been released: v1.1, used in this work, and v2.0. The latter contains "unanswerable questions" in addition to those from v1.1. We use the former, since the multilingual evaluation datasets, MLQA and XQuAD, do not include unanswerable questions.

QG datasets Any QA dataset can be reversed into a QG dataset, by switching the generation targets from the answers to the questions. In this paper, we use the qg subscript to specify when a dataset is used for QG (e.g. SQuADen,qg indicates the English SQuAD data in QG format).
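For illustration, the following minimal sketch (our code, not the authors' release; the separator string is an assumption) reverses a SQuAD v1.1 record into QG training pairs as described above, with the answer and its paragraph as input and the question as generation target:

    # A minimal sketch (not the authors' code) of reversing a SQuAD-format
    # record into QG pairs: input = answer + context, target = question.
    from typing import Dict, List

    def squad_to_qg(article: Dict) -> List[Dict]:
        """Turn one SQuAD article into (input, target) pairs for QG."""
        pairs = []
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                answer = qa["answers"][0]["text"]  # v1.1: at least one answer
                pairs.append({
                    "input": f"{answer} </s> {context}",  # separator is ours
                    "target": qa["question"],
                })
        return pairs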
3.2 Synthetic Training Sets

SQuADtrans is a machine translated version of the SQuAD train set in the seven languages of MLQA, released by the authors together with their paper.

WikiScrap We collected 500 Wikipedia articles, following the SQuADen protocol, for all the languages present in MLQA. They are not paired with any question or answer; we use them as contexts to generate synthetic multilingual questions, as detailed in Section 4.2. We use project Nayuki's code3 to parse the top 10K Wikipedia pages according to the PageRank algorithm (Page et al., 1999). We then filter out the paragraphs with a character length outside a [500, 1500] interval. Articles with less than 5 paragraphs are discarded, since they tend to be less developed, of lower quality, or mere redirection pages. Out of the filtered articles, we randomly selected 500 per language.

[3] https://www.nayuki.io/page/computing-wikipedias-internal-pageranks
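The filtering heuristics above can be sketched as follows (our code; names are illustrative, and whether the 5-paragraph threshold applies before or after the length filter is our reading of the text):

    # A sketch of the WikiScrap filtering heuristics as we read them.
    import random

    def filter_articles(articles, per_language=500, seed=0):
        """articles: list of lists of paragraph strings for one language."""
        kept = []
        for paragraphs in articles:
            # Keep paragraphs whose character length falls in [500, 1500].
            good = [p for p in paragraphs if 500 <= len(p) <= 1500]
            # Drop short articles: they tend to be stubs or redirect pages.
            if len(good) >= 5:
                kept.append(good)
        random.Random(seed).shuffle(kept)
        return kept[:per_language]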
3.3 Multilingual Evaluation Sets

XQuAD (Artetxe et al., 2020) is a human translation of the SQuADen development set into 10 languages (Arabic, Chinese, German, Greek, Hindi, Russian, Spanish, Thai, Turkish, and Vietnamese), with 1k QA pairs for each language.

MLQA (Lewis et al., 2020) is an evaluation dataset in 7 languages (English, Arabic, Chinese, German, Hindi, Spanish, and Vietnamese). The dataset is built from aligned Wikipedia sentences across at least two languages (full alignment between all languages being impossible), with the goal of providing natural rather than translated paragraphs. The QA pairs are manually annotated on the English sentences and then human translated on the aligned sentences. The dataset contains about 46K aligned QA pairs in total.

Language-specific benchmarks In addition to the two aforementioned multilingual evaluation sets, we benchmark our models on three language-specific datasets for French, Italian and Korean, as detailed below. We choose these datasets since none of these languages is present in XQuAD or MLQA. Hence, they allow us to evaluate our models in a scenario where the target language is not available during training, not even for the synthetic questions.

PIAF Keraron et al. (2020) provided an evaluation set in French following the SQuAD protocol, containing 3835 examples.

KorQuAD 1.0 The Korean Question Answering Dataset (Lim et al., 2019), a Korean dataset also built following the SQuAD protocol.

SQuAD-it Derived from SQuADen, it was obtained via semi-automatic translation to Italian (Croce et al., 2018).

4 Models

Recent works (Raffel et al., 2020; Lewis et al., 2019) have shown that classification tasks can be framed as text-to-text problems, achieving state-of-the-art results on established benchmarks such as GLUE (Wang et al., 2018). Accordingly, we employ the same architecture for both the Question Answering and Question Generation tasks. This also allows fairer comparisons for our purposes, by removing differences between QA and QG architectures and their potential impact on the results obtained. In particular, we use a distilled version of XLM-R (Conneau et al., 2020): MiniLM-M (Wang et al., 2020) (see Section 4.3 for further details).

4.1 Baselines

QANo-synth Following previous works, we fine-tuned the multilingual models on SQuADen, and consider them as our baselines.

English as Pivot Leveraging translation models, we consider a second baseline method, which uses English as a pivot. First, both the question in language Lq and the paragraph in language Lp are translated into English. We then invoke the baseline model described above, QANo-synth, to predict the answer. Finally, the predicted answer is translated back into the target language Lp. We used the Google Translate API.4

[4] https://translate.google.com
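As a sketch of this pivot pipeline (our code; translate and qa_model are placeholders standing in for the translation API and the SQuADen fine-tuned QA model):

    # A sketch of the English-as-pivot baseline described above.
    def pivot_answer(question: str, context: str, q_lang: str, c_lang: str,
                     translate, qa_model) -> str:
        # 1. Translate both inputs into English.
        question_en = translate(question, src=q_lang, dst="en")
        context_en = translate(context, src=c_lang, dst="en")
        # 2. Run the English-only QA baseline.
        answer_en = qa_model(question=question_en, context=context_en)
        # 3. Translate the predicted answer back to the context language.
        return translate(answer_en, src="en", dst=c_lang)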
4.2 Synthetic Data Augmentation

In this work we consider data augmentation via the generation of synthetic questions, to improve QA performance. Different training schemes for the question generator are possible, resulting in different quality of the synthetic data. Before this work, their impact on the final QA system remained unexplored in the multilingual context.
For all the following experiments, only the synthetic data changes. Given a specific set of synthetic data, we always follow the same two-stage protocol, similar to Alberti et al. (2019): we first train the QA model on the synthetic QA data, then on SQuADen. We also tried to train the QA model in one stage, with all the synthetic and human data shuffled together, but observed no improvements over the baseline.
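A minimal sketch of the two-stage protocol (our pseudocode; fine_tune stands for a standard training loop, and all names are ours):

    # A schematic of the two-stage protocol: first the synthetic data,
    # then the human-annotated SQuAD_en data.
    def two_stage_training(model, synthetic_qa, squad_en, fine_tune):
        # Stage 1: train on synthetic question/answer pairs only.
        model = fine_tune(model, synthetic_qa)
        # Stage 2: continue training on the original English SQuAD.
        model = fine_tune(model, squad_en)
        return model

    # Shuffling synthetic and human data into a single stage was also
    # tried, but brought no improvement over the baseline.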
We explored two different synthetic generation modes:

Synth the QG model is trained on SQuADen,qg (i.e., English data only) and the synthetic data are generated on WikiScrap. Under this setup, the only annotated samples this model has access to are those from SQuADen.

Synth+trans the QG model is trained on SQuADtrans,qg in addition to SQuADen,qg. The questions can be in a different language than the context. Hence, the model needs an indication of the language in which it is expected to generate the question. To control the target language, we use a specific prompt per language, defining a special token <Y> which corresponds to the desired target language Y. Thus, the input is structured as <Y> <SEP> Answer <SEP> Context, where <SEP> is a special separator token, and <Y> indicates to the model in what language the question should be generated. These attributes offer flexibility on the target language. Similar techniques are used in the literature to control the style of the output (Keskar et al., 2019; Scialom et al., 2020; Chi et al., 2020).
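As an illustration of this input format (our reconstruction; the exact token strings were stripped from this version of the text and are assumptions, as is the language inventory, taken here as the seven MLQA languages):

    # A sketch of the Synth+trans input format: <Y> <SEP> Answer <SEP> Context.
    LANG_TOKENS = {"en": "<en>", "ar": "<ar>", "de": "<de>", "es": "<es>",
                   "hi": "<hi>", "vi": "<vi>", "zh": "<zh>"}

    def build_qg_input(answer: str, context: str, target_lang: str) -> str:
        # <Y> selects the language of the generated question.
        return f"{LANG_TOKENS[target_lang]} <SEP> {answer} <SEP> {context}"

    # e.g. build_qg_input("Broncos", paragraph_es, "es") asks the QG model
    # to generate a Spanish question whose answer is "Broncos".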
                                                        lability (i.e. to enable the generation in other lan-
4.3   Implementation details                            guages than English), multilingual data is needed.
For all our experiments we use Multilingual             We can leverage on the translated version of the
MiniLM v1 (MiniLM-m) (Wang et al., 2020), a             dataset to add the needed non-English examples.
12-layer with 384 hidden size architecture distilled    As detailed in Section 4.2, we simply use a spe-
from XLM-R Base multilingual (Conneau et al.,           cific prompt that corresponds to the target language
2020). With only 66M parameters, it is an order         (with N different prompts corresponding to the N
of magnitude smaller than state-of-the-art archi-       languages present in the dataset). In Table 1, we
tectures such as BERT-large or XLM-large. We            show how QGsynth+trans can generate questions in
used the official Microsoft implementation.5 For        the same language as the input. Further, the ques-
all the experiments—both QG and QA—we trained           tions seem much more relevant, coherent and fluent,
the model for 5 epochs, using the default hyper-        if compared to those produced by QGsynth : for the
parameters. We used a single nVidia gtx2080ti           Spanish paragraph, the question is well formed and
with 11G RAM, and the training times to circa 4         focused on the input answer; for the Chinese (see
and 2 hours amount respectively for Question Gen-       last row of Table 1 for QGsynth+trans ) is perfectly
eration and for Question Answering. To evaluate         written.
  5                                                       6
    Publicly available at https://github.com/               https://github.com/facebookresearch/
microsoft/unilm/tree/master/minilm.                     MLQA/blob/master/mlqa_evaluation_v1.py
Paragraph (EN) Peyton Manning became the first quarterback ever to lead two different teams to multiple Super Bowls.
  He is also the oldest quarterback ever to play in a Super Bowl at age 39. The past record was held by John Elway, who led
  the Broncos to victory in Super Bowl XXXIII at age 38 and is currently Denver’s Executive Vice President of Football
  Operations and General Manager.
  Answer Broncos
  QGsynth What team did John Elway lead to victory at age 38?
  QGsynth+trans (target language = en) What team did John Elway lead to win in the Super Bowl?
  Paragraph (ES) Peyton Manning se convirtió en el primer mariscal de campo de la historia en llevar a dos equipos
  diferentes a participar en múltiples Super Bowls. Además, es, con 39 años, el mariscal de campo más longevo de la
  historia en jugar ese partido. El récord anterior estaba en manos de John Elway (mánager general y actual vicepresidente
  ejecutivo para operaciones futbolísticas de Denver) que condujo a los Broncos a la victoria en la Super Bowl XXXIII a
  los 38 años de edad.
  Answer Broncos
  QGsynth Where did Peyton Manning condujo?
  QGsynth+trans (target language = es) Qué equipo ganó el récord anterior? (Which team won the previous record?)
  QGsynth+trans (target language = en) What team did Menning win in the Super Bowl?
  Paragraph (ZH) 培顿·曼宁成为史上首位带领两支不同球队多次进入超级碗的四分卫。他也以39 岁高龄参加超
  级碗而成为史上年龄最大的四分卫。过去的记录是由约翰·埃尔维保持的,他在38岁时带领野马队赢得第33 届
  超级碗,目前担任丹佛的橄榄球运营执行副总裁兼总经理
  Answer 野马队
  QGsynth What is the name for the name that the name is used?
  QGsynth+trans (target language = zh) 约翰·埃尔维在13岁时带领哪支球队赢得第33届超级碗? (Which team did John
  Elvey lead to win the 33rd Super Bowl at the age of 13?)
  QGsynth+trans (target language = en) What team won the 33th Super Bowl?

Table 1: Examples of questions generated by the different models on an XQuAD paragraph, in different languages.
For QGsynth+trans, we report the outputs given two target languages: that of the context, and English.

5.2 Question Answering

We report the main results of our experiments on XQuAD and MLQA in Table 2. The scores correspond to the average over all the different possible combinations of languages (de-de, de-ar, etc.).

English as Pivot Using English as pivot does not lead to good results. This may be due to the evaluation metrics being based on n-gram similarity. For extractive QA, the F1 and EM metrics measure the overlap between the predicted answer and the ground truth. Therefore, meaningful answers worded differently are penalized, a situation that is likely to occur because of the back-translation mechanism. Hence, automatic evaluation is challenging for this setup, and suffers from difficulties similar to those in text generation (Sulem et al., 2018). As an additional downside, it requires multiple translations at inference time. For these reasons, we decided not to explore this approach further.
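For reference, a simplified sketch of SQuAD-style Exact Match and token-level F1 (our code; the official MLQA script adapts normalisation per language, and the article stripping below applies to English only), which makes it clear why a correct answer worded differently scores poorly:

    # Simplified SQuAD-style EM and token-level F1.
    import re
    import string
    from collections import Counter

    def normalize(s: str) -> str:
        s = s.lower()
        s = re.sub(r"\b(a|an|the)\b", " ", s)  # drop English articles
        s = "".join(c for c in s if c not in string.punctuation)
        return " ".join(s.split())

    def exact_match(pred: str, gold: str) -> float:
        return float(normalize(pred) == normalize(gold))

    def f1(pred: str, gold: str) -> float:
        p, g = normalize(pred).split(), normalize(gold).split()
        common = Counter(p) & Counter(g)      # shared tokens
        overlap = sum(common.values())
        if overlap == 0:
            return 0.0
        precision, recall = overlap / len(p), overlap / len(g)
        return 2 * precision * recall / (precision + recall)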
                                                                 exclusively written in the same language. In the
Synthetic without translation (+synth) Com-                      Section 5.3, we discuss more in depth this phe-
pared to the MiniLM baseline, we observe a small                 nomenon.
increase in term of performance for MiniLM+synth
(the Exact Match increases from 29.5 to 33.1 on                  5.3    Discussion
XQuAD and from 26.0 to 27.5 on MLQA).                            Cross Lingual Generalisation To explore the
   During the self-supervised pre-training stage, the            models’ effectiveness in dealing with cross lingual
model was exposed to multilingual inputs. Yet,                   inputs, we report in Figure 1 the performance for
for a given input, the target language was always                our MiniLM+synth+trans setup, varying the number of
Model                              #Params   Trans.   XQuAD         MLQA
MiniLM (Wang et al., 2020)         66M       No       42.2 / 29.5   38.4 / 26.0
XLM (Hu et al., 2020)              340M      No       68.4 / 53.0   65.4 / 47.9
English as Pivot                   66M       Yes      -             36.1 / 23.0
MiniLM+synth                       66M       No       44.8 / 33.1   39.8 / 27.5
MiniLM+synth+trans                 66M       Yes      63.3 / 49.1   56.1 / 41.4
XLM+synth+trans                    340M      Yes      74.3 / 59.2   65.3 / 49.2

Table 2: Results (F1 / EM) of the different QA models on XQuAD and MLQA.

[Figure 1: three line plots, omitted here. Left: F1 score on MLQA for models trained with different amounts of synthetic data, in two setups: for All Languages, the synthetic questions are sampled among all 5 of MLQA's languages; for Not All Languages, the synthetic questions are sampled progressively from only one language, then two, ..., up to all 5 for the last point, which coincides with the All Languages setup. Middle: same as on the left, but evaluated on XQuAD. Right: the relative variation in performance (F1) for these models.]

Cross Lingual Generalisation To explore the models' effectiveness in dealing with cross-lingual inputs, we report in Figure 1 the performance of our MiniLM+synth+trans setup, varying the number of samples and the languages present in the synthetic data. The abscissa x corresponds to the progressively increasing number of synthetic samples used; x = 0 corresponds to the baseline MiniLM+trans, where the model has access only to the original English data from SQuADen. We explore two sampling strategies for the synthetic examples (sketched below):

1. All Languages corresponds to sampling the examples from any of the different languages.

2. Conversely, for Not All Languages, we progressively add the different languages: for x = 125K, all the 125K synthetic samples are on Spanish inputs; then German is added in addition to Spanish, then Hindi; finally, with Arabic, all the MLQA languages are present.
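A sketch of the two strategies, as referenced in the list above (our code; the language order follows the text, and the fifth language is an assumption):

    # A sketch of the two synthetic-data sampling strategies of Figure 1.
    import random

    def sample_synthetic(pools, n, num_langs, seed=0):
        """pools: dict lang -> synthetic QA examples; num_langs: how many
        languages to draw from (len(pools) reproduces All Languages)."""
        # Languages are added in the order described in the text:
        # Spanish, then German, Hindi, Arabic, ...
        order = ["es", "de", "hi", "ar", "zh"]  # fifth language: our guess
        rng = random.Random(seed)
        mixed = [ex for lang in order[:num_langs] for ex in pools[lang]]
        rng.shuffle(mixed)
        return mixed[:n]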
In Figure 1, we observe that the performance for All Languages increases largely at the beginning, then remains mostly stable. Conversely, we observe a gradual improvement for Not All Languages, as more languages are present in the training. It indicates that when all the languages are present in the synthetic data, the model immediately obtains a cross-lingual ability.

However, it seems that even with only one language pair present, the model is able to develop a cross-lingual ability that benefits, in part, other languages: at the right-hand side of Figure 1, we can see that most of the improvement happens given only one cross-lingual language pair (i.e. English and Spanish).

Unseen Languages To measure the benefit of our approach on unseen languages (i.e. languages not present in the synthetic data from MLQA/XQuAD), we test our models on three QA evaluation sets: PIAF (fr), KorQuAD and SQuAD-it (see Section 3.3). The results, reported in Table 3, are consistent with the previous experiments on MLQA and XQuAD. Our MiniLM+synth+trans model improves by more than 4 Exact Match points over its baseline, while XLM+synth+trans obtains a new state-of-the-art. Notably, our multilingual XLM+synth+trans outperforms CamemBERT on PIAF, while the latter is a pure monolingual, in-domain language model.
Model                              #Params   PIAF (fr)     KorQuAD       SQuAD-it
MiniLM                             66M       58.9 / 34.3   53.3 / 40.5   72.0 / 57.7
BERT-m (Croce et al., 2018)        110M      -             -             74.1 / 62.5
CamemBERT (Martin et al., 2020)    340M      68.9 / -      -             -
MiniLM+synth                       66M       58.6 / 34.5   52.1 / 39.0   71.3 / 58.0
MiniLM+synth+trans                 66M       63.9 / 40.6   60.0 / 48.8   74.5 / 62.0
XLM+synth+trans                    340M      72.1 / 47.1   63.0 / 52.8   80.4 / 67.6

Table 3: Zero-shot results (F1 / EM) on PIAF, KorQuAD and SQuAD-it for our different QA models, compared to
various baselines. Note that CamemBERT actually corresponds to a French version of RoBERTa, an architecture
widely outperforming BERT.

Hidden distillation effect The relative improvement of our best synthetic configuration, +synth+trans, over the baseline is superior to 60% EM for MiniLM (from 29.5 to 49.5 on XQuAD and from 26.0 to 41.4 on MLQA). It is far larger than for XLM, where the EM improved by +11.7% on XQuAD and +2.71% on MLQA. This indicates that XLM contains more cross-lingual transfer abilities than MiniLM. Since the latter is a distilled version of the former, we hypothesize that such abilities may have been lost during the distillation process. Such loss of generalisation can be difficult to identify, and opens questions for future work.

On target language control for text generation When relying on the translated versions of SQuAD, the target language for generating synthetic questions can easily be controlled, and results in fluent and relevant questions in the various languages (see Table 1). However, one limitation of this approach is that synthetic questions can only be generated in the languages that were available during training: the <Y> prompts are special tokens that are randomly initialised when fine-tuning QG on SQuADtrans,qg. Before fine-tuning, they are not semantically related to the name of the corresponding language ("English", "Español", etc.) and, thus, the learned representations for the <Y> tokens are limited to the languages present in the training set.

To the best of our knowledge, no method so far allows generalizing this target control to unseen languages. It would be valuable, for instance, to be able to generate synthetic data in Korean, French and Italian, without having to translate the entire SQuADen dataset into these three languages to then fine-tune the QG model.

To this purpose, we report, alas as a negative result, the following attempt: instead of controlling the target language with a special, randomly initialised token, we used a token semantically related to the language word: "English", "Español" for Spanish, or "中文" for Chinese. The intuition is that the model might adopt the correct language at inference, even for a target language unseen during training.8 A similar intuition has been explored in GPT-2 (Radford et al., 2019): the authors report an improvement for summarization when the input text is followed by "TL;DR" (i.e. Too Long; Didn't Read).

At inference time, we evaluated this approach on French with the prompt language=Français. Unfortunately, the model did not succeed in generating text in French.

Controlling the target language in the context of multilingual text generation remains underexplored, and progress in this direction could have direct applications to improve this work, and beyond.

[8] By unseen during training, we mean not present in the QG dataset; obviously, the language should have been present in the first self-supervised stage.
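To make the two prompting variants concrete, a sketch (our code; the exact token strings are assumptions):

    # The two ways of prompting the target language discussed above.
    def special_token_prompt(answer, context, lang):       # e.g. lang = "fr"
        # A randomly initialised special token: works, but only for
        # languages seen during QG fine-tuning.
        return f"<{lang}> <SEP> {answer} <SEP> {context}"

    def semantic_prompt(answer, context, lang_word):       # e.g. "Français"
        # Reuses the pretrained embeddings of the language name, hoping to
        # generalise to unseen languages; reported here as a negative
        # result (the model still did not generate French).
        return f"language={lang_word} <SEP> {answer} <SEP> {context}"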
6 Conclusion

In this work, we presented a method to generate synthetic QA data in a multilingual fashion, showed how QA models can benefit from it, and reported large improvements over the baselines. Our method contributes to filling the gap between English and other languages. Moreover, our method has been shown to generalize even to languages not present in the synthetic corpus (e.g. French, Italian, Korean).

In future work, we plan to investigate whether the proposed data augmentation method could be applied to other multilingual tasks, such as classification. We will also experiment more in depth with different strategies to control the target language of a model, and to extrapolate to unseen ones. We believe that this difficulty to extrapolate might highlight an important limitation of current language models.
Acknowledgments

Djamé Seddah was partly funded by the ANR project PARSITI (ANR-16-CE33-0021). Arij Riabi was partly funded by Benoît Sagot's chair in the PRAIRIE institute, as part of the French national agency ANR "Investissements d'avenir" programme (ANR-19-P3IA-0001).

References

Chris Alberti, Daniel Andor, Emily Pitler, Jacob Devlin, and Michael Collins. 2019. Synthetic QA corpora generation with roundtrip consistency. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6168-6173, Florence, Italy. Association for Computational Linguistics.

Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. 2020. On the cross-lingual transferability of monolingual representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4623-4637, Online. Association for Computational Linguistics.

Laurie Burchell, Jie Chi, Tom Hosking, Nina Markl, and Bonnie Webber. 2020. Querent intent in multi-sentence questions. arXiv preprint arXiv:2010.08980.

Alexander Buslaev, Vladimir I. Iglovikov, Eugene Khvedchenya, Alex Parinov, Mikhail Druzhinin, and Alexandr A. Kalinin. 2020. Albumentations: fast and flexible image augmentations. Information, 11(2):125.

Zewen Chi, Li Dong, Furu Wei, Wenhui Wang, Xian-Ling Mao, and Heyan Huang. 2020. Cross-lingual natural language generation via pre-training. In AAAI, pages 7570-7577.

Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. 2018. QuAC: Question answering in context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2174-2184, Brussels, Belgium. Association for Computational Linguistics.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440-8451, Online. Association for Computational Linguistics.

Danilo Croce, Alexandra Zelenanska, and Roberto Basili. 2018. Neural learning for question answering in Italian. In AI*IA 2018 - Advances in Artificial Intelligence, pages 389-402, Cham. Springer International Publishing.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, pages 13063-13075.

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2368-2378, Minneapolis, Minnesota. Association for Computational Linguistics.

Nan Duan, Duyu Tang, Peng Chen, and Ming Zhou. 2017. Question generation for question answering. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 866-874, Copenhagen, Denmark. Association for Computational Linguistics.

Germain Forestier, François Petitjean, Hoang Anh Dau, Geoffrey I. Webb, and Eamonn Keogh. 2017. Generating synthetic time series to augment sparse datasets. In 2017 IEEE International Conference on Data Mining (ICDM), pages 865-870. IEEE.

David Golub, Po-Sen Huang, Xiaodong He, and Li Deng. 2017. Two-stage synthesis networks for transfer learning in machine comprehension. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 835-844, Copenhagen, Denmark. Association for Computational Linguistics.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1693-1701.

Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. 2020. XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization. arXiv preprint arXiv:2003.11080.

Rachel Keraron, Guillaume Lancrenon, Mathilde Bras, Frédéric Allary, Gilles Moyse, Thomas Scialom, Edmundo-Pavel Soriano-Morales, and Jacopo Staiano. 2020. Project PIAF: Building a native French question-answering dataset. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 5481-5490, Marseille, France. European Language Resources Association.

Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, and Richard Socher. 2019. CTRL: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858.

Vishwajeet Kumar, Nitish Joshi, Arijit Mukherjee, Ganesh Ramakrishnan, and Preethi Jyothi. 2019. Cross-lingual training for automatic question generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4863-4872, Florence, Italy. Association for Computational Linguistics.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.

Patrick Lewis, Barlas Oguz, Ruty Rinott, Sebastian Riedel, and Holger Schwenk. 2020. MLQA: Evaluating cross-lingual extractive question answering. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7315-7330, Online. Association for Computational Linguistics.

Seungyoung Lim, Myungji Kim, and Jooyoul Lee. 2019. KorQuAD 1.0: Korean QA dataset for machine reading comprehension. arXiv preprint arXiv:1909.07005.

Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric de la Clergerie, Djamé Seddah, and Benoît Sagot. 2020. CamemBERT: a tasty French language model. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7203-7219, Online. Association for Computational Linguistics.

Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1999. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1-67.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383-2392, Austin, Texas. Association for Computational Linguistics.

Vasile Rus, Brendan Wyse, Paul Piwek, Mihai Lintean, Svetlana Stoyanchev, and Christian Moldovan. 2010. The first question generation shared task evaluation challenge. In Proceedings of the 6th International Natural Language Generation Conference.

Thomas Scialom, Benjamin Piwowarski, and Jacopo Staiano. 2019. Self-attention architectures for answer-agnostic neural question generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6027-6032, Florence, Italy. Association for Computational Linguistics.

Thomas Scialom, Serra Sinem Tekiroglu, Jacopo Staiano, and Marco Guerini. 2020. Toward stance-based personas for opinionated dialogues. arXiv preprint arXiv:2010.03369.

Robert F. Simmons. 1965. Answering English questions by computer: a survey. Communications of the ACM, 8(1):53-70.

Elior Sulem, Omri Abend, and Ari Rappoport. 2018. BLEU is not suitable for the evaluation of text simplification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 738-744, Brussels, Belgium. Association for Computational Linguistics.

Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2017. NewsQA: A machine comprehension dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 191-200, Vancouver, Canada. Association for Computational Linguistics.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353-355, Brussels, Belgium. Association for Computational Linguistics.

Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. 2020. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. arXiv preprint arXiv:2002.10957.

Qingyu Zhou, Nan Yang, Furu Wei, Chuanqi Tan, Hangbo Bao, and Ming Zhou. 2017. Neural question generation from text: A preliminary study. In National CCF Conference on Natural Language Processing and Chinese Computing, pages 662-671. Springer.