Zero-Shot Language Transfer vs Iterative Back Translation for Unsupervised Machine Translation

 

Aviral Joshi, Chengzhi Huang, Har Simrat Singh
(aviralj, chengzhh, harsimrs)@andrew.cmu.edu

Abstract

This work focuses on comparing different solutions for machine translation on low-resource language pairs, namely zero-shot transfer learning and unsupervised machine translation. We discuss how the data size affects the performance of both unsupervised MT and transfer learning. Additionally, we also look at how the domain of the data affects the result of unsupervised MT. The code for all the experiments performed in this project is accessible on Github.

1 Introduction

Unsupervised Machine Translation (UMT) is the task of translating sentences from a source language to a target language without the help of parallel data for training. UMT is especially useful when translating to and from low-resource languages, for example English to Nepali or vice versa. Since it is assumed that there is no access to parallel data between the two languages, unsupervised MT techniques focus on leveraging monolingual data to learn a translation model. More generally, there are three data availability scenarios that can dictate the choice of the training procedure and the translation model to be used. These scenarios are as follows:

1. We have no parallel data for X-Z but have a large monolingual corpus. Unsupervised Machine Translation is very well suited for this scenario.

2. We do not have parallel data for X-Z but we have parallel data for Y-Z and a good amount of monolingual data. Zero-shot language transfer can be more effective in this scenario.

3. Finally, we have some parallel data for X-Z and monolingual data for X and Z. In this scenario semi-supervised approaches that leverage both monolingual and parallel data can outperform the two other approaches mentioned.

Here X represents the source language (the language to transfer from), Z represents the target language (the language to transfer to) and Y represents a pivot language (an intermediate language) which is similar to the source language X. In a low-resource scenario there is little to no parallel data between X and Z.

The first two data availability scenarios mentioned above are generally more challenging due to the lack of parallel data between the source and target languages, and they are explored in greater detail in this work. Furthermore, our report also discusses the impact of a mismatch between the training and evaluation dataset domains and its effect on translation. Our report is structured as follows. We first introduce related work on unsupervised machine translation with back translation and language transfer, while also discussing works that detail the impact of domain mismatch between training and evaluation datasets for UMT, in Section 2. We then describe the experimental setup in Section 3. Next, we present the results of our experiments on low-resource language pairs with varying training data sizes and analyze the impact of out-of-domain datasets on the quality of translation in Section 4. Finally, we conclude with possible directions for future work in the domain of UMT for low-resource languages.

2 Related Work

2.1 Unsupervised Machine Translation Using Monolingual Data

Recent advancements in Machine Translation can be attributed to the development of training techniques that allow deep learning models to utilize large-scale monolingual data to improve translation performance while reducing the dependency on massive amounts of parallel data. Of the several attempts to leverage monolingual data, three techniques – back-translation, language modelling and auto-encoding – in particular stand out as the most effective and widely accepted strategies for doing so.

Sennrich et al. [20] first introduced the idea of iteratively improving the quality of a machine translation model in a semi-supervised setting by using "back-translation". The idea revolves around training an initial sub-optimal translation system with the help of the available parallel data and then using it to translate many sentences from a large monolingual corpus on the target side into the source language, generating augmented parallel data for training the original translation system.
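
This iterative procedure can be summarized in a short loop. The sketch below is a minimal Python illustration of the idea only; translate_fn and train_fn are hypothetical callables standing in for a real MT system's inference and training routines and are not part of the paper's code or of any particular library.

    # Minimal sketch of the iterative back-translation loop (an illustration of
    # the idea, not the authors' code). `translate_fn` and `train_fn` are
    # hypothetical stand-ins for a real MT system's inference and training.
    from typing import Callable, List, Sequence, Tuple

    def iterative_back_translation(
        mono_tgt: Sequence[str],                             # monolingual target-side sentences
        translate_fn: Callable[[Sequence[str]], List[str]],  # current tgt -> src translator
        train_fn: Callable[[List[Tuple[str, str]]], None],   # trains the src -> tgt system
        rounds: int = 3,
    ) -> None:
        for r in range(rounds):
            synthetic_src = translate_fn(mono_tgt)           # back-translate tgt -> src
            pairs = list(zip(synthetic_src, mono_tgt))       # (synthetic src, real tgt)
            train_fn(pairs)                                  # update the src -> tgt system
            print(f"round {r + 1}: {len(pairs)} synthetic sentence pairs")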

Gulcehre et al. [6] demonstrated an effective strategy for integrating language modelling as an auxiliary task with machine translation, improving translation quality for both high-resource and low-resource language pairs. A more recent work by Lample et al. [14] introduced a Cross-lingual Language Model (XLM) that utilizes a cross-lingual language modelling pretraining objective to improve translation performance.

Previous works have also looked at the task of translation from the perspective of an auto-encoder in order to utilize monolingual data. Cheng et al. [3] introduced an approach which utilizes an auto-encoder to reconstruct sentences in monolingual data, where the encoder translates the input from the source to the target language and the decoder translates the sentences back to the source language. Parallel data is utilized in conjunction with monolingual data to learn the translation system. This method was shown to outperform the original back-translation system proposed by Sennrich et al.

Unlike the previously described semi-supervised approaches, Lample et al. [15] introduce an effective strategy to perform completely unsupervised machine translation in the presence of a large monolingual corpus. Their work leverages a Denoising Auto-Encoder (DAE) to reconstruct the original sentences from systematically corrupted sentences drawn from monolingual corpora, training encoders and decoders for both the source and target language. To map the latent DAE representations of the two languages into the same space and to aid translation, the authors also introduce an additional adversarial loss term. The DAE, together with iterative back-translation, is used to train a completely unsupervised translation system.
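
As an illustration of the corruption step such a DAE is trained to undo, the following minimal sketch (our own illustration, not the authors' implementation) applies the two standard noise operations, random word dropping and a bounded local shuffle.

    # Sketch of the sentence-corruption step a denoising auto-encoder learns to
    # undo: random word dropping plus a bounded local shuffle. This illustrates
    # the general idea only and is not the authors' implementation.
    import random

    def corrupt(tokens, drop_prob=0.1, shuffle_k=3, seed=None):
        rng = random.Random(seed)
        kept = [t for t in tokens if rng.random() > drop_prob] or tokens[:1]  # drop words
        # Local shuffle: add bounded noise to each position and re-sort, so every
        # word moves by at most roughly `shuffle_k` positions.
        keys = [i + rng.uniform(0, shuffle_k) for i in range(len(kept))]
        return [t for _, t in sorted(zip(keys, kept))]

    print(corrupt("the cat sat on the mat".split(), seed=0))
    # The DAE is trained to reconstruct the original sentence from such corrupted input.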

2.2 Transfer Learning - Zero-Shot Translation

Even though UMT has been shown to perform well on similar language pairs with a good amount of monolingual data, Artetxe et al. argued that a scenario without any parallel data but with abundant monolingual data is unrealistic in practice. One common alternative in this scenario is to avoid using monolingual data and rely on zero-shot transfer learning. Zero-shot translation concerns the case where direct (src-tgt) parallel data is lacking but there is parallel data between the target language and a language similar to the source language.

Transfer learning was first proposed for NMT by Zoph et al., who leverage a high-resource parent model to initialize the low-resource child model. Nguyen and Chiang and Kocmi and Bojar utilize shared vocabularies for the source and target language to improve transfer learning and, in addition, mitigate the vocabulary mismatch by using cross-lingual word embeddings. While these methods show great performance in transfer learning for low-resource languages, they show limited effects on zero-shot transfer learning.

Multilinguality has been extensively studied and has been applied to the related problem of zero-shot translation (Johnson et al.; Firat et al.; Arivazhagan et al.). Liu et al. showed some initial results on multilingual unsupervised translation in the low-resource setting. Namely, pretrained mBART25 is finetuned on a transfer-language-to-target-language pair and is applied directly to the source-to-target pair. mBART obtained up to 34.1 BLEU (Nl-En) with zero-shot translation and works well when fine-tuning is conducted within the same language family. Specifically, it achieves 23 BLEU points for Ro-En using Cs as the transfer language, and 18.9 BLEU points for Ne-En using Hi as the transfer language.

2.3 Unsupervised Machine Translation on Out-of-Domain Datasets

Owing to the increasing popularity of unsupervised MT and the techniques involved, adapting UMT to translate data from domains where a large amount of parallel data is not obtainable is also gaining importance. One of the main issues associated with this is the feasibility of using pre-trained models. Most pre-trained models, even those trained under a supervised learning paradigm, fail to achieve significant translation accuracy on a domain which was non-existent in the training data. Unsupervised models also follow the same pattern. Marchisio et al. [18] show some significant results and comment on the domain similarity of test and training data in state-of-the-art models. The authors use the United Nations Parallel Corpus (UN) [22], News Crawl and CommonCrawl datasets for their work. The model was trained solely on NewsCrawl while being evaluated on all three datasets. A general trend of reduced performance was observed across various language pairs. Extensive work has been done to address this problem. Dou et al. [4] propose a domain-aware feature embedding based approach and achieve improvements over previously existing methods. Luong and Manning [17] adopt an attention based approach to fine-tune a pre-trained model on out-of-domain data. Sennrich et al. [20] show that for out-of-domain data, back-translation helps the model learn a large number of words previously non-existent in the vocabulary, which could be the reason why out-of-domain back-translation based UMT fails. Our experiments follow a similar line of work and show that fine-tuning a pre-trained translation model does not effectively help with out-of-domain data.

3 Experiments

3.1 Proposed Experiments

In our experiments, we focus on how the size of the training data affects the performance of both zero-shot transfer learning and unsupervised machine translation with iterative back translation. Additionally, we conduct an extensive empirical evaluation of unsupervised machine translation using dissimilar domains for the monolingual training data of the source and target languages, and with training and test data drawn from different datasets.

1. Unsupervised MT using iterative back translation: Our first experiment involves UMT using back translation (BT) with the XLM model. Here we demonstrate the feasibility of UMT with BT for low-resource language pairs and also how variation in the monolingual data size affects the translation quality for similar and dissimilar language pairs.

2. Zero-Shot Language Transfer: Our second experiment involves zero-shot translation using the mBART model. We finetune on a related (transfer) language pair and apply the model directly to the source-target language pair. Here we illustrate how the size of the finetuning data affects both the finetuning scores (finetuned on the transfer language → En and tested on the same pair) and the transferring scores (finetuned on the transfer language → En and tested on the source language → En).

3. Domain Dissimilarity: Our third experiment analyses the performance of a model which is trained on one dataset and tested on sentences from an entirely different domain. We use the XLM model and perform back translation on pre-trained models for all language pairs X → En to fine-tune them on the required dataset. The back translation steps include X → En and En → X. The model is then tested on the previously mentioned steps using a dataset sourced from a completely different domain.

3.2 Models

mBART [16] is used for the experiments with zero-shot transfer learning; XLM [14] is used for the experiments with unsupervised machine translation.

1. mBART is the first method for pre-training a complete sequence-to-sequence model by denoising full texts in multiple languages. Pre-training is done on the CC25 dataset and the pre-trained model can be directly fine-tuned for both supervised and unsupervised MT. It also enables a smooth zero-shot transfer to language pairs with no bi-text data. In our experiments, we use mBART25, which is pre-trained on all 25 languages. Ideally, for a fair comparison, unsupervised BT should also be done using mBART25; however, the authors did not provide their implementation for BT, so we shift to the XLM model for the experiments with BT since its implementation was more accessible.

2. XLM is a cross-lingual language model developed for both supervised and unsupervised machine translation tasks. The XLM model uses a shared sub-word vocabulary and is trained on three different language modelling objectives: causal language modeling, masked language modeling and translation language modeling. The model is essentially Transformer based, with 1024 units per layer and 8 attention heads. The cross-lingual XLM is trained on 15 languages and is shown to achieve state-of-the-art results across these languages. Our motivation behind using XLM was that a pretrained XLM model was available for the languages of our choice. It supported two-way back-translation and was comparatively easier to train and fine-tune than most other multilingual models.

3.3 Dataset

Two language pairs were chosen for the experiments in which we investigate the impact of the size of the monolingual training data on translation quality between a similar and a dissimilar language pair: Ro (Romanian) - En (English) and Ne (Nepali) - En, respectively. For Ro and En, data from Wikipedia was used. We perform random sampling to obtain monolingual sentences from the monolingual data provided by the authors of the XLM model (this was originally obtained by scraping articles from Wikipedia). For Nepali, the monolingual data was obtained from the FLoRes data corpus.
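
For the data-size ablation we draw fixed-size subsets from these monolingual corpora. A minimal sketch of this subsampling step is shown below; the file paths are placeholders rather than the actual corpus locations.

    # Sketch of drawing fixed-size monolingual subsets for the data-size ablation.
    # The file paths are placeholders; the actual corpora are the Wikipedia-based
    # monolingual data released with XLM (Ro, En) and the FLoRes data (Ne).
    import random

    def sample_monolingual(in_path, out_path, n, seed=42):
        with open(in_path, encoding="utf-8") as f:
            sentences = [line.strip() for line in f if line.strip()]
        random.Random(seed).shuffle(sentences)
        with open(out_path, "w", encoding="utf-8") as f:
            f.write("\n".join(sentences[:n]) + "\n")

    for size in (10_000, 100_000, 300_000, 600_000):
        sample_monolingual("mono.ro.txt", f"mono.ro.{size}.txt", size)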

For zero-shot transfer learning, we conducted experiments on Ro to En with Cs (Czech) as the transfer language and on Ne to En with Hi (Hindi) as the transfer language. The Cs-En parallel training data is taken from Europarl v7 [11], with 668,595 sentence pairs; the Ro-En test data is taken from the WMT16 test set for Romanian and English, consisting of 1999 sentence pairs. The Hi-En parallel training data is taken from the IIT Bombay English-Hindi corpus [13], with 1,609,682 sentence pairs; the Ne-En test data is taken from [7], with 2835 test sentences.

To analyse the effect of a domain mismatch between training and testing, we use Fr (French), De (German) and Ro (Romanian). In all these experiments the target language is English (En), and back translation is performed both ways. The choice of these specific languages was motivated by the criteria of language similarity and data availability. French and German are related to English, albeit not closely, but still possess a similarity in terms of language family and closeness of vocabulary. Romanian was chosen as a low-resource language distantly related to English. To simulate the domain mismatch, the XLM model was pretrained on WMT'16, which mostly contains parallel data extracted from news articles. The model was then fine-tuned on monolingual data of the four languages obtained from the Tatoeba challenge [21]. For French, English and German, 50K monolingual sentences were used, and for Romanian only 1999 sentences were available. The Tatoeba dataset has sentences with minimal overlap with any other available machine translation dataset; we used the sentences extracted from Wikipedia pages present within the dataset. The validation and test sets were extracted from WMT'16.

3.3.1 Experiment Details

We use the following tokenizers for both preprocessing the training data and evaluation in all three experiments (see the sketch after this list):

• Ne, Hi: We use the Indic-NLP Library to tokenize both Indic-language inputs and outputs [12].

• Ro, Cs, En: We apply Moses tokenization and special normalization for English, Czech and Romanian texts.
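
The exact preprocessing scripts are not reproduced in this report; the sketch below uses the pip-installable indic-nlp-library and sacremoses packages as reasonable stand-ins for the Indic-NLP and Moses tokenizers (the original Moses perl scripts behave analogously).

    # Sketch of the tokenization step, assuming indic-nlp-library and sacremoses
    # as stand-ins for the Indic-NLP and Moses tokenizers.
    from indicnlp.tokenize import indic_tokenize                  # pip install indic-nlp-library
    from sacremoses import MosesPunctNormalizer, MosesTokenizer   # pip install sacremoses

    def tokenize_indic(text, lang):          # lang in {"ne", "hi"}
        return " ".join(indic_tokenize.trivial_tokenize(text, lang=lang))

    def tokenize_moses(text, lang):          # lang in {"ro", "cs", "en"}
        normalized = MosesPunctNormalizer(lang=lang).normalize(text)
        return MosesTokenizer(lang=lang).tokenize(normalized, return_str=True)

    print(tokenize_moses("A low-resource MT example, tokenized for evaluation.", "en"))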

We conduct all the transfer learning experiments on an AWS g4dn.xlarge instance, with max-tokens 512, total-num-update 80000 and dropout 0.3. At inference time, the beam size is set to 5.

Similar to the transfer learning experiments, all back translation experiments were conducted on a g4dn.xlarge instance; the batch size was set to 32 and multiple iterations of back translation were run. The number of tokens per batch was set to a maximum of 1500 and an embedding dimension of 1024 was used. Other hyper-parameters were kept the same as in the implementation provided by the authors of XLM.

For the domain mismatch experiment we use the XLM model as well. We follow the Byte Pair Encoding technique used by the authors of XLM and use fastBPE for it. Tokenization is done with the Moses tokenizer. These experiments were conducted on AWS p3.2xlarge instances. The model uses 6 transformer layers with 8 heads and an embedding dimension of 1024.
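
As an illustration of this preprocessing step, the snippet below applies pre-learned BPE codes with fastBPE's Python bindings; the codes and vocabulary file names are placeholders for the files distributed with the pretrained XLM model.

    # Sketch of applying XLM-style BPE with the fastBPE Python bindings. The
    # codes/vocabulary file names are placeholders for the files shipped with
    # the pretrained XLM model.
    import fastBPE  # pip install fastBPE

    bpe = fastBPE.fastBPE("bpe_codes", "bpe_vocab")      # placeholder file names
    moses_tokenized = "a moses - tokenized english sentence"
    print(bpe.apply([moses_tokenized]))                  # sub-word units joined with "@@ "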

3.4 Evaluation Metrics

For all of our tasks, we use the BLEU score calculated by sacrebleu as the automatic metric to evaluate translation performance. Specifically, we compute the BLEU scores over tokenized text for both the hypothesis outputs and the references, as in Liu et al.
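
A minimal sketch of this scoring setup with the sacrebleu Python API is shown below; passing tokenize="none" prevents sacrebleu from re-tokenizing the already tokenized hypotheses and references. The sentences here are toy examples, not outputs from our models.

    # Sketch of the BLEU computation with sacrebleu over already-tokenized text.
    import sacrebleu

    hypotheses = ["the cat sat on the mat ."]            # tokenized system outputs
    references = ["the cat sat on the mat ."]            # tokenized references
    bleu = sacrebleu.corpus_bleu(hypotheses, [references], tokenize="none")
    print(f"BLEU = {bleu.score:.1f}")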

4 Results and Discussions

In this section we describe the results of the experiments proposed in Section 3 and discuss our findings to illustrate the impact of dataset size and dataset domain for unsupervised machine translation, and of data size for transfer learning.

4.1 Back Translation

As expected with most deep learning approaches, the trained model scales with an increase in (monolingual) training data: the BLEU scores show a general upward trend as more data is made accessible to the XLM model. This is evident in the results shown in Table 1, especially for the high-similarity English-Romanian language pair. In contrast, the performance on the low-similarity English-Nepali pair is degenerate in all data availability scenarios, even when the entire training corpus is provided to the model. This result indicates that it might not be feasible to perform unsupervised MT with back translation alone when the language pair has little or no similarity. In that case, other approaches to unsupervised MT, such as language transfer, are better suited. Language transfer is discussed further in Section 4.2.

                  High Similarity     Low Similarity
    Data Size     en→ro    en←ro      en→ne    en←ne
    10K            2.76     3.93       0.0      0.0
    100K          23.6     23.6        0.0      0.0
    300K          27.7     26.9        0.1      0.1
    600K          28.8     28.3        0.1      0.2
    >10M          33.3     31.8        0.1      0.5

Table 1: Comparison of translation quality (BLEU, back translation) between the similar and dissimilar language pairs and how they scale with an increase in monolingual data.

A big improvement in BLEU score is seen when the training data size is increased from 10,000 to 100,000 sentences for the high-similarity language pair. The performance then increases steadily as more data is provided to the model, while the rate of increase slows down. A reasonable BLEU score for the English-Romanian language pair was reached with just 600,000 sentences from each language.

4.2 Transfer Learning

Zero-shot transfer learning for mBART [16] works as follows: pretrained mBART is finetuned on a related language Y to the target En and is directly applied for testing on the source X to the target language En.
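
At inference time this amounts to decoding the source language with a checkpoint that was finetuned only on the transfer pair. The sketch below shows an analogous, hypothetical setup using the Hugging Face mBART interface (the experiments themselves appear to follow the fairseq mBART setup); "mbart25-ft-cs-en" is a placeholder for a checkpoint finetuned on Cs → En.

    # Hypothetical sketch of zero-shot transfer at inference time with the
    # Hugging Face mBART interface; "mbart25-ft-cs-en" stands for a checkpoint
    # that was finetuned only on Cs -> En bi-text.
    from transformers import MBartForConditionalGeneration, MBartTokenizer

    model = MBartForConditionalGeneration.from_pretrained("mbart25-ft-cs-en")  # hypothetical path
    tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-cc25",
                                               src_lang="ro_RO", tgt_lang="en_XX")

    batch = tokenizer(["Aceasta este o propoziție de test."], return_tensors="pt")
    generated = model.generate(**batch,
                               decoder_start_token_id=tokenizer.lang_code_to_id["en_XX"],
                               num_beams=5)              # beam size 5, as in our experiments
    print(tokenizer.batch_decode(generated, skip_special_tokens=True))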

To investigate how the size of the finetuning data affects the zero-shot BLEU points, we randomly sample 1K, 10K, 100K and 300K sentence pairs from the Europarl v7 [11] Cs-En parallel training corpus and test the finetuned model directly on the Ro-En test data. We also randomly sample 10K, 100K, 300K and 600K sentence pairs from the IIT Bombay English-Hindi corpus [13] and test the finetuned model directly on the Ne-En test data. We report the BLEU points of the finetuned models on both the source-target language pairs (X-En) and the related-target language pairs (Y-En) to better observe the correlation between them. Each experiment is trained using the same hyper-parameters.

    Data Size    Transfer Ro→En    Finetune Cs→En
    0                 0.0               0.0
    1K               17.6              16.2
    10K              23.3              21.9
    100K             23.0              25.7
    300K             20.3              26.5
    668K             14.5              27.5

Table 2: BLEU points for transfer learning and direct finetuning at different data sizes. Pretrained mBART25 is finetuned on Cs to En and directly transferred to Ro to En; we choose the best checkpoint weights validated on the Cs to En dev set.

[Figure 1: Figures depicting the variation of translation quality with the availability of monolingual data for English and Romanian (panels (a) and (b)).]

    Data Size    Transfer Ne→En    Finetune Hi→En
    0                 0.0               0.0
    10K              11.7              12.8
    100K             14.0              18.7
    300K             13.6              19.1
    600K             13.2              19.3
    1.68M            15.1              23.4

Table 3: BLEU points for transfer learning and direct finetuning at different data sizes. Pretrained mBART25 is finetuned on Hi to En and directly transferred to Ne to En; we choose the best checkpoint weights validated on the Hi to En dev set.

It is worth noting that for the model finetuned on Cs-En, we observed a generally negative correlation between the transferring scores and the finetuning scores, as well as a negative correlation between the data size and the transferring scores. In contrast, for the model finetuned on Hi-En there is a positive correlation between the transferring scores and the finetuning scores and a positive correlation between the data size and the transferring scores. One hypothesis is that even though mBART achieves its highest transferring scores for Ro-En when finetuned on Cs-En, and both languages belong to the Indo-European language family, they are still vastly different languages under the orthographic distance metric developed by Heeringa et al. [8], with Czech being in the Slavic language sub-group and Romanian in the Romance language sub-group. Hence, finetuning on a larger dataset makes mBART overfit towards the finetuned language, which hurts the performance of zero-shot translation on the source-target pair due to the limited capacity of the mBART model itself. Ne and Hi, on the other hand, are more closely related to each other and have a large overlap in their vocabularies, so finetuning on Hi greatly benefits the performance of zero-shot translation.

4.3 Comparison between Back Translation and Zero-Shot Transfer Learning

When does back translation work? In Table 1, we do not see much increase in performance when we increase the size of the monolingual data when training on the dissimilar language pair (Ne and En). If high-quality bi-text with a similar language exists, language transfer is able to beat the conventional BT-based methods. For the similar language pair (Ro and En), we do see a gain in BLEU points as the data size increases, and BT can easily beat zero-shot transfer learning with 100K monolingual sentences.

When does transfer learning work? Note that in Table 2 the performance of zero-shot transfer learning suffers from "the curse of data size": simply increasing the size of the finetuning data does not improve the transferring scores, especially when the finetuning language (Cs) is not close enough to the source language (Ro). Romanian is closest to Catalan [8]; however, not much bi-text exists between Catalan and English, and Catalan is also an out-of-vocabulary language for mBART, so it does not benefit much from multilingual pretraining. On the contrary, for languages in the Indic language family, an increase in data size does help both the transferring scores and the direct finetuning scores. Compared with back translation, zero-shot transfer learning outperforms it with just a small amount of bi-text for Hi to En.

                          Training Dataset
    Language Pair     WMT'16    Tatoeba (Wikipedia)
    Fr → En            32.8          27.38
    De → En            34.3          27.44
    Ro → En            18.3           1.74
    En → Fr            33.4          28.33
    En → De            26.4          21.93
    En → Ro            18.9           0.39

Table 4: Domain mismatch.

4.4 Domain Dissimilarity

We present the results of the domain mismatch experiments in Table 4. As expected, there is a consistent decrease in the performance of the model across all languages. The decrease is comparatively smaller for Fr and De than for Ro. French and German are both high-resource languages, and the XLM model for both of these languages was pre-trained as well as finetuned on 50K sentences, whereas for Romanian only 1999 sentences were available in the Tatoeba challenge, and the validation and test data were also smaller. Using back translation in this scenario did not prove to be ideal. This also suggests that even for high-resource languages the domain of the dataset is a big factor in determining the overall translation efficacy of the model. XLM uses a shared sub-word vocabulary, which means there are many words that are used differently in different domains and are thus harder for the model to predict. The results for Romanian could have been better if transfer learning had been used as the UMT technique. The results are particularly interesting because for low-resource languages like Romanian, obtaining domain-specific parallel data is very difficult, and adapting a pre-trained model does not help in the same way as it does for high-resource languages. We wish to explore this further with some additional experiments as well.

5 Conclusion and Future Work

In this work, we compare two different solutions for low-resource language MT: unsupervised MT with BT and zero-shot transfer learning. We focus on how the data size affects the performance of both unsupervised MT and transfer learning, and on how the domain matters in unsupervised MT.

For transfer learning, we find that if the model is transferred from a "not so similar" language, like Cs → En in our experiment, the performance of transfer learning cannot be scaled up by simply increasing the size of the finetuning data. In contrast, for closely related languages, like those in the Indic language family, we observe the opposite trend. However, the capacity of the mBART model has not yet been well investigated, and we still do not have a clear definition of the similarity between the transfer language and the target language. These could be potential topics for future research.

Regarding domain mismatch, we wish to explore whether transfer learning is a better UMT technique for adapting a model to a different domain in the case of low-resource languages. We also want to run more experiments to further solidify and generalize our findings. There has been considerable research in this field, with attempts to learn domain-specific embeddings or to make model-based changes in order to generalize across mismatching domains. We hope to use some of these existing methods to boost the performance of models, specifically for low-resource languages.

References

[1] Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Roee Aharoni, Melvin Johnson, and Wolfgang Macherey. The missing ingredient in zero-shot neural machine translation, 2019.

[2] Mikel Artetxe, Sebastian Ruder, Dani Yogatama, Gorka Labaka, and Eneko Agirre. A call for more rigor in unsupervised cross-lingual learning, 2020.

[3] Yong Cheng. Semi-supervised learning for neural machine translation. In Joint Training for Neural Machine Translation, pages 25–40. Springer, 2019.

[4] Zi-Yi Dou, Junjie Hu, Antonios Anastasopoulos, and Graham Neubig. Unsupervised domain adaptation for neural machine translation with domain-aware feature embeddings, 2019.

[5] Orhan Firat, Baskaran Sankaran, Yaser Al-Onaizan, Fatos T. Yarman Vural, and Kyunghyun Cho. Zero-resource translation with multi-lingual neural machine translation, 2016.

[6] Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loic Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. On using monolingual corpora in neural machine translation. arXiv preprint arXiv:1503.03535, 2015.

[7] Francisco Guzmán, Peng-Jen Chen, Myle Ott, Juan Pino, Guillaume Lample, Philipp Koehn, Vishrav Chaudhary, and Marc'Aurelio Ranzato. Two new evaluation datasets for low-resource machine translation: Nepali-English and Sinhala-English. 2019.

[8] Wilbert Heeringa, Jelena Golubovic, Charlotte Gooskens, Anja Schüppert, Femke Swarte, and Stefanie Voigt. Lexical and orthographic distances between Germanic, Romance and Slavic languages and their relationship to geographic distance, pages 99–137. P.I.E. - Peter Lang, 2013.

[9] Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351, 2017. doi: 10.1162/tacl_a_00065. URL https://www.aclweb.org/anthology/Q17-1024.

[10] Tom Kocmi and Ondřej Bojar. Trivial transfer learning for low-resource neural machine translation. Proceedings of the Third Conference on Machine Translation: Research Papers, 2018. doi: 10.18653/v1/W18-6325. URL http://dx.doi.org/10.18653/v1/W18-6325.

[11] Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Conference Proceedings: the Tenth Machine Translation Summit, pages 79–86, Phuket, Thailand, 2005. AAMT. URL http://mt-archive.info/MTS-2005-Koehn.pdf.

[12] Anoop Kunchukuttan. The IndicNLP Library. https://github.com/anoopkunchukuttan/indic_nlp_library/blob/master/docs/indicnlp.pdf, 2020.

[13] Anoop Kunchukuttan, Pratik Mehta, and Pushpak Bhattacharyya. The IIT Bombay English-Hindi parallel corpus. CoRR, abs/1710.02855, 2017. URL http://arxiv.org/abs/1710.02855.

[14] Guillaume Lample and Alexis Conneau. Cross-lingual language model pretraining, 2019.

[15] Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. Unsupervised machine translation using monolingual corpora only. arXiv preprint arXiv:1711.00043, 2017.

[16] Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. Multilingual denoising pre-training for neural machine translation, 2020.

[17] Minh-Thang Luong and Christopher D. Manning. Stanford neural machine translation systems for spoken language domains. In Proceedings of the International Workshop on Spoken Language Translation, pages 76–79, 2015.

[18] Kelly Marchisio, Kevin Duh, and Philipp Koehn. When does unsupervised machine translation work?, 2020.

[19] Toan Q. Nguyen and David Chiang. Transfer learning across low-resource, related languages for neural machine translation, 2017.

[20] Rico Sennrich, Barry Haddow, and Alexandra Birch. Improving neural machine translation models with monolingual data, 2016.

[21] Jörg Tiedemann. The Tatoeba Translation Challenge – realistic data sets for low resource and multilingual MT, 2020.

[22] Michał Ziemski, Marcin Junczys-Dowmunt, and Bruno Pouliquen. The United Nations parallel corpus v1.0. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 3530–3534, Portorož, Slovenia, May 2016. European Language Resources Association (ELRA). URL https://www.aclweb.org/anthology/L16-1561.

[23] Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. Transfer learning for low-resource neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1568–1575, Austin, Texas, November 2016. Association for Computational Linguistics. doi: 10.18653/v1/D16-1163. URL https://www.aclweb.org/anthology/D16-1163.