Zero-Shot Language Transfer vs Iterative Back Translation for Unsupervised Machine Translation

 

Aviral Joshi, Chengzhi Huang, Har Simrat Singh
(aviralj, chengzhh, harsimrs)@andrew.cmu.edu

Abstract

This work focuses on comparing different solutions for machine translation on low-resource language pairs, namely zero-shot transfer learning and unsupervised machine translation. We discuss how the data size affects the performance of both unsupervised MT and transfer learning. Additionally, we also look at how the domain of the data affects the result of unsupervised MT. The code for all the experiments performed in this project is accessible on Github.

1 Introduction

Unsupervised Machine Translation (UMT) is the task of translating sentences from a source language to a target language without the help of parallel data for training. UMT is especially useful when translating to and from low-resource languages, for example English to Nepali or vice versa. Since it is assumed that there is no access to parallel data between the two languages, unsupervised MT techniques focus on leveraging monolingual data to learn a translation model. More generally, there are three data availability scenarios that can dictate the choice of the training procedure and the translation model to be used. These scenarios are as follows:

1. We have no parallel data for X-Z but have a large monolingual corpus. Unsupervised Machine Translation is very well suited for this scenario.

2. We do not have parallel data for X-Z but we have parallel data for Y-Z and a good amount of monolingual data. Zero-shot language transfer can be more effective in this scenario.

3. Finally, we have some parallel data for X-Z and monolingual data for X and Z. In this scenario semi-supervised approaches that leverage both monolingual and parallel data can outperform the two other approaches mentioned.

Here X represents the source language (the language to transfer from), Z represents the target language (the language to transfer to) and Y represents a pivot language (an intermediate language) which is similar to the source language X. In a low-resource scenario there is little to no parallel data between X and Z.

The first two data availability scenarios mentioned above are generally more challenging due to the lack of parallel data between the source and target languages, and they are explored in greater detail in this work. Furthermore, our report also discusses the impact of a mismatch between the training and evaluation dataset domains and its effect on translation. Our report is structured as follows. We first introduce related work on unsupervised machine translation with back translation and language transfer, while also discussing works that detail the impact of domain mismatch between training and evaluation datasets for UMT, in Section 2. We then describe the experimental setup in Section 3. Next, we present the results of our experiments on low-resource language pairs with varying training data sizes and analyze the impact of out-of-domain datasets on the quality of translation in Section 4. Finally, we conclude with possible directions for future work in the domain of UMT for low-resource languages.

2 Related Work

2.1 Unsupervised Machine Translation Using Monolingual Data

Recent advancements in Machine Translation can be attributed to the development of training techniques that allow deep learning models to utilize large-scale monolingual data to improve translation performance while reducing the dependency on massive amounts of parallel data. Of the several attempts to leverage monolingual data, three techniques – back-translation, language modelling and auto-encoding – in particular stand out as the most effective and widely accepted strategies for doing so.

Sennrich et al. [20] first introduced the idea of iteratively improving the quality of a machine translation model in a semi-supervised setting by using "back-translation". The idea revolves around training an initial sub-optimal translation system with the help of the available parallel data and then using it to translate many sentences from a large monolingual corpus on the target side into the source language, generating augmented parallel data for training the original translation system.
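
This iterative procedure can be summarized in a short loop. The sketch below is a minimal Python illustration of the idea only; translate_fn and train_fn are hypothetical callables standing in for a real MT system's inference and training routines and are not part of the paper's code or of any particular library.

    # Minimal sketch of the iterative back-translation loop (an illustration of
    # the idea, not the authors' code). `translate_fn` and `train_fn` are
    # hypothetical stand-ins for a real MT system's inference and training.
    from typing import Callable, List, Sequence, Tuple

    def iterative_back_translation(
        mono_tgt: Sequence[str],                             # monolingual target-side sentences
        translate_fn: Callable[[Sequence[str]], List[str]],  # current tgt -> src translator
        train_fn: Callable[[List[Tuple[str, str]]], None],   # trains the src -> tgt system
        rounds: int = 3,
    ) -> None:
        for r in range(rounds):
            synthetic_src = translate_fn(mono_tgt)           # back-translate tgt -> src
            pairs = list(zip(synthetic_src, mono_tgt))       # (synthetic src, real tgt)
            train_fn(pairs)                                  # update the src -> tgt system
            print(f"round {r + 1}: {len(pairs)} synthetic sentence pairs")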

Gulcehre et al. [6] demonstrated an effective strategy for integrating language modelling as an auxiliary task with machine translation, improving translation quality for both high-resource and low-resource language pairs. A more recent work by Lample et al. [14] introduced a Cross-lingual Language Model (XLM) that utilizes a cross-lingual language modelling pretraining objective to improve translation performance.

Previous works have also looked at the task of translation from the perspective of an auto-encoder in order to utilize monolingual data. Cheng et al. [3] introduced an approach which utilizes an auto-encoder to reconstruct sentences in monolingual data, where the encoder translates the input from the source to the target language and the decoder translates the sentences back to the source language. Parallel data is utilized in conjunction with monolingual data to learn the translation system. This method was shown to outperform the original back-translation system proposed by Sennrich et al.

Unlike the previously described semi-supervised approaches, Lample et al. [15] introduce an effective strategy to perform completely unsupervised machine translation in the presence of a large monolingual corpus. Their work leverages a Denoising Auto-Encoder (DAE) to reconstruct the original sentences from systematically corrupted sentences drawn from monolingual corpora, training encoders and decoders for both the source and target language. To map the latent DAE representations of the two languages into the same space and to aid translation, the authors also introduce an additional adversarial loss term. The DAE, together with iterative back-translation, is used to train a completely unsupervised translation system.
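
As an illustration of the corruption step such a DAE is trained to undo, the following minimal sketch (our own illustration, not the authors' implementation) applies the two standard noise operations, random word dropping and a bounded local shuffle.

    # Sketch of the sentence-corruption step a denoising auto-encoder learns to
    # undo: random word dropping plus a bounded local shuffle. This illustrates
    # the general idea only and is not the authors' implementation.
    import random

    def corrupt(tokens, drop_prob=0.1, shuffle_k=3, seed=None):
        rng = random.Random(seed)
        kept = [t for t in tokens if rng.random() > drop_prob] or tokens[:1]  # drop words
        # Local shuffle: add bounded noise to each position and re-sort, so every
        # word moves by at most roughly `shuffle_k` positions.
        keys = [i + rng.uniform(0, shuffle_k) for i in range(len(kept))]
        return [t for _, t in sorted(zip(keys, kept))]

    print(corrupt("the cat sat on the mat".split(), seed=0))
    # The DAE is trained to reconstruct the original sentence from such corrupted input.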

2.2 Transfer Learning - Zero-Shot Translation

Even though UMT has been shown to perform well on similar language pairs with a good amount of monolingual data, Artetxe et al. argued that a scenario without any parallel data but with abundant monolingual data is unrealistic in practice. One common alternative in this scenario is to avoid using monolingual data and rely on zero-shot transfer learning. Zero-shot translation concerns the case where direct (src-tgt) parallel data is lacking but there is parallel data between the target language and a language similar to the source language.

Transfer learning was first proposed for NMT by Zoph et al., who leverage a high-resource parent model to initialize the low-resource child model. Nguyen and Chiang and Kocmi and Bojar utilize shared vocabularies for the source and target language to improve transfer learning and, in addition, mitigate the vocabulary mismatch by using cross-lingual word embeddings. While these methods show great performance in transfer learning for low-resource languages, they show limited effects on zero-shot transfer learning.

Multilinguality has been extensively studied and has been applied to the related problem of zero-shot translation (Johnson et al.; Firat et al.; Arivazhagan et al.). Liu et al. showed some initial results on multilingual unsupervised translation in the low-resource setting. Namely, pretrained mBART25 is finetuned on a transfer-language-to-target-language pair and is applied directly to the source-to-target pair. mBART obtained up to 34.1 BLEU (Nl-En) with zero-shot translation and works well when fine-tuning is conducted within the same language family. Specifically, it achieves 23 BLEU points for Ro-En using Cs as the transfer language, and 18.9 BLEU points for Ne-En using Hi as the transfer language.

2.3 Unsupervised Machine Translation on Out-of-Domain Datasets

Owing to the increasing popularity of unsupervised MT and the techniques involved, adapting UMT to translate data from domains where a large amount of parallel data is not obtainable is also gaining importance. One of the main issues associated with this is the feasibility of using pre-trained models. Most pre-trained models, even those trained under a supervised learning paradigm, fail to achieve significant translation accuracy on a domain which was non-existent in the training data. Unsupervised models also follow the same pattern. Marchisio et al. [18] show some significant results and comment on the domain similarity of test and training data in state-of-the-art models. The authors use the United Nations Parallel Corpus (UN) [22], News Crawl and CommonCrawl datasets for their work. The model was trained solely on NewsCrawl while being evaluated on all three datasets. A general trend of reduced performance was observed across various language pairs. Extensive work has been done to address this problem. Dou et al. [4] propose a domain-aware feature embedding based approach and achieve improvements over previously existing methods. Luong and Manning [17] adopt an attention based approach to fine-tune a pre-trained model on out-of-domain data. Sennrich et al. [20] show that for out-of-domain data, back-translation helps the model learn a large number of words previously non-existent in the vocabulary, which could be the reason why out-of-domain back-translation based UMT fails. Our experiments follow a similar line of work and show that fine-tuning a pre-trained translation model does not effectively help with out-of-domain data.

3 Experiments

3.1 Proposed Experiments

In our experiments, we focus on how the size of the training data affects the performance of both zero-shot transfer learning and unsupervised machine translation with iterative back translation. Additionally, we conduct an extensive empirical evaluation of unsupervised machine translation using dissimilar domains for the monolingual training data of the source and target languages, and with training and test data drawn from different datasets.

1. Unsupervised MT using iterative back translation: Our first experiment involves UMT using back translation (BT) with the XLM model. Here we demonstrate the feasibility of UMT with BT for low-resource language pairs and also how variation in the monolingual data size affects the translation quality for similar and dissimilar language pairs.

2. Zero-Shot Language Transfer: Our second experiment involves zero-shot translation using the mBART model. We finetune on a related (transfer) language pair and apply the model directly to the source-target language pair. Here we illustrate how the size of the finetuning data affects both the finetuning scores (finetuned on the transfer language → En and tested on the same pair) and the transferring scores (finetuned on the transfer language → En and tested on the source language → En).

3. Domain Dissimilarity: Our third experiment analyses the performance of a model which is trained on one dataset and tested on sentences from an entirely different domain. We use the XLM model and perform back translation on pre-trained models for all language pairs X → En to fine-tune them on the required dataset. The back translation steps include X → En and En → X. The model is then tested on the previously mentioned steps using a dataset sourced from a completely different domain.

3.2 Models

mBART [16] is used for the experiments with zero-shot transfer learning; XLM [14] is used for the experiments with unsupervised machine translation.

1. mBART is the first method for pre-training a complete sequence-to-sequence model by denoising full texts in multiple languages. Pre-training is done on the CC25 dataset and the pre-trained model can be directly fine-tuned for both supervised and unsupervised MT. It also enables a smooth zero-shot transfer to language pairs with no bi-text data. In our experiments, we use mBART25, which is pre-trained on all 25 languages. Ideally, for a fair comparison, unsupervised BT should also be done using mBART25; however, the authors did not provide their implementation for BT, so we shift to the XLM model for the experiments with BT since its implementation was more accessible.

2. XLM is a cross-lingual language model developed for both supervised and unsupervised machine translation tasks. The XLM model uses a shared sub-word vocabulary and is trained on three different language modelling objectives: causal language modeling, masked language modeling and translation language modeling. The model is essentially Transformer based, with 1024 units per layer and 8 attention heads. The cross-lingual XLM is trained on 15 languages and is shown to achieve state-of-the-art results across these languages. Our motivation behind using XLM was that a pretrained XLM model was available for the languages of our choice. It supported two-way back-translation and was comparatively easier to train and fine-tune than most other multilingual models.

3.3 Dataset

Two language pairs were chosen for the experiments in which we investigate the impact of the size of the monolingual training data on translation quality between a similar and a dissimilar language pair: Ro (Romanian) - En (English) and Ne (Nepali) - En, respectively. For Ro and En, data from Wikipedia was used. We perform random sampling to obtain monolingual sentences from the monolingual data provided by the authors of the XLM model (this was originally obtained by scraping articles from Wikipedia). For Nepali, the monolingual data was obtained from the FLoRes data corpus.
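
For the data-size ablation we draw fixed-size subsets from these monolingual corpora. A minimal sketch of this subsampling step is shown below; the file paths are placeholders rather than the actual corpus locations.

    # Sketch of drawing fixed-size monolingual subsets for the data-size ablation.
    # The file paths are placeholders; the actual corpora are the Wikipedia-based
    # monolingual data released with XLM (Ro, En) and the FLoRes data (Ne).
    import random

    def sample_monolingual(in_path, out_path, n, seed=42):
        with open(in_path, encoding="utf-8") as f:
            sentences = [line.strip() for line in f if line.strip()]
        random.Random(seed).shuffle(sentences)
        with open(out_path, "w", encoding="utf-8") as f:
            f.write("\n".join(sentences[:n]) + "\n")

    for size in (10_000, 100_000, 300_000, 600_000):
        sample_monolingual("mono.ro.txt", f"mono.ro.{size}.txt", size)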

For zero-shot transfer learning, we conducted experiments on Ro to En with Cs (Czech) as the transfer language and on Ne to En with Hi (Hindi) as the transfer language. The Cs-En parallel training data is taken from Europarl v7 [11], with 668,595 sentence pairs; the Ro-En test data is taken from the WMT16 test set for Romanian and English, consisting of 1999 sentence pairs. The Hi-En parallel training data is taken from the IIT Bombay English-Hindi corpus [13], with 1,609,682 sentence pairs; the Ne-En test data is taken from [7], with 2835 test sentences.

To analyse the effect of a domain mismatch between training and testing, we use Fr (French), De (German) and Ro (Romanian). In all these experiments the target language is English (En), and back translation is performed both ways. The choice of these specific languages was motivated by the criteria of language similarity and data availability. French and German are related to English, albeit not closely, but still possess a similarity in terms of language family and closeness of vocabulary. Romanian was chosen as a low-resource language distantly related to English. To simulate the domain mismatch, the XLM model was pretrained on WMT'16, which mostly contains parallel data extracted from news articles. The model was then fine-tuned on monolingual data of the four languages obtained from the Tatoeba challenge [21]. For French, English and German, 50K monolingual sentences were used, and for Romanian only 1999 sentences were available. The Tatoeba dataset has sentences with minimal overlap with any other available machine translation dataset; we used the sentences extracted from Wikipedia pages present within the dataset. The validation and test sets were extracted from WMT'16.

3.3.1 Experiment Details

We use the following tokenizers for both preprocessing the training data and evaluation in all three experiments (see the sketch after this list):

• Ne, Hi: We use the Indic-NLP Library to tokenize both Indic-language inputs and outputs [12].

• Ro, Cs, En: We apply Moses tokenization and special normalization for English, Czech and Romanian texts.
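
The exact preprocessing scripts are not reproduced in this report; the sketch below uses the pip-installable indic-nlp-library and sacremoses packages as reasonable stand-ins for the Indic-NLP and Moses tokenizers (the original Moses perl scripts behave analogously).

    # Sketch of the tokenization step, assuming indic-nlp-library and sacremoses
    # as stand-ins for the Indic-NLP and Moses tokenizers.
    from indicnlp.tokenize import indic_tokenize                  # pip install indic-nlp-library
    from sacremoses import MosesPunctNormalizer, MosesTokenizer   # pip install sacremoses

    def tokenize_indic(text, lang):          # lang in {"ne", "hi"}
        return " ".join(indic_tokenize.trivial_tokenize(text, lang=lang))

    def tokenize_moses(text, lang):          # lang in {"ro", "cs", "en"}
        normalized = MosesPunctNormalizer(lang=lang).normalize(text)
        return MosesTokenizer(lang=lang).tokenize(normalized, return_str=True)

    print(tokenize_moses("A low-resource MT example, tokenized for evaluation.", "en"))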

We conduct all the transfer learning experiments on an AWS g4dn.xlarge instance, with max-tokens 512, total-num-update 80000 and dropout 0.3. At inference time, the beam size is set to 5.

Similar to the transfer learning experiments, all back translation experiments were conducted on a g4dn.xlarge instance; the batch size was set to 32 and multiple iterations of back translation were run. The number of tokens per batch was set to a maximum of 1500 and an embedding dimension of 1024 was used. Other hyper-parameters were kept the same as in the implementation provided by the authors of XLM.

For the domain mismatch experiment we use the XLM model as well. We follow the Byte Pair Encoding technique used by the authors of XLM and use fastBPE for it. Tokenization is done with the Moses tokenizer. These experiments were conducted on AWS p3.2xlarge instances. The model uses 6 transformer layers with 8 heads and an embedding dimension of 1024.
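
As an illustration of this preprocessing step, the snippet below applies pre-learned BPE codes with fastBPE's Python bindings; the codes and vocabulary file names are placeholders for the files distributed with the pretrained XLM model.

    # Sketch of applying XLM-style BPE with the fastBPE Python bindings. The
    # codes/vocabulary file names are placeholders for the files shipped with
    # the pretrained XLM model.
    import fastBPE  # pip install fastBPE

    bpe = fastBPE.fastBPE("bpe_codes", "bpe_vocab")      # placeholder file names
    moses_tokenized = "a moses - tokenized english sentence"
    print(bpe.apply([moses_tokenized]))                  # sub-word units joined with "@@ "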

3.4 Evaluation Metrics

For all of our tasks, we use the BLEU score calculated by sacrebleu as the automatic metric to evaluate translation performance. Specifically, we compute the BLEU scores over tokenized text for both the hypothesis outputs and the references, as in Liu et al.
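
A minimal sketch of this scoring setup with the sacrebleu Python API is shown below; passing tokenize="none" prevents sacrebleu from re-tokenizing the already tokenized hypotheses and references. The sentences here are toy examples, not outputs from our models.

    # Sketch of the BLEU computation with sacrebleu over already-tokenized text.
    import sacrebleu

    hypotheses = ["the cat sat on the mat ."]            # tokenized system outputs
    references = ["the cat sat on the mat ."]            # tokenized references
    bleu = sacrebleu.corpus_bleu(hypotheses, [references], tokenize="none")
    print(f"BLEU = {bleu.score:.1f}")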

4 Results and Discussions

In this section we describe the results of the experiments proposed in Section 3 and discuss our findings to illustrate the impact of dataset size and dataset domain for unsupervised machine translation, and of data size for transfer learning.

4.1 Back Translation

As expected with most deep learning approaches, the trained model scales with an increase in (monolingual) training data: the BLEU scores show a general upward trend as more data is made accessible to the XLM model. This is evident in the results shown in Table 1, especially for the high-similarity English-Romanian language pair. In contrast, the performance on the low-similarity English-Nepali pair is degenerate in all data availability scenarios, even when the entire training corpus is provided to the model. This result indicates that it might not be feasible to perform unsupervised MT with back translation alone when the language pair has little or no similarity. In that case, other approaches to unsupervised MT, such as language transfer, are better suited. Language transfer is discussed further in Section 4.2.

                  High Similarity     Low Similarity
    Data Size     en→ro    en←ro      en→ne    en←ne
    10K            2.76     3.93       0.0      0.0
    100K          23.6     23.6        0.0      0.0
    300K          27.7     26.9        0.1      0.1
    600K          28.8     28.3        0.1      0.2
    >10M          33.3     31.8        0.1      0.5

Table 1: Comparison of translation quality (BLEU, back translation) between the similar and dissimilar language pairs and how they scale with an increase in monolingual data.

A big improvement in BLEU score is seen when the training data size is increased from 10,000 to 100,000 sentences for the high-similarity language pair. The performance then increases steadily as more data is provided to the model, while the rate of increase slows down. A reasonable BLEU score for the English-Romanian language pair was reached with just 600,000 sentences from each language.

4.2 Transfer Learning

Zero-shot transfer learning for mBART [16] works as follows: pretrained mBART is finetuned on a related language Y to the target En and is directly applied for testing on the source X to the target language En.
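
At inference time this amounts to decoding the source language with a checkpoint that was finetuned only on the transfer pair. The sketch below shows an analogous, hypothetical setup using the Hugging Face mBART interface (the experiments themselves appear to follow the fairseq mBART setup); "mbart25-ft-cs-en" is a placeholder for a checkpoint finetuned on Cs → En.

    # Hypothetical sketch of zero-shot transfer at inference time with the
    # Hugging Face mBART interface; "mbart25-ft-cs-en" stands for a checkpoint
    # that was finetuned only on Cs -> En bi-text.
    from transformers import MBartForConditionalGeneration, MBartTokenizer

    model = MBartForConditionalGeneration.from_pretrained("mbart25-ft-cs-en")  # hypothetical path
    tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-cc25",
                                               src_lang="ro_RO", tgt_lang="en_XX")

    batch = tokenizer(["Aceasta este o propoziție de test."], return_tensors="pt")
    generated = model.generate(**batch,
                               decoder_start_token_id=tokenizer.lang_code_to_id["en_XX"],
                               num_beams=5)              # beam size 5, as in our experiments
    print(tokenizer.batch_decode(generated, skip_special_tokens=True))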

To investigate how the size of the finetuning data affects the zero-shot BLEU points, we randomly sample 1K, 10K, 100K and 300K sentence pairs from the Europarl v7 [11] Cs-En parallel training corpus and test the finetuned model directly on the Ro-En test data. We also randomly sample 10K, 100K, 300K and 600K sentence pairs from the IIT Bombay English-Hindi corpus [13] and test the finetuned model directly on the Ne-En test data. We report the BLEU points of the finetuned models on both the source-target language pairs (X-En) and the related-target language pairs (Y-En) to better observe the correlation between them. Each experiment is trained using the same hyper-parameters.

    Data Size    Transfer Ro→En    Finetune Cs→En
    0                 0.0               0.0
    1K               17.6              16.2
    10K              23.3              21.9
    100K             23.0              25.7
    300K             20.3              26.5
    668K             14.5              27.5

Table 2: BLEU points for transfer learning and direct finetuning at different data sizes. Pretrained mBART25 is finetuned on Cs to En and directly transferred to Ro to En; we choose the best checkpoint weights validated on the Cs to En dev set.

[Figure 1: Figures depicting the variation of translation quality with the availability of monolingual data for English and Romanian (panels (a) and (b)).]

    Data Size    Transfer Ne→En    Finetune Hi→En
    0                 0.0               0.0
    10K              11.7              12.8
    100K             14.0              18.7
    300K             13.6              19.1
    600K             13.2              19.3
    1.68M            15.1              23.4

Table 3: BLEU points for transfer learning and direct finetuning at different data sizes. Pretrained mBART25 is finetuned on Hi to En and directly transferred to Ne to En; we choose the best checkpoint weights validated on the Hi to En dev set.

It is worth noting that for the model finetuned on Cs-En, we observed a generally negative correlation between the transferring scores and the finetuning scores, as well as a negative correlation between the data size and the transferring scores. In contrast, for the model finetuned on Hi-En there is a positive correlation between the transferring scores and the finetuning scores and a positive correlation between the data size and the transferring scores. One hypothesis is that even though mBART achieves its highest transferring scores for Ro-En when finetuned on Cs-En, and both languages belong to the Indo-European language family, they are still vastly different languages under the orthographic distance metric developed by Heeringa et al. [8], with Czech being in the Slavic language sub-group and Romanian in the Romance language sub-group. Hence, finetuning on a larger dataset makes mBART overfit towards the finetuned language, which hurts the performance of zero-shot translation on the source-target pair due to the limited capacity of the mBART model itself. Ne and Hi, on the other hand, are more closely related to each other and have a large overlap in their vocabularies, so finetuning on Hi greatly benefits the performance of zero-shot translation.

4.3 Comparison between Back Translation and Zero-Shot Transfer Learning

When does back translation work? In Table 1, we do not see much increase in performance when we increase the size of the monolingual data when training on the dissimilar language pair (Ne and En). If high-quality bi-text with a similar language exists, language transfer is able to beat the conventional BT-based methods. For the similar language pair (Ro and En), we do see a gain in BLEU points as the data size increases, and BT can easily beat zero-shot transfer learning with 100K monolingual sentences.

When does transfer learning work? Note that in Table 2 the performance of zero-shot transfer learning suffers from "the curse of data size": simply increasing the size of the finetuning data does not improve the transferring scores, especially when the finetuning language (Cs) is not close enough to the source language (Ro). Romanian is closest to Catalan [8]; however, not much bi-text exists between Catalan and English, and Catalan is also an out-of-vocabulary language for mBART, so it does not benefit much from multilingual pretraining. On the contrary, for languages in the Indic language family, an increase in data size does help both the transferring scores and the direct finetuning scores. Compared with back translation, zero-shot transfer learning outperforms it with just a small amount of bi-text for Hi to En.

                          Training Dataset
    Language Pair     WMT'16    Tatoeba (Wikipedia)
    Fr → En            32.8          27.38
    De → En            34.3          27.44
    Ro → En            18.3           1.74
    En → Fr            33.4          28.33
    En → De            26.4          21.93
    En → Ro            18.9           0.39

Table 4: Domain mismatch.

4.4 Domain Dissimilarity

We present the results of the domain mismatch experiments in Table 4. As expected, there is a consistent decrease in the performance of the model across all languages. The decrease is comparatively smaller for Fr and De than for Ro. French and German are both high-resource languages, and the XLM model for both of these languages was pre-trained as well as finetuned on 50K sentences, whereas for Romanian only 1999 sentences were available in the Tatoeba challenge, and the validation and test data were also smaller. Using back translation in this scenario did not prove to be ideal. This also suggests that even for high-resource languages the domain of the dataset is a big factor in determining the overall translation efficacy of the model. XLM uses a shared sub-word vocabulary, which means there are many words that are used differently in different domains and are thus harder for the model to predict. The results for Romanian could have been better if transfer learning had been used as the UMT technique. The results are particularly interesting because for low-resource languages like Romanian, obtaining domain-specific parallel data is very difficult, and adapting a pre-trained model does not help in the same way as it does for high-resource languages. We wish to explore this further with some additional experiments as well.

5 Conclusion and Future Work

In this work, we compare two different solutions for low-resource language MT: unsupervised MT with BT and zero-shot transfer learning. We focus on how the data size affects the performance of both unsupervised MT and transfer learning, and on how the domain matters in unsupervised MT.

For transfer learning, we find that if the model is transferred from a "not so similar" language, like Cs → En in our experiment, the performance of transfer learning cannot be scaled up by simply increasing the size of the finetuning data. In contrast, for closely related languages, like those in the Indic language family, we observe the opposite trend. However, the capacity of the mBART model has not yet been well investigated, and we still do not have a clear definition of the similarity between the transfer language and the target language. These could be potential topics for future research.

Regarding domain mismatch, we wish to explore whether transfer learning is a better UMT technique for adapting a model to a different domain in the case of low-resource languages. We also want to run more experiments to further solidify and generalize our findings. There has been considerable research in this field, with attempts to learn domain-specific embeddings or to make model-based changes in order to generalize across mismatching domains. We hope to use some of these existing methods to boost the performance of models, specifically for low-resource languages.

References

[1] Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Roee Aharoni, Melvin Johnson, and Wolfgang Macherey. The missing ingredient in zero-shot neural machine translation, 2019.

[2] Mikel Artetxe, Sebastian Ruder, Dani Yogatama, Gorka Labaka, and Eneko Agirre. A call for more rigor in unsupervised cross-lingual learning, 2020.

[3] Yong Cheng. Semi-supervised learning for neural machine translation. In Joint Training for Neural Machine Translation, pages 25–40. Springer, 2019.

[4] Zi-Yi Dou, Junjie Hu, Antonios Anastasopoulos, and Graham Neubig. Unsupervised domain adaptation for neural machine translation with domain-aware feature embeddings, 2019.

[5] Orhan Firat, Baskaran Sankaran, Yaser Al-Onaizan, Fatos T. Yarman Vural, and Kyunghyun Cho. Zero-resource translation with multi-lingual neural machine translation, 2016.

[6] Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loic Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. On using monolingual corpora in neural machine translation. arXiv preprint arXiv:1503.03535, 2015.

[7] Francisco Guzmán, Peng-Jen Chen, Myle Ott, Juan Pino, Guillaume Lample, Philipp Koehn, Vishrav Chaudhary, and Marc'Aurelio Ranzato. Two new evaluation datasets for low-resource machine translation: Nepali-English and Sinhala-English. 2019.

[8] Wilbert Heeringa, Jelena Golubovic, Charlotte Gooskens, Anja Schüppert, Femke Swarte, and Stefanie Voigt. Lexical and orthographic distances between Germanic, Romance and Slavic languages and their relationship to geographic distance, pages 99–137. P.I.E. - Peter Lang, 2013.

[9] Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351, 2017. doi: 10.1162/tacl_a_00065. URL https://www.aclweb.org/anthology/Q17-1024.

[10] Tom Kocmi and Ondřej Bojar. Trivial transfer learning for low-resource neural machine translation. Proceedings of the Third Conference on Machine Translation: Research Papers, 2018. doi: 10.18653/v1/W18-6325. URL http://dx.doi.org/10.18653/v1/W18-6325.

[11] Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Conference Proceedings: the Tenth Machine Translation Summit, pages 79–86, Phuket, Thailand, 2005. AAMT. URL http://mt-archive.info/MTS-2005-Koehn.pdf.

[12] Anoop Kunchukuttan. The IndicNLP Library. https://github.com/anoopkunchukuttan/indic_nlp_library/blob/master/docs/indicnlp.pdf, 2020.

[13] Anoop Kunchukuttan, Pratik Mehta, and Pushpak Bhattacharyya. The IIT Bombay English-Hindi parallel corpus. CoRR, abs/1710.02855, 2017. URL http://arxiv.org/abs/1710.02855.

[14] Guillaume Lample and Alexis Conneau. Cross-lingual language model pretraining, 2019.

[15] Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. Unsupervised machine translation using monolingual corpora only. arXiv preprint arXiv:1711.00043, 2017.

[16] Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. Multilingual denoising pre-training for neural machine translation, 2020.

[17] Minh-Thang Luong and Christopher D. Manning. Stanford neural machine translation systems for spoken language domains. In Proceedings of the International Workshop on Spoken Language Translation, pages 76–79, 2015.

[18] Kelly Marchisio, Kevin Duh, and Philipp Koehn. When does unsupervised machine translation work?, 2020.

[19] Toan Q. Nguyen and David Chiang. Transfer learning across low-resource, related languages for neural machine translation, 2017.

[20] Rico Sennrich, Barry Haddow, and Alexandra Birch. Improving neural machine translation models with monolingual data, 2016.

[21] Jörg Tiedemann. The Tatoeba Translation Challenge – realistic data sets for low resource and multilingual MT, 2020.

[22] Michał Ziemski, Marcin Junczys-Dowmunt, and Bruno Pouliquen. The United Nations parallel corpus v1.0. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 3530–3534, Portorož, Slovenia, May 2016. European Language Resources Association (ELRA). URL https://www.aclweb.org/anthology/L16-1561.

[23] Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. Transfer learning for low-resource neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1568–1575, Austin, Texas, November 2016. Association for Computational Linguistics. doi: 10.18653/v1/D16-1163. URL https://www.aclweb.org/anthology/D16-1163.