Zero-Shot Language Transfer vs Iterative Back Translation for Unsupervised Machine Translation
Aviral Joshi, Chengzhi Huang, Har Simrat Singh
(aviralj, chengzhh, harsimrs)@andrew.cmu.edu

arXiv:2104.00106v1 [cs.CL] 31 Mar 2021

Abstract

This work focuses on comparing different solutions for machine translation on low resource language pairs, namely zero-shot transfer learning and unsupervised machine translation. We discuss how the data size affects the performance of both unsupervised MT and transfer learning. Additionally, we also look at how the domain of the data affects the result of unsupervised MT. The code for all the experiments performed in this project is accessible on GitHub.

1 Introduction

Unsupervised machine translation (UMT) is the task of translating sentences from a source language to a target language without the help of parallel data for training. UMT is especially useful when translating to and from low resource languages, for example English to Nepali or vice versa. Since it is assumed that there is no access to parallel data between the two languages, unsupervised MT techniques focus on leveraging monolingual data to learn a translation model. More generally, there are three data availability scenarios that can dictate the choice of the training procedure and the translation model to be used. These scenarios are as follows:

1. We have no parallel data for X-Z but have a large monolingual corpus. Unsupervised machine translation is very well suited for this scenario.

2. We do not have parallel data for X-Z but we have parallel data for Y-Z and a good amount of monolingual data. Zero-shot language transfer can be more effective in this scenario.

3. Finally, we have some parallel data for X-Z and monolingual data for X and Z. In this scenario, semi-supervised approaches that leverage both monolingual and parallel data can outperform the two other approaches mentioned.

Here X represents the source language (the language to transfer from), Z represents the target language (the language to transfer to) and Y represents a pivot language (an intermediate language) which is similar to the target language Z. In a low resource scenario there is little to no parallel data between X and Z.

The first two data availability scenarios mentioned above are generally more challenging due to the lack of parallel data between the source and target languages, and they are explored in greater detail in this work. Furthermore, our report also discusses the impact of a mismatch between training and evaluation dataset domains and its effect on translation. Our report is structured as follows. We first introduce related work in the domain of unsupervised machine translation with back translation and language transfer, while also discussing works that detail the impact of domain mismatch in training and evaluation datasets for UMT, in Section 2. We then describe the experimental setup in Section 3. Next, we describe the results of our experiments on low-resource language pairs with varying training data sizes and also analyze the impact of out-of-domain datasets on the quality of translation in Section 4. Finally, we conclude with possible directions for future work in the domain of UMT for low-resource languages.

2 Related Work

2.1 Unsupervised Machine Translation Using Monolingual Data
Recent advancement in machine translation can be attributed to the development of training techniques that allow deep learning models to utilize large-scale monolingual data to improve translation performance while reducing the dependency on massive amounts of parallel data. Of the several attempts to leverage monolingual data, three techniques - back-translation, language modelling and auto-encoding - in particular stand out as the most effective and widely accepted strategies for doing so.

Sennrich et al. [20] first introduced the idea of iteratively improving the quality of a machine translation model in a semi-supervised setting by using "back-translation". The idea revolves around training an initial sub-optimal translation system with the help of available parallel data and then using it to translate many sentences from a large monolingual corpus on the target side into the source language, generating augmented parallel data for training the original translation system.

Gulcehre et al. [6] demonstrated an effective strategy to integrate language modelling as an auxiliary task with machine translation, improving translation quality for both high resource and low resource language pairs. A more recent work by Lample and Conneau [14] introduced a cross-lingual language model (XLM) that utilized a cross-lingual language modelling pretraining objective to improve performance on translation.

Previous works have also looked at the task of translation from the perspective of an auto-encoder in order to utilize monolingual data. Cheng [3] introduced an approach which utilized an auto-encoder to reconstruct sentences in monolingual data, where the encoder translates input from the source to the target language and the decoder translates sentences back to the source language. Parallel data is utilized in conjunction with monolingual data to learn the translation system. This method was shown to outperform the original back-translation system proposed by Sennrich et al. [20].

Unlike the previously described semi-supervised approaches, Lample et al. [15] introduce an effective strategy to perform completely unsupervised machine translation in the presence of a large monolingual corpus. Their work leverages a Denoising Auto-Encoder (DAE) to reconstruct denoised sentences from systematically corrupted sentences in the monolingual corpora, training encoders and decoders for both the source and target languages and mapping the latent DAE representations of the two languages into the same space. To aid translation, the authors also introduce an additional adversarial loss term. The DAE, along with iterative back-translation, is utilized to train a completely unsupervised translation system.
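The iterative back-translation loop described above can be summarized with the short sketch below. This is an illustrative outline only: TranslationModel, translate and train_step are hypothetical placeholders standing in for a real NMT toolkit, not the interface of XLM or any other system used in this report.

    from typing import List, Protocol

    class TranslationModel(Protocol):
        """Hypothetical interface standing in for a trainable NMT model."""
        def translate(self, sentences: List[str]) -> List[str]: ...
        def train_step(self, source: List[str], target: List[str]) -> None: ...

    def iterative_back_translation(src2tgt: TranslationModel,
                                   tgt2src: TranslationModel,
                                   mono_src: List[str],
                                   mono_tgt: List[str],
                                   rounds: int = 3) -> None:
        """One possible schedule for unsupervised iterative back-translation."""
        for _ in range(rounds):
            # Back-translate target-side monolingual text with the current tgt->src
            # model and train src->tgt on the (synthetic source, real target) pairs.
            synthetic_src = tgt2src.translate(mono_tgt)
            src2tgt.train_step(source=synthetic_src, target=mono_tgt)

            # Symmetric update for the other direction.
            synthetic_tgt = src2tgt.translate(mono_src)
            tgt2src.train_step(source=synthetic_tgt, target=mono_src)

In practice each train_step would itself combine the synthetic pairs with denoising auto-encoding updates, as in Lample et al. [15].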
2.2 Transfer Learning - Zero-Shot Translation

Even though UMT has been shown to perform well on similar language pairs with a good amount of monolingual data, Artetxe et al. [2] argued that a scenario with no parallel data but abundant monolingual data is unrealistic in practice. One common alternative in this scenario is zero-shot transfer learning, which avoids relying on monolingual data alone. Zero-shot translation concerns the case where direct (source-target) parallel data is lacking but there is parallel data between the target language and a language similar to the source language.

Transfer learning was first proposed for NMT by Zoph et al. [23], who leverage a high-resource parent model to initialize the low-resource child model. Nguyen and Chiang [19] and Kocmi and Bojar [10] utilize shared vocabularies for the source and target languages to improve transfer learning and, in addition, mitigate the vocabulary mismatch by using cross-lingual word embeddings. While these methods show great performance in transfer learning for low resource languages, they show limited effect on zero-shot transfer learning.

Multilinguality has been extensively studied and has been applied to the related problem of zero-shot translation (Johnson et al. [9]; Firat et al. [5]; Arivazhagan et al. [1]). Liu et al. [16] showed some initial results on multilingual unsupervised translation in the low-resource setting. Namely, pretrained mBART25 is finetuned on a transfer-language-to-target-language pair and is applied directly to the source-to-target pair. mBART obtained up to 34.1 BLEU (Nl-En) with zero-shot translation and works well when fine-tuning is conducted within the same language family. Specifically, it achieves 23 BLEU points for Ro-En using Cs as the transfer language, and 18.9 BLEU points for Ne-En using Hi as the transfer language.
2.3 Unsupervised Machine Translation on Out-of-Domain Datasets

Owing to the increasing popularity of unsupervised MT and the techniques involved, adapting UMT to translate data from domains where a large amount of parallel data is not obtainable is also gaining importance. One of the main issues associated with this is the feasibility of using pretrained models. Most pretrained models, even those trained under a supervised learning paradigm, fail to achieve significant translation accuracy on a domain that was absent from the training data. Unsupervised models follow the same pattern. Marchisio et al. [18] show some significant results and comment on the domain similarity of test and training data in state-of-the-art models. The authors use the United Nations Parallel Corpus (UN) [22], News Crawl and CommonCrawl datasets for their work. The model was trained solely on News Crawl while being evaluated on all three datasets, and a general trend of reduced performance was observed across various language pairs. Extensive work has been done to address this problem. Dou et al. [4] propose a domain-aware feature embedding based approach and achieve improvements over previous existing methods. Luong and Manning [17] adopt an attention based approach to fine-tune a pretrained model on out-of-domain data. Sennrich et al. [20] show that for out-of-domain data, back translation helps the models learn a large number of words previously non-existent in the vocabulary, which could be the reason why out-of-domain back-translation-based UMT fails. Our experiments follow a similar line of work and show that fine-tuning a pretrained translation model does not effectively help with out-of-domain data.

3 Experiments

3.1 Proposed Experiments

In our experiments, we focus on how the size of the training data affects the performance of both zero-shot transfer learning and unsupervised machine translation with iterative back translation. Additionally, we conduct an extensive empirical evaluation of unsupervised machine translation using dissimilar domains for the monolingual training data of the source and target languages and with diverse datasets between training data and test data.
1. Unsupervised MT using iterative back translation: Our first experiment involves UMT using back translation (BT) with the XLM model. Here we demonstrate the feasibility of UMT with BT for low-resource language pairs and also how variation in the monolingual data size affects translation quality for similar and dissimilar language pairs.

2. Zero Shot Language Transfer: Our second experiment involves zero-shot translation using the mBART model. We finetune on a related language pair (X → En) and apply the model directly to the source-target (Y → En) language pair. Here we illustrate how the size of the finetuning data affects both finetuning scores (finetuned on X → En and tested on X → En) and transferring scores (finetuned on X → En and tested on Y → En).

3. Domain Dissimilarity: Our third experiment analyses the performance of a model which is trained on one dataset and is tested on sentences from an entirely different domain. We use the XLM model and perform back translation on pretrained models for all language pairs X → En to fine-tune them on the required dataset. The back translation steps include X → En and En → X. The model is then tested on the previously mentioned steps with a dataset sourced from a completely different domain.

3.2 Models

mBART [16] is used for the experiments with zero-shot transfer learning, and XLM [14] is used for the experiments with unsupervised machine translation.

1. mBART is the first method for pre-training a complete sequence-to-sequence model by denoising full texts in multiple languages. Pre-training is done on the CC25 dataset, and the pretrained model can be directly fine-tuned for both supervised and unsupervised MT. It also enables a smooth zero-shot transfer to language pairs with no bi-text data. In our experiments, we use mBART25, which is pretrained on all 25 languages. Ideally, for a fair comparison, unsupervised BT should also be done using mBART25; however, the authors did not provide their implementation for BT, so we shift to the XLM model for the BT experiments since its implementation was more accessible. (A minimal sketch of the zero-shot decoding setup is given after this list.)

2. XLM is a cross-lingual language model developed for both supervised and unsupervised machine translation tasks. The XLM model uses a shared sub-word vocabulary and is trained on three different language modelling objectives: causal language modeling, masked language modeling and translation language modeling. The model is essentially Transformer-based, with 1024-dimensional hidden states and 8 attention heads. The cross-lingual XLM is trained on 15 languages and is shown to achieve state-of-the-art results across these languages. Our motivation behind using XLM was that a pretrained XLM model was available for the languages of our choice. It supported two-way back-translation and was comparatively easier to train and fine-tune than most other multilingual models.
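As a rough illustration of the zero-shot decoding referenced in item 1 above, the snippet below shows how a finetuned mBART25 checkpoint could be applied directly to an unseen source language. It is a sketch under assumptions: our experiments used the original authors' tooling, whereas this example assumes the Hugging Face Transformers port of mBART, and the fine-tuned checkpoint path is purely hypothetical.

    # Sketch only: assumes the Hugging Face Transformers port of mBART-CC25.
    from transformers import MBartForConditionalGeneration, MBartTokenizer

    # Hypothetical local path to an mBART25 model fine-tuned on the transfer pair (Cs -> En).
    checkpoint = "./mbart25-finetuned-cs-en"

    tokenizer = MBartTokenizer.from_pretrained(checkpoint, src_lang="ro_RO", tgt_lang="en_XX")
    model = MBartForConditionalGeneration.from_pretrained(checkpoint)

    # Zero-shot step: Romanian input is decoded into English by a model that
    # never saw Ro-En parallel data during fine-tuning.
    batch = tokenizer(["Acesta este un exemplu de propozitie."], return_tensors="pt")
    outputs = model.generate(
        **batch,
        decoder_start_token_id=tokenizer.lang_code_to_id["en_XX"],  # force English output
        num_beams=5,  # beam size used at inference time in our experiments
    )
    print(tokenizer.batch_decode(outputs, skip_special_tokens=True))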
3.3 Dataset

Two language pairs were chosen for the experiments in which we investigate the impact of the size of the monolingual training data on translation quality between a similar and a dissimilar language pair (Ro (Romanian) - En (English) and Ne (Nepali) - En, respectively). For Ro and En, data from Wikipedia was used: we perform random sampling to obtain monolingual sentences from the monolingual data provided by the authors of the XLM model (originally obtained by scraping Wikipedia articles). For Nepali, the monolingual data was obtained from the Flores data corpus.

For zero-shot transfer learning, we conducted experiments on Ro to En with Cs (Czech) as the transfer language and on Ne to En with Hi (Hindi) as the transfer language. The Cs-En parallel training data is taken from Europarl v7 [11], with 668,595 sentence pairs; the Ro-En test data is taken from the WMT16 test set for Romanian and English, consisting of 1999 sentence pairs. The Hi-En parallel training data is taken from the IIT Bombay English-Hindi corpus [13], with 1,609,682 sentence pairs; the Ne-En test data is taken from [7], with 2835 test sentences.

To analyse the effect of a domain mismatch between training and testing, we use French (Fr), German (De) and Romanian (Ro). In all these experiments the target language is English (En), and back translation is performed both ways. The choice of these specific languages was motivated by language similarity and the availability of data. French and German are related to English, albeit not closely, but still possess a similarity in terms of language family and closeness of vocabulary. Romanian was chosen as a low resource language distantly related to English. To simulate domain mismatch, the XLM model was pretrained on WMT'16, which mostly has parallel data extracted from news articles. The model was then fine-tuned on monolingual data for the four languages obtained from the Tatoeba challenge [21]. For French, English and German, 50K monolingual sentences were used, while for Romanian only 1999 sentences were available. The Tatoeba dataset has minimal sentence overlap with any other available machine translation dataset; we used the sentences extracted from Wikipedia pages present within the dataset. The validation and test sets were extracted from WMT'16.
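The random sampling of monolingual subsets mentioned above amounts to drawing a fixed number of lines from a one-sentence-per-line corpus. A minimal sketch is shown below; the file names and subset sizes are placeholders, not the exact files used in our pipeline.

    import random

    def sample_monolingual_subset(corpus_path, out_path, n, seed=0):
        """Draw n random sentences from a one-sentence-per-line monolingual corpus."""
        with open(corpus_path, encoding="utf-8") as f:
            sentences = [line.strip() for line in f if line.strip()]
        random.Random(seed).shuffle(sentences)
        with open(out_path, "w", encoding="utf-8") as f:
            f.write("\n".join(sentences[:n]) + "\n")

    # Placeholder file names; the sizes mirror the subsets evaluated later in Table 1.
    for size in (10_000, 100_000, 300_000, 600_000):
        sample_monolingual_subset("mono.ro", f"mono.ro.{size}", size)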
3.3.1 Experiment Details

We use the following tokenizers both for preprocessing the training data and for evaluation in all three experiments:

• Ne, Hi: We use the Indic-NLP Library [12] to tokenize both the Indic-language inputs and outputs.

• Ro, Cs, En: We apply Moses tokenization and special normalization for the English, Czech and Romanian texts.

We conduct all the transfer learning experiments on an AWS g4dn.xlarge instance, with max-tokens 512, total-num-update 80000 and dropout 0.3. At inference time, the beam size is set to 5.

As with the transfer learning experiments, all back translation experiments were conducted on a g4dn.xlarge instance; the batch size was set to 32 and multiple iterations of back translation were run. The number of tokens per batch was set to a maximum of 1500 and an embedding dimension of 1024 was used. Other hyper-parameters were kept the same as in the implementation provided by the authors of XLM.

For the domain mismatch experiment we use the XLM model as well. We follow the Byte Pair Encoding scheme used by the authors of XLM and use fastBPE for this purpose. Tokenization is done with the Moses tokenizer. The experiments were conducted on AWS p3.2xlarge instances. The model used 6 Transformer layers with 8 heads, and the embedding dimension was 1024.
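The preprocessing described above can be approximated with the sketch below. It only illustrates the two tokenization paths: sacremoses is used as a Python stand-in for the Moses scripts, and the language codes passed to both libraries are assumptions about their conventions rather than the exact flags used in our scripts.

    from indicnlp.tokenize import indic_tokenize
    from sacremoses import MosesPunctNormalizer, MosesTokenizer

    def tokenize_indic(text, lang):
        """Tokenize Nepali/Hindi text with the Indic NLP Library."""
        return " ".join(indic_tokenize.trivial_tokenize(text, lang=lang))

    def tokenize_moses(text, lang):
        """Normalize punctuation and tokenize En/Cs/Ro text Moses-style."""
        normalized = MosesPunctNormalizer(lang=lang).normalize(text)
        return MosesTokenizer(lang=lang).tokenize(normalized, return_str=True)

    print(tokenize_indic("यो एउटा उदाहरण वाक्य हो ।", lang="ne"))
    print(tokenize_moses("This is an example sentence.", lang="en"))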
3.4 Evaluation Metrics

For all of our tasks, we use the BLEU score calculated by sacrebleu as the automatic metric to evaluate translation performance. Specifically, we compute BLEU scores over tokenized text for both the hypothesis outputs and the references, as in Liu et al. [16].
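A minimal example of this scoring setup is given below. The hypothesis and reference strings are made up for illustration; sacrebleu's internal tokenization is disabled on the assumption that both sides are already tokenized, as described above.

    import sacrebleu

    hypotheses = ["the cat sat on the mat ."]   # system outputs, one per line (made up)
    references = ["the cat sat on the mat ."]   # one reference per hypothesis (made up)

    # Scores are computed over pre-tokenized text, so sacrebleu's own tokenizer is
    # turned off; drop tokenize="none" when scoring detokenized output instead.
    bleu = sacrebleu.corpus_bleu(hypotheses, [references], tokenize="none")
    print(f"BLEU = {bleu.score:.2f}")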
4 Results and Discussions

In this section we describe the results of the experiments proposed in Section 3 and discuss our findings to illustrate the impact of dataset size and dataset domain on unsupervised machine translation, and of data size on transfer learning.

4.1 Back Translation

As expected with most deep learning approaches, the trained model scales with an increase in (monolingual) training data: the BLEU scores show a general upward trend as more data is made accessible to the XLM model. This is evident in the results shown in Table 1, especially for the high-similarity English-Romanian language pair. In contrast, the performance on the low-similarity English-Nepali pair is degenerate in all data availability scenarios, even when the entire training corpus is provided to the model. This result indicates that it might not be feasible to perform unsupervised MT with back translation alone when the language pairs have little or no similarity; in that case other approaches to unsupervised MT, such as language transfer, are better suited. More about language transfer is discussed in Section 4.2.

A big improvement in BLEU score is seen when the training data size is increased from 10,000 to 100,000 sentences for the high-similarity language pair. The performance then increases steadily as more data is provided to the model, while the rate of increase slows down. A reasonable BLEU score for the high-similarity language pair was reached with just 600,000 sentences from each language.

                BLEU - Back Translation
    Data Size   en→ro   ro→en   en→ne   ne→en
    10K          2.76    3.93    0.0     0.0
    100K        23.6    23.6     0.0     0.0
    300K        27.7    26.9     0.1     0.1
    600K        28.8    28.3     0.1     0.2
    >10M        33.3    31.8     0.1     0.5

Table 1: Comparison of translation quality between similar and dissimilar language pairs and how it scales with the amount of monolingual data.

[Figure 1: plots (a) and (b) depicting the variation of translation quality with the availability of monolingual data for English and Romanian.]

4.2 Transfer Learning

Zero-shot transfer learning for mBART [16] works as follows: pretrained mBART is finetuned on a related language Y to the target En and is directly applied for testing on the source X to the target language En.

In order to investigate how the size of the training data affects the zero-shot BLEU points, we randomly sample 1K, 10K, 100K and 300K sentences from the Europarl v7 [11] Cs-En parallel training corpus and test the finetuned model directly on the Ro-En test data; we also randomly sample 10K, 100K, 300K and 600K sentences from the IIT Bombay English-Hindi corpus [13] and test the finetuned model directly on the Ne-En test data. We present the BLEU points of the finetuned models on both the source-target language pairs (X-En) and the related-target language pairs (Y-En) to better observe the correlation between them. Each experiment is trained using the same hyper-parameters.

    Data Size   Transfer (Ro→En)   Finetune (Cs→En)
    0                 0.0                0.0
    1K               17.6               16.2
    10K              23.3               21.9
    100K             23.0               25.7
    300K             20.3               26.5
    668K             14.5               27.5

Table 2: BLEU points for transfer learning and direct finetuning at different data sizes. Pretrained mBART25 is finetuned on Cs to En and directly transferred to Ro to En; we choose the best checkpoint weights validated on the Cs to En dev set.

It is worth noting that for the model finetuned on Cs-En, we observed a generally negative correlation between transferring scores and finetuning scores, as well as a negative correlation between the data size and the transferring scores. In contrast, for the model finetuned on Hi-En there is a positive correlation between transferring scores and finetuning scores, and a positive correlation between the data size and the transferring scores.
One hypothesis is that even though mBART achieves its highest transferring scores for Ro-En when finetuned on Cs-En, and both languages belong to the Indo-European family, Czech and Romanian are still vastly different languages according to the orthographic distance metric developed by Heeringa et al. [8], with Czech belonging to the Slavic sub-group and Romanian to the Romance sub-group. Hence, finetuning on a larger dataset makes mBART overfit towards the finetuned language, which hurts the performance of zero-shot translation on the source-target pair due to the limited capacity of the mBART model itself. Ne and Hi, on the other hand, are more closely related to each other and have a large overlap in their vocabularies, so finetuning on Hi greatly benefits the performance of zero-shot translation.

    Data Size   Transfer (Ne→En)   Finetune (Hi→En)
    0                 0.0                0.0
    10K              11.7               12.8
    100K             14.0               18.7
    300K             13.6               19.1
    600K             13.2               19.3
    1.68M            15.1               23.4

Table 3: BLEU points for transfer learning and direct finetuning at different data sizes. Pretrained mBART25 is finetuned on Hi to En and directly transferred to Ne to En; we choose the best checkpoint weights validated on the Hi to En dev set.

4.3 Comparison between Back Translation and Zero-Shot Transfer Learning

When does back translation work? In Table 1, we do not see much increase in performance when we increase the amount of monolingual training data for the dissimilar language pair (Ne and En). If high quality bi-text data for a similar language exists, language transfer is able to beat the conventional BT-based methods. For the similar language pair (Ro and En), we do see gains in BLEU points as the data size increases, and BT can easily beat zero-shot transfer learning with 100K monolingual sentences.

When does transfer learning work? Note that in Table 2, the performance of zero-shot transfer learning suffers from "the curse of data size": simply increasing the size of the finetuning data does not improve the transferring scores, especially when the finetuning language (Cs) is not close enough to the target language (Ro). Romanian is closest to Catalan [8]; however, not much bi-text exists between Catalan and English, and Catalan is also an out-of-vocabulary language for mBART, so it does not benefit much from multilingual pretraining. On the contrary, for languages in the Indic language family, an increase in data size does help both the transferring scores and the direct finetuning scores. Compared with back translation, zero-shot transfer learning outperforms it with just a small amount of Hi-En bi-text.
                     Training Dataset
    Language Pair   WMT'16   Tatoeba (Wikipedia)
    Fr → En          32.8         27.38
    De → En          34.3         27.44
    Ro → En          18.3          1.74
    En → Fr          33.4         28.33
    En → De          26.4         21.93
    En → Ro          18.9          0.39

Table 4: Domain mismatch: BLEU on the WMT'16 test sets by training dataset.

4.4 Domain Dissimilarity

We present the results of the domain mismatch experiments in Table 4. As expected, there is a consistent decrease in the performance of the model across all languages. The decrease is comparatively smaller for Fr and De than for Ro. French and German are both high resource languages, and the XLM model for both of these languages was pretrained as well as finetuned on 50K sentences, whereas for Romanian only 1999 sentences were available in the Tatoeba challenge, and the validation and test data were also smaller. Using back translation in this scenario did not prove to be ideal. This also suggests that even for high resource languages the domain of the dataset is a big factor in determining the overall translation efficacy of the model. XLM uses a shared sub-word vocabulary, which means there are many words that are used differently in different domains and are thus harder for the model to predict. The results for Romanian could have been better if transfer learning had been used as the UMT technique. The results are particularly interesting because for low resource languages like Romanian, obtaining domain-specific parallel data is very hard, and adapting a pretrained model does not help in the same way as it does for high resource languages. We wish to explore this further with additional experiments.

5 Conclusion and Future Work

In this work, we compare two different solutions for low resource language MT: unsupervised MT with BT and zero-shot transfer learning. We focus on how data size affects the performance of both unsupervised MT and transfer learning, and on how the domain matters in unsupervised MT.

For transfer learning, we find that if the model is transferred from a "not so similar" language, as with Cs → En in our experiment, performance cannot be scaled up by simply increasing the size of the finetuning data. In contrast, for closely related languages, such as languages in the Indic language family, we observe the opposite trend. However, the capacity of the mBART model has not yet been well investigated, and we still do not have a clear definition of the similarity between the transfer language and the target language. These could be potential topics for future research.

Regarding domain mismatch, we wish to explore whether transfer learning is a better UMT technique for adapting a model to a different domain in the case of low resource languages. We also want to run more experiments to further solidify and generalize our findings. There has been substantial research in this field, with attempts to learn domain-specific embeddings or to make model-based changes in order to generalize across mismatched domains. We hope to use some of these existing methods to boost the performance of models specifically in the case of low resource languages.
References

[1] Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Roee Aharoni, Melvin Johnson, and Wolfgang Macherey. The missing ingredient in zero-shot neural machine translation, 2019.

[2] Mikel Artetxe, Sebastian Ruder, Dani Yogatama, Gorka Labaka, and Eneko Agirre. A call for more rigor in unsupervised cross-lingual learning, 2020.

[3] Yong Cheng. Semi-supervised learning for neural machine translation. In Joint Training for Neural Machine Translation, pages 25–40. Springer, 2019.

[4] Zi-Yi Dou, Junjie Hu, Antonios Anastasopoulos, and Graham Neubig. Unsupervised domain adaptation for neural machine translation with domain-aware feature embeddings, 2019.

[5] Orhan Firat, Baskaran Sankaran, Yaser Al-Onaizan, Fatos T. Yarman Vural, and Kyunghyun Cho. Zero-resource translation with multi-lingual neural machine translation, 2016.

[6] Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loic Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. On using monolingual corpora in neural machine translation. arXiv preprint arXiv:1503.03535, 2015.

[7] Francisco Guzmán, Peng-Jen Chen, Myle Ott, Juan Pino, Guillaume Lample, Philipp Koehn, Vishrav Chaudhary, and Marc'Aurelio Ranzato. Two new evaluation datasets for low-resource machine translation: Nepali-English and Sinhala-English, 2019.

[8] Wilbert Heeringa, Jelena Golubovic, Charlotte Gooskens, Anja Schüppert, Femke Swarte, and Stefanie Voigt. Lexical and orthographic distances between Germanic, Romance and Slavic languages and their relationship to geographic distance, pages 99–137. P.I.E. - Peter Lang, 2013.

[9] Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351, 2017. doi: 10.1162/tacl_a_00065. URL https://www.aclweb.org/anthology/Q17-1024.

[10] Tom Kocmi and Ondřej Bojar. Trivial transfer learning for low-resource neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, 2018. doi: 10.18653/v1/W18-6325. URL http://dx.doi.org/10.18653/v1/W18-6325.

[11] Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Conference Proceedings: the Tenth Machine Translation Summit, pages 79–86, Phuket, Thailand, 2005. AAMT. URL http://mt-archive.info/MTS-2005-Koehn.pdf.

[12] Anoop Kunchukuttan. The IndicNLP Library. https://github.com/anoopkunchukuttan/indic_nlp_library/blob/master/docs/indicnlp.pdf, 2020.

[13] Anoop Kunchukuttan, Pratik Mehta, and Pushpak Bhattacharyya. The IIT Bombay English-Hindi parallel corpus. CoRR, abs/1710.02855, 2017. URL http://arxiv.org/abs/1710.02855.

[14] Guillaume Lample and Alexis Conneau. Cross-lingual language model pretraining, 2019.

[15] Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. Unsupervised machine translation using monolingual corpora only. arXiv preprint arXiv:1711.00043, 2017.

[16] Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. Multilingual denoising pre-training for neural machine translation, 2020.

[17] Minh-Thang Luong and Christopher D. Manning. Stanford neural machine translation systems for spoken language domains. In Proceedings of the International Workshop on Spoken Language Translation, pages 76–79, 2015.

[18] Kelly Marchisio, Kevin Duh, and Philipp Koehn. When does unsupervised machine translation work?, 2020.

[19] Toan Q. Nguyen and David Chiang. Transfer learning across low-resource, related languages for neural machine translation, 2017.

[20] Rico Sennrich, Barry Haddow, and Alexandra Birch. Improving neural machine translation models with monolingual data, 2016.

[21] Jörg Tiedemann. The Tatoeba translation challenge - realistic data sets for low resource and multilingual MT, 2020.

[22] Michał Ziemski, Marcin Junczys-Dowmunt, and Bruno Pouliquen. The United Nations parallel corpus v1.0. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 3530–3534, Portorož, Slovenia, May 2016. European Language Resources Association (ELRA). URL https://www.aclweb.org/anthology/L16-1561.

[23] Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. Transfer learning for low-resource neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1568–1575, Austin, Texas, November 2016. Association for Computational Linguistics. doi: 10.18653/v1/D16-1163. URL https://www.aclweb.org/anthology/D16-1163.