UDS-DFKI Submission to the WMT2019 Similar Language Translation Shared Task

Page created by Harvey Glover

Cars & Machinery

English

Like
Share
Embed
Fullscreen
Slides
Download HTML
Download PDF
Abuse

←

→

Page content transcription

If your browser does not render page correctly, please read the page content below

UDS–DFKI Submission to the WMT2019 Similar Language Translation
                              Shared Task
                    Santanu Pal1,3 , Marcos Zampieri2 , Josef van Genabith1,3
        1
         Department of Language Science and Technology, Saarland University, Germany
2
  Research Institute for Information and Language Processing, University of Wolverhampton, UK
                   3
                     German Research Center for Artificial Intelligence (DFKI),
                              Saarland Informatics Campus, Germany
                              santanu.pal@uni-saarland.de

                       Abstract                                ing, development, and testing data from the fol-
                                                               lowing language pairs: Spanish - Portuguese (Ro-
     In this paper we present the UDS-DFKI sys-
                                                               mance languages), Czech - Polish (Slavic lan-
     tem submitted to the Similar Language Trans-
     lation shared task at WMT 2019. The first                 guages), and Hindi - Nepali (Indo-Aryan lan-
     edition of this shared task featured data from            guages). Participant could submit system outputs
     three pairs of similar languages: Czech and               to any of the three language pairs in any direction.
     Polish, Hindi and Nepali, and Portuguese and              The shared task attracted a good number of par-
     Spanish. Participants could choose to partic-             ticipants and the performance of all entries was
     ipate in any of these three tracks and sub-               evaluated using popular MT automatic evaluation
     mit system outputs in any translation direc-              metrics, namely BLEU (Papineni et al., 2002) and
     tion. We report the results obtained by our
                                                               TER (Snover et al., 2006).
     system in translating from Czech to Polish and
     comment on the impact of out-of-domain test                  In this paper we describe the UDS-DFKI sys-
     data in the performance of our system. UDS-               tem to the WMT 2019 Similar Language Transla-
     DFKI achieved competitive performance rank-               tion task. The system achieved competitive per-
     ing second among ten teams in Czech to Polish             formance and ranked second among ten entries
     translation.                                              in Czech to Polish translation in terms of BLEU
                                                               score.
1    Introduction
The shared tasks organized annually at WMT pro-                2   Related Work
vide important benchmarks used in the MT com-
munity. Most of these shared tasks include En-                 With the widespread use of MT technology and
glish data, which contributes to make English the              the commercial and academic success of NMT,
most resource-rich language in MT and NLP. In                  there has been more interest in training systems
the most popular WMT shared task for example,                  to translate between languages other than English
the News task, MT systems have been trained to                 (Costa-jussà, 2017). One reason for this is the
translate texts from and to English (Bojar et al.,             growing need of direct translation between pairs
2016, 2017).                                                   of similar languages, and to a lesser extent lan-
   This year, we have observed a shift on the                  guage varieties, without the use of English as a
dominant role that English on the WMT shared                   pivot language. The main challenge is to over-
tasks. The News task featured for the first time               come the limitation of available parallel data tak-
two language pairs which did not include En-                   ing advantage of the similarity between languages.
glish: German-Czech and French-German. In ad-                  Studies have been published on translating be-
dition to that, the Similar Language Translation               tween similar languages (e.g. Catalan - Spanish
was organized for the first time at WMT 2019 with              (Costa-jussà, 2017)) and language varieties such
the purpose of evaluating the performance of MT                as European and Brazilian Portuguese (Fancellu
systems on three pairs of similar languages from               et al., 2014; Costa-jussà et al., 2018). The study
three different language families: Ibero-Romance,              by Lakew et al. (2018) tackles both training MT
Indo-Aryan, and Slavic.                                        systems to translate between European–Brazilian
   The Similar Language Translation (Barrault                  Portuguese and European–Canadian French, and
et al., 2019) task provided participants with train-           two pairs of similar languages Croatian–Serbian

                                                         917
    Proceedings of the Fourth Conference on Machine Translation (WMT), Volume 2: Shared Task Papers, pages 917–921
                    Florence, Italy, August 1-2, 2019. c 2019 Association for Computational Linguistics

and Indonesian–Malay. et al., 2011), which are very similar to the in-
Processing similar languages and language va- domain data (for our case the development set),
rieties has attracted attention not only in the MT and (ii) transferenceALL, utilizing all the released
community but in NLP in general. This is evi- out-domain data sorted by Equation 1.
denced by a number of research papers published The transference500Ktraining set is prepared
in the last few years and the recent iterations using in-domain (development set) bilingual
of the VarDial evaluation campaign which fea- cross-entropy difference for data selection as was
tured multiple shared tasks on topics such as di- described in Axelrod et al. (2011). The differ-
alect detection, morphosyntactic tagging, cross- ence in cross-entropy is computed based on two
lingual parsing, cross-lingual morphological anal- language models (LM): a domain-specific LM is
ysis (Zampieri et al., 2018, 2019). estimated from the in-domain (containing 2050
sentences) corpus (lmi ) and the out-domain LM
3 Data (lmo ) is estimated from the eScape corpus. We
rank the eScape corpus by assigning a score to
We used the Czech–Polish dataset provided by the
each of the individual sentences which is the sum
WMT 2019 Similar Language Translation task or-
of the three cross-entropy (H) differences. For a
ganizers for our experiments. The released paral-
j th sentence pair srcj –trg j , the score is calculated
lel dataset consists of out-of-domain (or general-
based on Equation 1.
domain) data only and it differs substantially from
the released development set which is part of a
TED corpus. The parallel data includes Europarl
score = |Hsrc (srcj , lmi ) − Hsrc (srcj , lmo )|
v9, Wiki-titles v1, and JRC-Acquis. We com-
bine all the released data and prepare a large out- + |Htrg (trgj , lmi ) − Htrg (trgj , lmo )| (1)
domain dataset.

3.1 Pre-processing 4 System Architecture - The
Transference Model
The out-domain data is noisy for our purposes, so
we apply methods for cleaning. We performed the Our transference model extends the original
following two steps: (i) we use the cleaning pro- transformer model to multi-encoder based trans-
cess described in Pal et al. (2015), and (ii) we exe- former architecture. The transformer architec-
cute the Moses (Koehn et al., 2007) corpus clean- ture (Vaswani et al., 2017) is built solely upon such
ing scripts with minimum and maximum number attention mechanisms completely replacing recur-
of tokens set to 1 and 100, respectively. After rence and convolutions. The transformer uses po-
cleaning, we perform punctuation normalization, sitional encoding to encode the input and output
and then we use the Moses tokenizer to tokenize sequences, and computes both self- and cross-
the out-domain corpus with ‘no-escape’ option. attention through so-called multi-head attentions,
Finally, we apply true-casing. which are facilitated by parallelization. We use
The cleaned version of the released data, i.e., multi-head attention to jointly attend to informa-
the General corpus containing 1,394,319 sen- tion at different positions from different represen-
tences, is sorted based on the score in Equation tation subspaces.
1. Thereafter, We split the entire data (1,394,319) The first encoder (enc1 ) of our model encodes
into two sets; we use the first 1,000 for valida- word form information of the source (fw ), and
tion and the remaining as training data. The re- a second sub-encoder (enc2 ) to encode sub-word
leased development set (Dev) is used as test data (byte-pair-encoding) information of the source
for our experiment. It should be noted noted that, (fs ). Additionally, a second encoder (encsrc→mt )
we exclude 1,000 sentences from the General cor- which takes the encoded representation from the
pus which are scored as top (i.e., more in-domain enc1 , combines this with the self-attention-based
like) during the data selection process. encoding of fs (enc2 ), and prepares a represen-
We prepare two parallel training sets from tation for the decoder (dece ) via cross-attention.
the aforementioned training data: (i) transfer- Our second encoder (enc1→2 ) can be viewed as
ence500K(presented next), collected 500,000 par- a transformer based NMT’s decoding block, how-
allel data through data selection method (Axelrod ever, without masking. The intuition behind our

918

architecture is to generate better representations 5 Experiments
via both self- and cross-attention and to further fa-
cilitate the learning capacity of the feed-forward We explore our transference model –a two-
layer in the decoder block. In our transference encoder based transformer architecture, in CS-PL
model, one self-attended encoder for fw , fw = similar language translation task.
(w1 , w2 , . . . , wk ), returns a sequence of contin- 5.1 Experiment Setup
uous representations, enc2 , and a second self-
attended sub-encoder for fs , fs = (s1 , s2 , . . . , sl ), For transferenceALL, we initially train on the
returns another sequence of continuous represen- complete out-of-domain dataset (General). The
tations, enc2 . Self-attention at this point pro- General data is sorted based on their in-domain
vides the advantage of aggregating information similarities as described in Equation 1.
from all of the words, including fw and fs , and transferenceALLmodels are then fine-tuned to-
successively generates a new representation per wards the 500K (in-domain-like) data. Finally,
word informed by the entire fw and fs context. we perform checkpoint averaging using the 8 best
The internal enc2 representation performs cross- checkpoints. We report the results on the provided
attention over enc1 and prepares a final repre- development set, which we use as a test set before
sentation (enc1→2 ) for the decoder (dece ). The the submission. Additionally we also report the
decoder generates the e output in sequence, e = official test set result.
(e1 , e2 , . . . , en ), one word at a time from left to To handle out-of-vocabulary words and to re-
right by attending to previously generated words duce the vocabulary size, instead of considering
as well as the final representations (enc1→2 ) gen- words, we consider subword units (Sennrich et al.,
erated by the encoder. 2016) by using byte-pair encoding (BPE). In the
We use the scale-dot attention mechanism (like preprocessing step, instead of learning an explicit
Vaswani et al. (2017)) for both self- and cross- mapping between BPEs in the Czech (CS) and
attention, as defined in Equation 2, where Q, K Polish (PL), we define BPE tokens by jointly pro-
and V are query, key and value, respectively, and cessing all parallel data. Thus, CS and PL derive
dk is the dimension of K. a single BPE vocabulary. Since CS and PL be-
long to the similar language, they naturally share
QK T a good fraction of BPE tokens, which reduces the
attention(Q, K, V ) = sof tmax( √ )V (2) vocabulary size.
dk
We pass word level information on the first en-
The multi-head attention mechanism in the trans- coder and the BPE information to the second one.
former network maps the Q, K, and V matrices by On the decoder side of the transference model we
using different linear projections. Then h paral- pass only BPE text.
lel heads are employed to focus on different parts We evaluate our approach with development
in V. The ith multi-head attention is denoted by data which is used as test case before submission.
headi in Equation 3. headi is linearly learned by We use BLEU (Papineni et al., 2002) and TER
three projection parameter matrices: WiQ , WiK ∈ (Snover et al., 2006).
Rdmodel ×dk , WiV ∈ Rdmodel ×dv ; where dk = dv =
dmodel /h, and dmodel is the number of hidden units 5.2 Hyper-parameter Setup
of our network. We follow a similar hyper-parameter setup for all
reported systems. All encoders, and the decoder,
headi = attention(QWiQ , KWiK , V WiV ) (3) are composed of a stack of Nf w = Nf s = Nes =
6 identical layers followed by layer normalization.
Finally, all the vectors produced by parallel heads
Each layer again consists of two sub-layers and a
are linearly projected using concatenation and
residual connection (He et al., 2016) around each
form a single vector, called a multi-head attention
of the two sub-layers. We apply dropout (Srivas-
(Matt ) (cf. Equation 4). Here the dimension of the
tava et al., 2014) to the output of each sub-layer,
learned weight matrix W O is Rdmodel ×dmodel .
before it is added to the sub-layer input and nor-
malized. Furthermore, dropout is applied to the
Matt (Q, K, V ) = Concatni:1 (headi )W O (4)
sums of the word embeddings and the correspond-
ing positional encodings in both encoders as well

919

as the decoder stacks. model. Similar improvement is also observed in
We set all dropout values in the network to terms of TER (-16.9 absolute). It is to be noted that
0.1. During training, we employ label smoothing our generic model is trained solely on the clean
with value ls = 0.1. The output dimension pro- version of training data.
duced by all sub-layers and embedding layers is Before submission, we performed punctuation
dmodel = 512. Each encoder and decoder layer normalization, unicode normalization, and detok-
contains a fully connected feed-forward network enization for the run.
(F F N ) having dimensionality of dmodel = 512 In Table 2 we present the ranking of the compe-
for the input and output and dimensionality of tition provided by the shared task organizers. Ten
df f = 2048 for the inner layers. For the scaled entries were submitted by five teams and are or-
dot-product attention, the input consists of queries dered by BLEU score. TER is reported for all
and keys of dimension dk , and values of dimen- submissions which achieved BLEU score greater
sion dv . As multi-head attention parameters, we than 5.0. The type column specifies the type of
employ h = 8 for parallel attention layers, or system, whether it is a Primary (P) or Constrastive
heads. For each of these we use a dimensional- (C) entry.
ity of dk = dv = dmodel /h = 64. For optimiza-
tion, we use the Adam optimizer (Kingma and Ba, Team Type BLEU TER
2015) with β1 = 0.9, β2 = 0.98 and = 10−9 . UPC-TALP P 7.9 85.9
The learning rate is varied throughout the train- UDS-DFKI P 7.6 87.0
ing process, and increasing for the first training Uhelsinki P 7.1 87.4
steps warmupsteps = 8000 and afterwards de- Uhelsinki C 7.0 87.3
creasing as described in (Vaswani et al., 2017). All Incomslav C 5.9 88.4
remaining hyper-parameters are set analogously to Uhelsinki C 5.9 88.4
those of the transformer’s base model. At training Incomslav P 3.2 -
time, the batch size is set to 25K tokens, with a Incomslav C 3.1 -
maximum sentence length of 256 subwords, and UBC-NLP C 2.3 -
a vocabulary size of 28K. After each epoch, the UBC-NLP P 2.2 -
training data is shuffled. After finishing training,
Table 2: Rank table for Czech to Polish Translation
we save the 5 best checkpoints which are written
at each epoch. Finally, we use a single model ob-
Our system was ranked second in the competition
tained by averaging the last 5 checkpoints. During
only 0.3 BLEU points behind the winning team
decoding, we perform beam search with a beam
UPC-TALP. The relative low BLEU and high TER
size of 4. We use shared embeddings between CS
scores obtained by all teams are due to out-of-
and PL in all our experiments.
domain data provided in the competition which
6 Results made the task equally challenging to all partici-
pants.
We present the results obtained by our system in
Table 1. 7 Conclusion

tested on model BLEU TER This paper presented the UDS-DFKI system sub-
Dev set Generic 12.2 75.8 mitted to the Similar Language Translation shared
Dev set Fine-tuned* 25.1 58.9 task at WMT 2019. We presented the results ob-
Test set Generic 7.1 89.3 tained by our system in translating from Czech
Test set Fine-Tuned* 7.6 87.0 to Polish. Our system achieved competitive per-
formance ranking second among ten teams in the
Table 1: Results for CS–PL Translation; * averaging 8 competition in terms of BLEU score. The fact that
best checkpoints. out-of-domain data was provided by the organiz-
ers resulted in a challenging but interesting sce-
Our fine-tuned system on development set pro- nario for all participants.
vides significant performance improvement over In future work, we would like to investigate how
the generic model. We found +12.9 abso- effective is the proposed hypothesis (i.e., word-
lute BLEU points improvement over the generic BPE level information) in similar language trans-

920

lation. Furthermore, we would like to explore Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian
the similarity between these two languages (and Sun. 2016. Deep Residual Learning for Image
Recognition. Proceedings of CVPR.
the other two language pairs in the competition)
in more detail by training models that can best Diederik P Kingma and Jimmy Lei Ba. 2015. Adam:
capture morphological differences between them. A Method for Stochastic Optimization. Proceedings
During such competitions, this is not always pos- of ICLR.
sible due to time constraints. Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris
Callison-Burch, Marcello Federico, Nicola Bertoldi,
Acknowledgments Brooke Cowan, Wade Shen, Christine Moran,
Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra
This research was funded in part by the Ger- Constantin, and Evan Herbst. 2007. Moses: Open
man research foundation (DFG) under grant num- Source Toolkit for Statistical Machine Translation.
ber GE 2819/2-1 (project MMPE) and the Ger- In Proceedings of ACL.
man Federal Ministry of Education and Research Surafel M Lakew, Aliia Erofeeva, and Marcello Fed-
(BMBF) under funding code 01IW17001 (project erico. 2018. Neural Machine Translation into Lan-
Deeplee). The responsibility for this publication guage Varieties. arXiv preprint arXiv:1811.01064.
lies with the authors. We would like to thank the Santanu Pal, Sudip Naskar, and Josef van Genabith.
anonymous WMT reviewers for their valuable in- 2015. UdS-sant: English–German hybrid machine
put, and the organizers of the shared task. translation system. In Proceedings of WMT.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-
Jing Zhu. 2002. Bleu: A method for automatic eval-
References uation of machine translation. In Proceedings of
Amittai Axelrod, Xiaodong He, and Jianfeng Gao. ACL.
2011. Domain Adaptation via Pseudo In-domain
Data Selection. In Proceedings of EMNLP. Rico Sennrich, Barry Haddow, and Alexandra Birch.
2016. Neural Machine Translation of Rare Words
Loı̈c Barrault, Ondřej Bojar, Marta R. Costa-jussà, with Subword Units. In Proceedings of ACL.
Christian Federmann, Mark Fishel, Yvette Gra-
ham, Barry Haddow, Matthias Huck, Philipp Koehn, Matthew Snover, Bonnie Dorr, Richard Schwartz, Lin-
Shervin Malmasi, Christof Monz, Mathias Müller, nea Micciulla, and John Makhoul. 2006. A Study of
Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Translation Edit Rate with Targeted Human Annota-
Findings of the 2019 Conference on Machine Trans- tion. In Proceedings of AMTA.
lation (WMT19). In Proceedings of WMT. Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky,
Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Ilya Sutskever, and Ruslan Salakhutdinov. 2014.
Yvette Graham, Barry Haddow, Shujian Huang, Dropout: A Simple Way to Prevent Neural Networks
Matthias Huck, Philipp Koehn, Qun Liu, Varvara from Overfitting. J. Mach. Learn. Res., 15(1):1929–
Logacheva, et al. 2017. Findings of the 2017 Con- 1958.
ference on Machine Translation (WMT17). In Pro- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
ceedings of WMT. Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
Ondrej Bojar, Rajen Chatterjee, Christian Federmann, Kaiser, and Illia Polosukhin. 2017. Attention Is All
Yvette Graham, Barry Haddow, Matthias Huck, An- You Need. In Proceedings of NIPS.
tonio Jimeno Yepes, Philipp Koehn, Varvara Lo-
Marcos Zampieri, Shervin Malmasi, Preslav Nakov,
gacheva, Christof Monz, et al. 2016. Findings of the
Ahmed Ali, Suwon Shon, James Glass, Yves Scher-
2016 Conference on Machine Translation. In Pro-
rer, Tanja Samardžić, Nikola Ljubešić, Jörg Tiede-
ceedings of WMT.
mann, Chris van der Lee, Stefan Grondelaers,
Marta R. Costa-jussà. 2017. Why Catalan-Spanish Nelleke Oostdijk, Dirk Speelman, Antal van den
Neural Machine Translation? Analysis, Comparison Bosch, Ritesh Kumar, Bornini Lahiri, and Mayank
and Combination with Standard Rule and Phrase- Jain. 2018. Language Identification and Mor-
based Technologies. In Proceedings of VarDial. phosyntactic Tagging: The Second VarDial Evalu-
ation Campaign. In Proceedings of VarDial.
Marta R. Costa-jussà, Marcos Zampieri, and Santanu
Pal. 2018. A Neural Approach to Language Variety Marcos Zampieri, Shervin Malmasi, Yves Scherrer,
Translation. In Proceedings of VarDial. Tanja Samardžić, Francis Tyers, Miikka Silfverberg,
Natalia Klyueva, Tung-Le Pan, Chu-Ren Huang,
Federico Fancellu, Andy Way, and Morgan OBrien. Radu Tudor Ionescu, Andrei Butnaru, and Tommi
2014. Standard Language Variety Conversion for Jauhiainen. 2019. A Report on the Third VarDial
Content Localisation via SMT. In Proceedings of Evaluation Campaign. In Proceedings of VarDial.
EAMT.

921

You can also read