UDS-DFKI Submission to the WMT2019 Similar Language Translation Shared Task

Page created by Harvey Glover
 
CONTINUE READING
UDS–DFKI Submission to the WMT2019 Similar Language Translation
                              Shared Task
                    Santanu Pal1,3 , Marcos Zampieri2 , Josef van Genabith1,3
        1
         Department of Language Science and Technology, Saarland University, Germany
2
  Research Institute for Information and Language Processing, University of Wolverhampton, UK
                   3
                     German Research Center for Artificial Intelligence (DFKI),
                              Saarland Informatics Campus, Germany
                              santanu.pal@uni-saarland.de

                       Abstract                                ing, development, and testing data from the fol-
                                                               lowing language pairs: Spanish - Portuguese (Ro-
     In this paper we present the UDS-DFKI sys-
                                                               mance languages), Czech - Polish (Slavic lan-
     tem submitted to the Similar Language Trans-
     lation shared task at WMT 2019. The first                 guages), and Hindi - Nepali (Indo-Aryan lan-
     edition of this shared task featured data from            guages). Participant could submit system outputs
     three pairs of similar languages: Czech and               to any of the three language pairs in any direction.
     Polish, Hindi and Nepali, and Portuguese and              The shared task attracted a good number of par-
     Spanish. Participants could choose to partic-             ticipants and the performance of all entries was
     ipate in any of these three tracks and sub-               evaluated using popular MT automatic evaluation
     mit system outputs in any translation direc-              metrics, namely BLEU (Papineni et al., 2002) and
     tion. We report the results obtained by our
                                                               TER (Snover et al., 2006).
     system in translating from Czech to Polish and
     comment on the impact of out-of-domain test                  In this paper we describe the UDS-DFKI sys-
     data in the performance of our system. UDS-               tem to the WMT 2019 Similar Language Transla-
     DFKI achieved competitive performance rank-               tion task. The system achieved competitive per-
     ing second among ten teams in Czech to Polish             formance and ranked second among ten entries
     translation.                                              in Czech to Polish translation in terms of BLEU
                                                               score.
1    Introduction
The shared tasks organized annually at WMT pro-                2   Related Work
vide important benchmarks used in the MT com-
munity. Most of these shared tasks include En-                 With the widespread use of MT technology and
glish data, which contributes to make English the              the commercial and academic success of NMT,
most resource-rich language in MT and NLP. In                  there has been more interest in training systems
the most popular WMT shared task for example,                  to translate between languages other than English
the News task, MT systems have been trained to                 (Costa-jussà, 2017). One reason for this is the
translate texts from and to English (Bojar et al.,             growing need of direct translation between pairs
2016, 2017).                                                   of similar languages, and to a lesser extent lan-
   This year, we have observed a shift on the                  guage varieties, without the use of English as a
dominant role that English on the WMT shared                   pivot language. The main challenge is to over-
tasks. The News task featured for the first time               come the limitation of available parallel data tak-
two language pairs which did not include En-                   ing advantage of the similarity between languages.
glish: German-Czech and French-German. In ad-                  Studies have been published on translating be-
dition to that, the Similar Language Translation               tween similar languages (e.g. Catalan - Spanish
was organized for the first time at WMT 2019 with              (Costa-jussà, 2017)) and language varieties such
the purpose of evaluating the performance of MT                as European and Brazilian Portuguese (Fancellu
systems on three pairs of similar languages from               et al., 2014; Costa-jussà et al., 2018). The study
three different language families: Ibero-Romance,              by Lakew et al. (2018) tackles both training MT
Indo-Aryan, and Slavic.                                        systems to translate between European–Brazilian
   The Similar Language Translation (Barrault                  Portuguese and European–Canadian French, and
et al., 2019) task provided participants with train-           two pairs of similar languages Croatian–Serbian

                                                         917
    Proceedings of the Fourth Conference on Machine Translation (WMT), Volume 2: Shared Task Papers, pages 917–921
                    Florence, Italy, August 1-2, 2019. c 2019 Association for Computational Linguistics
and Indonesian–Malay.                                         et al., 2011), which are very similar to the in-
   Processing similar languages and language va-              domain data (for our case the development set),
rieties has attracted attention not only in the MT            and (ii) transferenceALL, utilizing all the released
community but in NLP in general. This is evi-                 out-domain data sorted by Equation 1.
denced by a number of research papers published                   The transference500Ktraining set is prepared
in the last few years and the recent iterations               using in-domain (development set) bilingual
of the VarDial evaluation campaign which fea-                 cross-entropy difference for data selection as was
tured multiple shared tasks on topics such as di-             described in Axelrod et al. (2011). The differ-
alect detection, morphosyntactic tagging, cross-              ence in cross-entropy is computed based on two
lingual parsing, cross-lingual morphological anal-            language models (LM): a domain-specific LM is
ysis (Zampieri et al., 2018, 2019).                           estimated from the in-domain (containing 2050
                                                              sentences) corpus (lmi ) and the out-domain LM
3     Data                                                    (lmo ) is estimated from the eScape corpus. We
                                                              rank the eScape corpus by assigning a score to
We used the Czech–Polish dataset provided by the
                                                              each of the individual sentences which is the sum
WMT 2019 Similar Language Translation task or-
                                                              of the three cross-entropy (H) differences. For a
ganizers for our experiments. The released paral-
                                                              j th sentence pair srcj –trg j , the score is calculated
lel dataset consists of out-of-domain (or general-
                                                              based on Equation 1.
domain) data only and it differs substantially from
the released development set which is part of a
TED corpus. The parallel data includes Europarl
                                                                  score = |Hsrc (srcj , lmi ) − Hsrc (srcj , lmo )|
v9, Wiki-titles v1, and JRC-Acquis. We com-
bine all the released data and prepare a large out-                  + |Htrg (trgj , lmi ) − Htrg (trgj , lmo )| (1)
domain dataset.

3.1    Pre-processing                                         4    System Architecture - The
                                                                   Transference Model
The out-domain data is noisy for our purposes, so
we apply methods for cleaning. We performed the               Our transference model extends the original
following two steps: (i) we use the cleaning pro-             transformer model to multi-encoder based trans-
cess described in Pal et al. (2015), and (ii) we exe-         former architecture. The transformer architec-
cute the Moses (Koehn et al., 2007) corpus clean-             ture (Vaswani et al., 2017) is built solely upon such
ing scripts with minimum and maximum number                   attention mechanisms completely replacing recur-
of tokens set to 1 and 100, respectively. After               rence and convolutions. The transformer uses po-
cleaning, we perform punctuation normalization,               sitional encoding to encode the input and output
and then we use the Moses tokenizer to tokenize               sequences, and computes both self- and cross-
the out-domain corpus with ‘no-escape’ option.                attention through so-called multi-head attentions,
Finally, we apply true-casing.                                which are facilitated by parallelization. We use
   The cleaned version of the released data, i.e.,            multi-head attention to jointly attend to informa-
the General corpus containing 1,394,319 sen-                  tion at different positions from different represen-
tences, is sorted based on the score in Equation              tation subspaces.
1. Thereafter, We split the entire data (1,394,319)              The first encoder (enc1 ) of our model encodes
into two sets; we use the first 1,000 for valida-             word form information of the source (fw ), and
tion and the remaining as training data. The re-              a second sub-encoder (enc2 ) to encode sub-word
leased development set (Dev) is used as test data             (byte-pair-encoding) information of the source
for our experiment. It should be noted noted that,            (fs ). Additionally, a second encoder (encsrc→mt )
we exclude 1,000 sentences from the General cor-              which takes the encoded representation from the
pus which are scored as top (i.e., more in-domain             enc1 , combines this with the self-attention-based
like) during the data selection process.                      encoding of fs (enc2 ), and prepares a represen-
   We prepare two parallel training sets from                 tation for the decoder (dece ) via cross-attention.
the aforementioned training data: (i) transfer-               Our second encoder (enc1→2 ) can be viewed as
ence500K(presented next), collected 500,000 par-              a transformer based NMT’s decoding block, how-
allel data through data selection method (Axelrod             ever, without masking. The intuition behind our

                                                        918
architecture is to generate better representations                  5     Experiments
via both self- and cross-attention and to further fa-
cilitate the learning capacity of the feed-forward                  We explore our transference model –a two-
layer in the decoder block. In our transference                     encoder based transformer architecture, in CS-PL
model, one self-attended encoder for fw , fw =                      similar language translation task.
(w1 , w2 , . . . , wk ), returns a sequence of contin-              5.1    Experiment Setup
uous representations, enc2 , and a second self-
attended sub-encoder for fs , fs = (s1 , s2 , . . . , sl ),         For transferenceALL, we initially train on the
returns another sequence of continuous represen-                    complete out-of-domain dataset (General). The
tations, enc2 . Self-attention at this point pro-                   General data is sorted based on their in-domain
vides the advantage of aggregating information                      similarities as described in Equation 1.
from all of the words, including fw and fs , and                       transferenceALLmodels are then fine-tuned to-
successively generates a new representation per                     wards the 500K (in-domain-like) data. Finally,
word informed by the entire fw and fs context.                      we perform checkpoint averaging using the 8 best
The internal enc2 representation performs cross-                    checkpoints. We report the results on the provided
attention over enc1 and prepares a final repre-                     development set, which we use as a test set before
sentation (enc1→2 ) for the decoder (dece ). The                    the submission. Additionally we also report the
decoder generates the e output in sequence, e =                     official test set result.
(e1 , e2 , . . . , en ), one word at a time from left to               To handle out-of-vocabulary words and to re-
right by attending to previously generated words                    duce the vocabulary size, instead of considering
as well as the final representations (enc1→2 ) gen-                 words, we consider subword units (Sennrich et al.,
erated by the encoder.                                              2016) by using byte-pair encoding (BPE). In the
   We use the scale-dot attention mechanism (like                   preprocessing step, instead of learning an explicit
Vaswani et al. (2017)) for both self- and cross-                    mapping between BPEs in the Czech (CS) and
attention, as defined in Equation 2, where Q, K                     Polish (PL), we define BPE tokens by jointly pro-
and V are query, key and value, respectively, and                   cessing all parallel data. Thus, CS and PL derive
dk is the dimension of K.                                           a single BPE vocabulary. Since CS and PL be-
                                                                    long to the similar language, they naturally share
                                QK T                                a good fraction of BPE tokens, which reduces the
 attention(Q, K, V ) = sof tmax( √ )V (2)                           vocabulary size.
                                  dk
                                                                       We pass word level information on the first en-
The multi-head attention mechanism in the trans-                    coder and the BPE information to the second one.
former network maps the Q, K, and V matrices by                     On the decoder side of the transference model we
using different linear projections. Then h paral-                   pass only BPE text.
lel heads are employed to focus on different parts                     We evaluate our approach with development
in V. The ith multi-head attention is denoted by                    data which is used as test case before submission.
headi in Equation 3. headi is linearly learned by                   We use BLEU (Papineni et al., 2002) and TER
three projection parameter matrices: WiQ , WiK ∈                    (Snover et al., 2006).
Rdmodel ×dk , WiV ∈ Rdmodel ×dv ; where dk = dv =
dmodel /h, and dmodel is the number of hidden units                 5.2    Hyper-parameter Setup
of our network.                                                     We follow a similar hyper-parameter setup for all
                                                                    reported systems. All encoders, and the decoder,
  headi = attention(QWiQ , KWiK , V WiV ) (3)                       are composed of a stack of Nf w = Nf s = Nes =
                                                                    6 identical layers followed by layer normalization.
Finally, all the vectors produced by parallel heads
                                                                    Each layer again consists of two sub-layers and a
are linearly projected using concatenation and
                                                                    residual connection (He et al., 2016) around each
form a single vector, called a multi-head attention
                                                                    of the two sub-layers. We apply dropout (Srivas-
(Matt ) (cf. Equation 4). Here the dimension of the
                                                                    tava et al., 2014) to the output of each sub-layer,
learned weight matrix W O is Rdmodel ×dmodel .
                                                                    before it is added to the sub-layer input and nor-
                                                                    malized. Furthermore, dropout is applied to the
   Matt (Q, K, V ) = Concatni:1 (headi )W O            (4)
                                                                    sums of the word embeddings and the correspond-
                                                                    ing positional encodings in both encoders as well

                                                              919
as the decoder stacks.                                        model. Similar improvement is also observed in
   We set all dropout values in the network to                terms of TER (-16.9 absolute). It is to be noted that
0.1. During training, we employ label smoothing               our generic model is trained solely on the clean
with value ls = 0.1. The output dimension pro-               version of training data.
duced by all sub-layers and embedding layers is                   Before submission, we performed punctuation
dmodel = 512. Each encoder and decoder layer                  normalization, unicode normalization, and detok-
contains a fully connected feed-forward network               enization for the run.
(F F N ) having dimensionality of dmodel = 512                    In Table 2 we present the ranking of the compe-
for the input and output and dimensionality of                tition provided by the shared task organizers. Ten
df f = 2048 for the inner layers. For the scaled              entries were submitted by five teams and are or-
dot-product attention, the input consists of queries          dered by BLEU score. TER is reported for all
and keys of dimension dk , and values of dimen-               submissions which achieved BLEU score greater
sion dv . As multi-head attention parameters, we              than 5.0. The type column specifies the type of
employ h = 8 for parallel attention layers, or                system, whether it is a Primary (P) or Constrastive
heads. For each of these we use a dimensional-                (C) entry.
ity of dk = dv = dmodel /h = 64. For optimiza-
tion, we use the Adam optimizer (Kingma and Ba,                        Team            Type    BLEU       TER
2015) with β1 = 0.9, β2 = 0.98 and  = 10−9 .                          UPC-TALP         P       7.9       85.9
   The learning rate is varied throughout the train-                   UDS-DFKI         P       7.6       87.0
ing process, and increasing for the first training                     Uhelsinki        P       7.1       87.4
steps warmupsteps = 8000 and afterwards de-                            Uhelsinki        C       7.0       87.3
creasing as described in (Vaswani et al., 2017). All                   Incomslav        C       5.9       88.4
remaining hyper-parameters are set analogously to                      Uhelsinki        C       5.9       88.4
those of the transformer’s base model. At training                     Incomslav        P       3.2        -
time, the batch size is set to 25K tokens, with a                      Incomslav        C       3.1        -
maximum sentence length of 256 subwords, and                           UBC-NLP          C       2.3        -
a vocabulary size of 28K. After each epoch, the                        UBC-NLP          P       2.2        -
training data is shuffled. After finishing training,
                                                                  Table 2: Rank table for Czech to Polish Translation
we save the 5 best checkpoints which are written
at each epoch. Finally, we use a single model ob-
                                                              Our system was ranked second in the competition
tained by averaging the last 5 checkpoints. During
                                                              only 0.3 BLEU points behind the winning team
decoding, we perform beam search with a beam
                                                              UPC-TALP. The relative low BLEU and high TER
size of 4. We use shared embeddings between CS
                                                              scores obtained by all teams are due to out-of-
and PL in all our experiments.
                                                              domain data provided in the competition which
6   Results                                                   made the task equally challenging to all partici-
                                                              pants.
We present the results obtained by our system in
Table 1.                                                      7     Conclusion

    tested on      model         BLEU      TER                This paper presented the UDS-DFKI system sub-
    Dev set       Generic         12.2     75.8               mitted to the Similar Language Translation shared
    Dev set     Fine-tuned*       25.1     58.9               task at WMT 2019. We presented the results ob-
    Test set      Generic          7.1     89.3               tained by our system in translating from Czech
    Test set    Fine-Tuned*        7.6     87.0               to Polish. Our system achieved competitive per-
                                                              formance ranking second among ten teams in the
Table 1: Results for CS–PL Translation; * averaging 8         competition in terms of BLEU score. The fact that
best checkpoints.                                             out-of-domain data was provided by the organiz-
                                                              ers resulted in a challenging but interesting sce-
Our fine-tuned system on development set pro-                 nario for all participants.
vides significant performance improvement over                   In future work, we would like to investigate how
the generic model.      We found +12.9 abso-                  effective is the proposed hypothesis (i.e., word-
lute BLEU points improvement over the generic                 BPE level information) in similar language trans-

                                                        920
lation. Furthermore, we would like to explore                  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian
the similarity between these two languages (and                  Sun. 2016. Deep Residual Learning for Image
                                                                 Recognition. Proceedings of CVPR.
the other two language pairs in the competition)
in more detail by training models that can best                Diederik P Kingma and Jimmy Lei Ba. 2015. Adam:
capture morphological differences between them.                  A Method for Stochastic Optimization. Proceedings
During such competitions, this is not always pos-                of ICLR.
sible due to time constraints.                                 Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris
                                                                 Callison-Burch, Marcello Federico, Nicola Bertoldi,
Acknowledgments                                                  Brooke Cowan, Wade Shen, Christine Moran,
                                                                 Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra
This research was funded in part by the Ger-                     Constantin, and Evan Herbst. 2007. Moses: Open
man research foundation (DFG) under grant num-                   Source Toolkit for Statistical Machine Translation.
ber GE 2819/2-1 (project MMPE) and the Ger-                      In Proceedings of ACL.
man Federal Ministry of Education and Research                 Surafel M Lakew, Aliia Erofeeva, and Marcello Fed-
(BMBF) under funding code 01IW17001 (project                     erico. 2018. Neural Machine Translation into Lan-
Deeplee). The responsibility for this publication                guage Varieties. arXiv preprint arXiv:1811.01064.
lies with the authors. We would like to thank the              Santanu Pal, Sudip Naskar, and Josef van Genabith.
anonymous WMT reviewers for their valuable in-                   2015. UdS-sant: English–German hybrid machine
put, and the organizers of the shared task.                      translation system. In Proceedings of WMT.
                                                               Kishore Papineni, Salim Roukos, Todd Ward, and Wei-
                                                                 Jing Zhu. 2002. Bleu: A method for automatic eval-
References                                                       uation of machine translation. In Proceedings of
Amittai Axelrod, Xiaodong He, and Jianfeng Gao.                  ACL.
 2011. Domain Adaptation via Pseudo In-domain
 Data Selection. In Proceedings of EMNLP.                      Rico Sennrich, Barry Haddow, and Alexandra Birch.
                                                                 2016. Neural Machine Translation of Rare Words
Loı̈c Barrault, Ondřej Bojar, Marta R. Costa-jussà,            with Subword Units. In Proceedings of ACL.
  Christian Federmann, Mark Fishel, Yvette Gra-
  ham, Barry Haddow, Matthias Huck, Philipp Koehn,             Matthew Snover, Bonnie Dorr, Richard Schwartz, Lin-
  Shervin Malmasi, Christof Monz, Mathias Müller,              nea Micciulla, and John Makhoul. 2006. A Study of
  Santanu Pal, Matt Post, and Marcos Zampieri. 2019.            Translation Edit Rate with Targeted Human Annota-
  Findings of the 2019 Conference on Machine Trans-             tion. In Proceedings of AMTA.
  lation (WMT19). In Proceedings of WMT.                       Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky,
Ondřej Bojar, Rajen Chatterjee, Christian Federmann,            Ilya Sutskever, and Ruslan Salakhutdinov. 2014.
  Yvette Graham, Barry Haddow, Shujian Huang,                    Dropout: A Simple Way to Prevent Neural Networks
  Matthias Huck, Philipp Koehn, Qun Liu, Varvara                 from Overfitting. J. Mach. Learn. Res., 15(1):1929–
  Logacheva, et al. 2017. Findings of the 2017 Con-              1958.
  ference on Machine Translation (WMT17). In Pro-              Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
  ceedings of WMT.                                               Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
Ondrej Bojar, Rajen Chatterjee, Christian Federmann,             Kaiser, and Illia Polosukhin. 2017. Attention Is All
  Yvette Graham, Barry Haddow, Matthias Huck, An-                You Need. In Proceedings of NIPS.
  tonio Jimeno Yepes, Philipp Koehn, Varvara Lo-
                                                               Marcos Zampieri, Shervin Malmasi, Preslav Nakov,
  gacheva, Christof Monz, et al. 2016. Findings of the
                                                                Ahmed Ali, Suwon Shon, James Glass, Yves Scher-
  2016 Conference on Machine Translation. In Pro-
                                                                rer, Tanja Samardžić, Nikola Ljubešić, Jörg Tiede-
  ceedings of WMT.
                                                                mann, Chris van der Lee, Stefan Grondelaers,
Marta R. Costa-jussà. 2017. Why Catalan-Spanish                Nelleke Oostdijk, Dirk Speelman, Antal van den
 Neural Machine Translation? Analysis, Comparison               Bosch, Ritesh Kumar, Bornini Lahiri, and Mayank
 and Combination with Standard Rule and Phrase-                 Jain. 2018.    Language Identification and Mor-
 based Technologies. In Proceedings of VarDial.                 phosyntactic Tagging: The Second VarDial Evalu-
                                                                ation Campaign. In Proceedings of VarDial.
Marta R. Costa-jussà, Marcos Zampieri, and Santanu
 Pal. 2018. A Neural Approach to Language Variety              Marcos Zampieri, Shervin Malmasi, Yves Scherrer,
 Translation. In Proceedings of VarDial.                        Tanja Samardžić, Francis Tyers, Miikka Silfverberg,
                                                                Natalia Klyueva, Tung-Le Pan, Chu-Ren Huang,
Federico Fancellu, Andy Way, and Morgan OBrien.                 Radu Tudor Ionescu, Andrei Butnaru, and Tommi
  2014. Standard Language Variety Conversion for                Jauhiainen. 2019. A Report on the Third VarDial
  Content Localisation via SMT. In Proceedings of               Evaluation Campaign. In Proceedings of VarDial.
  EAMT.

                                                         921
You can also read