An Effective Approach to Unsupervised Machine Translation

Mikel Artetxe, Gorka Labaka, Eneko Agirre
IXA NLP Group
University of the Basque Country (UPV/EHU)
{mikel.artetxe, gorka.labaka, e.agirre}@ehu.eus

arXiv:1902.01313v1 [cs.CL] 4 Feb 2019

Abstract

While machine translation has traditionally relied on large amounts of parallel corpora, a recent research line has managed to train both Neural Machine Translation (NMT) and Statistical Machine Translation (SMT) systems using monolingual corpora only. In this paper, we identify and address several deficiencies of existing unsupervised SMT approaches by exploiting subword information, developing a theoretically well founded unsupervised tuning method, and incorporating a joint refinement procedure. Moreover, we use our improved SMT system to initialize a dual NMT model, which is further fine-tuned through on-the-fly back-translation. Together, we obtain large improvements over the previous state-of-the-art in unsupervised machine translation. For instance, we get 22.5 BLEU points in English-to-German WMT 2014, 5.5 points more than the previous best unsupervised system, and 0.5 points more than the (supervised) shared task winner back in 2014.

1   Introduction

The recent advent of neural sequence-to-sequence modeling has resulted in significant progress in the field of machine translation, with large improvements in standard benchmarks (Vaswani et al., 2017; Edunov et al., 2018) and the first solid claims of human parity in certain settings (Hassan et al., 2018). Unfortunately, these systems rely on large amounts of parallel corpora, which are only available for a few combinations of major languages like English, German and French.

Aiming to remove this dependency on parallel data, a recent research line has managed to train unsupervised machine translation systems using monolingual corpora only. The first such systems were based on Neural Machine Translation (NMT), and combined denoising autoencoding and back-translation to train a dual model initialized with cross-lingual embeddings (Artetxe et al., 2018c; Lample et al., 2018a). Nevertheless, these early systems were later superseded by Statistical Machine Translation (SMT) based approaches, which induced an initial phrase-table through cross-lingual embedding mappings, combined it with an n-gram language model, and further improved the system through iterative back-translation (Lample et al., 2018b; Artetxe et al., 2018b).

In this paper, we develop a more principled approach to unsupervised SMT, addressing several deficiencies of previous systems by incorporating subword information, applying a theoretically well founded unsupervised tuning method, and developing a joint refinement procedure. In addition to that, we use our improved SMT approach to initialize an unsupervised NMT system, which is further improved through on-the-fly back-translation.

Our experiments on WMT 2014/2016 French-English and German-English show the effectiveness of our approach, as our proposed system outperforms the previous state-of-the-art in unsupervised machine translation by 5-7 BLEU points in all these datasets and translation directions. Our system also outperforms the supervised WMT 2014 shared task winner in English-to-German, and is around 2 BLEU points behind it in the rest of the translation directions, suggesting that unsupervised machine translation can be a usable alternative in practical settings.

The remainder of this paper is organized as follows. Section 2 first discusses related work on the topic. Section 3 then describes our principled unsupervised SMT method, while Section 4 discusses our hybridization method with NMT. We then present our experiments and results in Section 5, and Section 6 concludes the paper.
2   Related work

Early attempts to build machine translation systems with monolingual corpora go back to statistical decipherment (Ravi and Knight, 2011; Dou and Knight, 2012). These methods see the source language as ciphertext produced by a noisy channel model that first generates the original English text and then probabilistically replaces the words in it. The English generative process is modeled using an n-gram language model, and the channel model parameters are estimated using either expectation maximization or Bayesian inference. This basic approach was later improved by incorporating syntactic knowledge (Dou and Knight, 2013) and word embeddings (Dou et al., 2015). Nevertheless, these methods were only shown to work in limited settings, being most often evaluated on word-level translation.

More recently, the task received renewed interest after the concurrent work of Artetxe et al. (2018c) and Lample et al. (2018a) on unsupervised NMT which, for the first time, obtained promising results in standard machine translation benchmarks using monolingual corpora only. Both methods build upon the recent work on unsupervised cross-lingual embedding mappings, which independently train word embeddings in two languages and learn a linear transformation to map them to a shared space through self-learning (Artetxe et al., 2017, 2018a) or adversarial training (Conneau et al., 2018). The resulting cross-lingual embeddings are used to initialize a shared encoder for both languages, and the entire system is trained using a combination of denoising autoencoding, back-translation and, in the case of Lample et al. (2018a), adversarial training. This method was further improved by Yang et al. (2018), who use two language-specific encoders sharing only a subset of their parameters, and incorporate a local and a global generative adversarial network.

Nevertheless, it was later argued that the modular architecture of phrase-based SMT was more suitable for this problem, and Lample et al. (2018b) and Artetxe et al. (2018b) adapted the same principles discussed above to train an unsupervised SMT model, obtaining large improvements over the original unsupervised NMT systems. More concretely, both approaches learn cross-lingual n-gram embeddings from monolingual corpora based on the mapping method discussed earlier, and use them to induce an initial phrase-table that is combined with an n-gram language model and a distortion model. This initial system is then refined through iterative back-translation (Sennrich et al., 2016) which, in the case of Artetxe et al. (2018b), is preceded by an unsupervised tuning step. Our work identifies some deficiencies in these previous systems, and proposes a more principled approach to unsupervised SMT that incorporates subword information, uses a theoretically better founded unsupervised tuning method, and applies a joint refinement procedure, outperforming these previous systems by a substantial margin.

Very recently, some authors have tried to combine both SMT and NMT to build hybrid unsupervised machine translation systems. This idea was already explored by Lample et al. (2018b), who aided the training of their unsupervised NMT system by combining standard back-translation with synthetic parallel data generated by unsupervised SMT. Marie and Fujita (2018) go further and use synthetic parallel data from unsupervised SMT to train a conventional NMT system from scratch. The resulting NMT model is then used to augment the synthetic parallel corpus through back-translation, and a new NMT model is trained on top of it from scratch, repeating the process iteratively. Ren et al. (2019) follow a similar approach, but use SMT as posterior regularization at each iteration. As shown later in our experiments, our proposed NMT hybridization obtains substantially larger absolute gains than all these previous approaches, even if our initial SMT system is stronger and thus more challenging to improve upon.

3   Principled unsupervised SMT

Phrase-based SMT is formulated as a log-linear combination of several statistical models: a translation model, a language model, a reordering model and a word/phrase penalty. As such, building an unsupervised SMT system requires learning these different components from monolingual corpora. As it turns out, this is straightforward for most of them: the language model is learned from monolingual corpora by definition; the word and phrase penalties are parameterless; and one can drop the standard lexical reordering model at a small cost and make do with the distortion model alone, which is also parameterless. This way, the main challenge left is learning the translation model, that is, building the phrase-table.
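For reference, the decoding rule behind this log-linear formulation (standard phrase-based SMT, not specific to this work) can be written as

    \hat{e} = \arg\max_e \sum_i \lambda_i \, h_i(e, f)

where the feature functions h_i are the log-scores of the translation, language and distortion models together with the word and phrase penalties, and the weights λ_i are precisely what the unsupervised tuning procedure of Section 3.3 has to set without parallel data.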
Our proposed method starts by building an initial phrase-table through cross-lingual embedding mappings (Section 3.1). This initial phrase-table is then extended by incorporating subword information, addressing one of the main limitations of previous unsupervised SMT systems (Section 3.2). Having done that, we adjust the weights of the underlying log-linear model through a novel unsupervised tuning procedure (Section 3.3). Finally, we further improve the system by jointly refining two models in opposite directions (Section 3.4).

3.1   Initial phrase-table

So as to build our initial phrase-table, we follow Artetxe et al. (2018b) and learn n-gram embeddings for each language independently, map them to a shared space through self-learning, and use the resulting cross-lingual embeddings to extract and score phrase pairs.

More concretely, we train our n-gram embeddings using phrase2vec[1], a simple extension of skip-gram that applies the standard negative sampling loss of Mikolov et al. (2013) to bigram-context and trigram-context pairs in addition to the usual word-context pairs.[2] Having done that, we map the embeddings to a cross-lingual space using VecMap[3] with identical initialization (Artetxe et al., 2018a), which builds an initial solution by aligning identical words and iteratively improves it through self-learning. Finally, we extract translation candidates by taking the 100 nearest neighbors of each source phrase, and score them by applying the softmax function over their cosine similarities:

    φ(f̄|ē) = exp(cos(ē, f̄)/τ) / Σ_f̄′ exp(cos(ē, f̄′)/τ)

where the temperature τ is estimated using maximum likelihood estimation over a dictionary induced in the reverse direction. In addition to the phrase translation probabilities in both directions, the forward and reverse lexical weightings are also estimated by aligning each word in the target phrase with the one in the source phrase most likely generating it, and taking the product of their respective translation probabilities. The reader is referred to Artetxe et al. (2018b) for more details.

[1] https://github.com/artetxem/phrase2vec
[2] So as to keep the model size within a reasonable limit, we restrict the vocabulary to the most frequent 200,000 unigrams, 400,000 bigrams and 400,000 trigrams.
[3] https://github.com/artetxem/vecmap
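As an illustration, the scoring step above can be sketched as follows. The embedding matrices, phrase lists and temperature value are placeholders (the actual system relies on phrase2vec and VecMap as described above), so this is a simplified sketch rather than the real implementation:

    import numpy as np

    def score_translation_candidates(src_vecs, tgt_vecs, src_phrases, tgt_phrases,
                                     temperature=0.1, n_candidates=100):
        """Softmax over cosine similarities between cross-lingual phrase embeddings.

        temperature stands for tau, which the paper estimates by maximum likelihood
        over a dictionary induced in the reverse direction; a fixed value is used here.
        """
        src_norm = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
        tgt_norm = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
        table = {}
        for i, src_phrase in enumerate(src_phrases):
            cos = tgt_norm @ src_norm[i]               # cosine with every target phrase
            logits = cos / temperature
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()                       # softmax over all target phrases
            nearest = np.argsort(-cos)[:n_candidates]  # keep the 100 nearest neighbors
            table[src_phrase] = [(tgt_phrases[j], float(probs[j])) for j in nearest]
        return table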
3.2   Adding subword information

An inherent limitation of existing unsupervised SMT systems is that words are taken as atomic units, making it impossible to exploit character-level information. This is reflected in the known difficulty of these models to translate named entities, as it is very challenging to discriminate among related proper nouns based on distributional information alone, leading to translation errors like "Sunday Telegraph" → "The Times of London" (Artetxe et al., 2018b).

So as to overcome this issue, we propose to incorporate subword information once the initial alignment is done at the word/phrase level. For that purpose, we add two additional weights to the initial phrase-table that are analogous to the lexical weightings, but use a character-level similarity function instead of word translation probabilities:

    score(f̄|ē) = Π_i max(ε, max_j sim(f_i, ē_j))

where ε = 0.3 guarantees a minimum similarity score, as we want to favor translation candidates that are similar at the character level without excessively penalizing those that are not. In our case, we use a simple similarity function that normalizes the Levenshtein distance lev(·) (Levenshtein, 1966) by the length of the words len(·):

    sim(f, e) = 1 − lev(f, e) / max(len(f), len(e))

We leave the exploration of more elaborate similarity functions and, in particular, learnable metrics (McCallum et al., 2005), for future work.
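Both character-level formulas are simple enough to transcribe directly; the following sketch uses only the standard library, with a plain dynamic-programming edit distance:

    def levenshtein(a, b):
        """Edit distance with insertions, deletions and substitutions."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    def sim(f, e):
        """Levenshtein distance normalized by the length of the longer word."""
        return 1.0 - levenshtein(f, e) / max(len(f), len(e))

    def char_score(f_phrase, e_phrase, eps=0.3):
        """Character-level analogue of the lexical weighting: for every word of the
        source phrase, take its best similarity against the target words, floored at
        eps, and multiply the per-word scores."""
        score = 1.0
        for f_word in f_phrase.split():
            best = max(sim(f_word, e_word) for e_word in e_phrase.split())
            score *= max(eps, best)
        return score

The second weight, score(ē|f̄), is obtained analogously by swapping the roles of the two phrases.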
3.3   Unsupervised tuning

Having trained the underlying statistical models independently, SMT tuning aims to adjust the weights of their resulting log-linear combination to optimize some evaluation metric like BLEU in a parallel validation corpus, which is typically done through Minimum Error Rate Training or MERT (Och, 2003). Needless to say, this cannot be done in strictly unsupervised settings, but we argue that it would still be desirable to optimize some unsupervised criterion that is expected to correlate well with test performance. Unfortunately, neither of the existing unsupervised SMT systems does so: Artetxe et al. (2018b) use a heuristic that builds two initial models in opposite directions, uses one of them to generate a synthetic parallel corpus through back-translation (Sennrich et al., 2016), and applies MERT to tune the model in the reverse direction, iterating until convergence, whereas Lample et al. (2018b) do not perform any tuning at all. In what follows, we propose a more principled approach to tuning that defines an unsupervised criterion and an optimization procedure that is guaranteed to converge to a local optimum of it.

Inspired by previous work on CycleGANs (Zhu et al., 2017) and dual learning (He et al., 2016), our method takes two initial models in opposite directions, and defines an unsupervised optimization objective that combines a cyclic consistency loss and a language model loss over the two monolingual corpora E and F:

    L = Lcycle(E) + Lcycle(F) + Llm(E) + Llm(F)

The cyclic consistency loss captures the intuition that the translation of a translation should be close to the original text. So as to quantify this, we take a monolingual corpus in the source language, translate it to the target language and back to the source language, and compute its BLEU score taking the original text as reference:

    Lcycle(E) = 1 − BLEU(TF→E(TE→F(E)), E)

At the same time, the language model loss captures the intuition that machine translation should produce fluent text in the target language. For that purpose, we estimate the per-word entropy in the target language corpus using an n-gram language model, and penalize higher per-word entropies in machine translated text as follows:[4]

    Llm(E) = LP · max(0, H(TE→F(E)) − H(F))²

where the length penalty LP = LP(E) · LP(F) penalizes excessively long translations:[5]

    LP(E) = max(1, len(TF→E(TE→F(E))) / len(E))

[4] We initially tried to directly minimize the entropy of the generated text, but this worked poorly in our preliminary experiments. More concretely, the behavior of the optimization algorithm was very unstable, as it tended to excessively focus on either the cyclic consistency loss or the language model loss at the cost of the other, and we found it very difficult to find the right balance between the two factors.
[5] Without this penalization, the system tended to produce unnecessary tokens (e.g. quotes) that looked natural in their context, which served to minimize the per-word perplexity of the output. Minimizing the overall perplexity instead of the per-word perplexity did not solve the problem, as the opposite phenomenon arose (i.e. the system tended to produce excessively short translations).
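To make the combined criterion concrete, the following sketch evaluates it for one candidate setting of the weights. Here translate_e2f, translate_f2e, the BLEU scorer and the per-word entropy functions are placeholder callables (the actual implementation is built on Moses and Z-MERT), so this only spells out the definitions above:

    def length_penalty(mono, round_trip):
        """LP(E): penalize round-trip translations that are longer than the input."""
        len_in = sum(len(s.split()) for s in mono)
        len_out = sum(len(s.split()) for s in round_trip)
        return max(1.0, len_out / len_in)

    def unsupervised_loss(mono_e, mono_f, translate_e2f, translate_f2e,
                          bleu, entropy_e, entropy_f):
        """Cyclic consistency + language model loss over two monolingual corpora.

        bleu(hyp, ref) is expected in [0, 1]; entropy_e / entropy_f return the
        per-word entropy of a list of sentences under an n-gram LM of E / F.
        """
        e2f = translate_e2f(mono_e)          # E -> F
        e_cycle = translate_f2e(e2f)         # E -> F -> E
        f2e = translate_f2e(mono_f)          # F -> E
        f_cycle = translate_e2f(f2e)         # F -> E -> F

        cycle_e = 1.0 - bleu(e_cycle, mono_e)
        cycle_f = 1.0 - bleu(f_cycle, mono_f)

        # Shared length penalty LP = LP(E) * LP(F), as defined above.
        lp = length_penalty(mono_e, e_cycle) * length_penalty(mono_f, f_cycle)

        # Penalize machine-translated text whose per-word entropy under the
        # target-side LM exceeds that of genuine text in that language.
        lm_e = lp * max(0.0, entropy_f(e2f) - entropy_f(mono_f)) ** 2
        lm_f = lp * max(0.0, entropy_e(f2e) - entropy_e(mono_e)) ** 2

        return cycle_e + cycle_f + lm_e + lm_f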
So as to minimize the combined loss function, we adapt MERT to jointly optimize the parameters of the two models. In its basic form, MERT approximates the search space for each source sentence through an n-best list, and performs a form of coordinate descent by computing the optimal value for each parameter through an efficient line search method and greedily taking the step that leads to the largest gain. The process is repeated iteratively until convergence, augmenting the n-best list with the updated parameters at each iteration so as to obtain a better approximation of the full search space. Given that our optimization objective combines two translation systems TF→E(TE→F(E)), this would require generating an n-best list for TE→F(E) first and, for each entry on it, generating a new n-best list with TF→E, yielding a combined n-best list with N² entries. So as to make it more efficient, we propose an alternating optimization approach where we fix the parameters of one model and optimize the other with standard MERT. Thanks to this, we do not need to expand the search space of the fixed model, so we can make do with an n-best list of N entries alone. Having done that, we fix the parameters of the opposite model and optimize the other, iterating until convergence.
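The alternating optimization itself then reduces to a short loop. In the sketch below, mert_tune stands for one run of standard MERT over an n-best list with the other model frozen, and loss for the criterion above; both are illustrative placeholders rather than the actual Z-MERT integration:

    def alternating_tuning(weights_e2f, weights_f2e, mert_tune, loss,
                           max_rounds=10, tol=1e-3):
        """Coordinate-descent style tuning of the two log-linear models:
        fix one direction, tune the other with standard MERT, then swap."""
        prev = float("inf")
        for _ in range(max_rounds):
            # Tune E->F while F->E stays fixed (n-best list of size N, not N^2).
            weights_e2f = mert_tune(tuned="e2f", fixed_weights=weights_f2e,
                                    init_weights=weights_e2f)
            # Tune F->E while E->F stays fixed.
            weights_f2e = mert_tune(tuned="f2e", fixed_weights=weights_e2f,
                                    init_weights=weights_f2e)
            current = loss(weights_e2f, weights_f2e)
            if prev - current < tol:     # stop once the criterion stops improving
                break
            prev = current
        return weights_e2f, weights_f2e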
3.4   Joint refinement

Constrained by the lack of parallel corpora, the procedure described so far makes important simplifications that could compromise its potential performance: its phrase-table is somewhat unnatural (e.g. the translation probabilities are estimated from cross-lingual embeddings rather than actual frequency counts) and it lacks a lexical reordering model altogether. So as to overcome this issue, existing unsupervised SMT methods generate a synthetic parallel corpus through back-translation and use it to train a standard SMT system from scratch, iterating until convergence.

An obvious drawback of this approach is that the back-translated side will contain ungrammatical n-grams that will end up in the induced phrase-table. One could argue that this should be innocuous as long as the ungrammatical n-grams are in the source side, as they should never occur in real text and their corresponding entries in the phrase-table should therefore not be used. However, ungrammatical source phrases do ultimately affect the estimation of the backward translation probabilities, including those of grammatical phrases. For instance, let's say that the target phrase "dos gatos" has been aligned 10 times with "two cats" and 90 times with "two cat". While the ungrammatical phrase-table entry "two cat - dos gatos" should never be picked, the backward probability estimation of "two cats - dos gatos" is still affected by it (it would be 0.1 instead of 1.0 in this example).

We argue that, ultimately, the backward probability estimations can only be meaningful when all source phrases are grammatical (so the probabilities of all plausible translations sum to one) and, similarly, the forward probability estimations can only be meaningful when all target phrases are grammatical. Following this observation, we propose an alternative approach that jointly refines both translation directions. More concretely, we use the initial systems to build two synthetic corpora in opposite directions.[6] Having done that, we independently extract phrase pairs from each synthetic corpus, and build a phrase-table by taking their intersection. The forward probabilities are estimated in the parallel corpus with the synthetic source side, while the backward probabilities are estimated in the one with the synthetic target side. This not only guarantees that the probability estimates are meaningful as discussed previously, but it also discards the ungrammatical phrases altogether, as both the source and the target n-grams must have occurred in the original monolingual texts to be present in the resulting phrase-table. We repeat this process for a total of 3 iterations.

[6] For efficiency purposes, we restrict the size of each synthetic parallel corpus to 10 million sentence pairs.
4   NMT hybridization

While the rigid and modular design of SMT provides a very suitable framework for unsupervised machine translation, NMT has been shown to be a superior paradigm in supervised settings, outperforming SMT by a large margin in standard benchmarks. As such, the choice of SMT over NMT also imposes a hard ceiling on the potential performance of these approaches, as unsupervised SMT systems inherit the very same limitations of their supervised counterparts (e.g. the locality and sparsity problems). For that reason, we argue that SMT provides a more appropriate architecture to find an initial alignment between the languages, but NMT is ultimately a better architecture to model the translation process.

Following this observation, we propose a hybrid approach that uses unsupervised SMT to warm up a dual NMT model trained through iterative back-translation. More concretely, we first train two SMT systems in opposite directions as described in Section 3, and use them to assist the training of another two NMT systems in opposite directions. These NMT systems are trained following an iterative process where, at each iteration, we alternately update the model in each direction by performing a single pass over a synthetic parallel corpus built through back-translation (Sennrich et al., 2016).[7] In the first iteration, the synthetic parallel corpus is entirely generated by the SMT system in the opposite direction but, as training progresses and the NMT models get better, we progressively switch to a synthetic parallel corpus generated by the reverse NMT model. More concretely, iteration t uses Nsmt = N · max(0, 1 − t/a) synthetic parallel sentences from the reverse SMT system, where the parameter a controls the number of transition iterations from SMT to NMT back-translation. The remaining N − Nsmt sentences are generated by the reverse NMT model. Inspired by Edunov et al. (2018), we use greedy decoding for half of them, which produces more fluent and predictable translations, and random sampling for the other half, which produces more varied translations. In our experiments, we use N = 1,000,000 and a = 30, and perform a total of 60 such iterations. At test time, we use beam search decoding with an ensemble of all checkpoints from every 10 iterations.

[7] Note that we do not train a new model from scratch each time, but continue training the model from the previous iteration.
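The data schedule for one such pass can be sketched as follows. The back-translation functions are placeholders for the reverse SMT and NMT decoders, and t is assumed to start at 0 so that the first pass is fully SMT back-translated:

    import random

    def synthetic_batch(t, n_total, a, smt_backtranslate, nmt_greedy, nmt_sample,
                        target_monolingual):
        """Build the synthetic parallel corpus for iteration t of one direction.

        n_smt sentence pairs come from the reverse SMT system; the rest come from
        the reverse NMT model, half with greedy decoding and half with sampling.
        """
        n_smt = int(n_total * max(0.0, 1.0 - t / a))
        n_nmt = n_total - n_smt

        mono = random.sample(target_monolingual, n_total)
        pairs = smt_backtranslate(mono[:n_smt])              # (synthetic src, real tgt)
        pairs += nmt_greedy(mono[n_smt:n_smt + n_nmt // 2])  # fluent, predictable
        pairs += nmt_sample(mono[n_smt + n_nmt // 2:])       # more varied
        return pairs

    # Settings used in the paper: n_total = 1,000,000 sentences per pass,
    # a = 30 transition iterations, 60 iterations in total.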
5   Experiments and results

In order to make our experiments comparable to previous work, we use the French-English and German-English datasets from the WMT 2014 shared task. More concretely, our training data consists of the concatenation of all News Crawl monolingual corpora from 2007 to 2013, which make a total of 749 million tokens in French, 1,606 million in German, and 2,109 million in English, from which we take a random subset of 2,000 sentences for tuning (Section 3.3). Preprocessing is done using standard Moses tools, and involves punctuation normalization, tokenization with aggressive hyphen splitting, and truecasing.

Our SMT implementation is based on Moses[8], and we use the KenLM (Heafield et al., 2013) tool included in it to estimate our 5-gram language model with modified Kneser-Ney smoothing. Our unsupervised tuning implementation is based on Z-MERT (Zaidan, 2009), and we use FastAlign (Dyer et al., 2013) for word alignment within the joint refinement procedure. Finally, we use the big transformer implementation from fairseq[9] for our NMT system, training with a total batch size of 20,000 tokens across 8 GPUs with the exact same hyperparameters as Ott et al. (2018).

We use newstest2014 as our test set for French-English, and both newstest2014 and newstest2016 (from WMT 2016[10]) for German-English. Following common practice, we report tokenized BLEU scores as computed by the multi-bleu.perl script included in Moses. In addition to that, we also report detokenized BLEU scores as computed by SacreBLEU[11] (Post, 2018), which is equivalent to the official mteval-v13a.pl script.

[8] http://www.statmt.org/moses/
[9] https://github.com/pytorch/fairseq
[10] Note that it is only the test set that is from WMT 2016. All the training data comes from WMT 2014 News Crawl, so it is likely that our results could be further improved by using the more extensive monolingual corpora from WMT 2016.
[11] SacreBLEU signature: BLEU+case.mixed+lang.LANG+numrefs.1+smooth.exp+test.TEST+tok.13a+version.1.2.11, with LANG ∈ {fr-en, en-fr, de-en, en-de} and TEST ∈ {wmt14/full, wmt16}.
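For the detokenized scores, a call along the following lines to the SacreBLEU Python API applies the mteval-v13a-compatible tokenization (shown for illustration only; the exact signature reported is given in footnote [11] above, and the command-line tool is equivalent):

    import sacrebleu

    # System outputs and references as lists of detokenized strings.
    hypotheses = ["This is a sample system output."]
    references = [["This is a sample reference translation."]]

    bleu = sacrebleu.corpus_bleu(hypotheses, references, tokenize="13a")
    print(bleu.score)   # detokenized BLEU, comparable to mteval-v13a.pl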
We next present the results of our proposed system in comparison to previous work in Section 5.1. Section 5.2 then compares the obtained results to those of different supervised systems. Finally, Section 5.3 presents some translation examples from our system.

5.1   Main results

Table 1 reports the results of the proposed system in comparison to previous work. As can be seen, our full system obtains the best published results in all cases, outperforming the previous state-of-the-art by 5-7 BLEU points in all datasets and translation directions.

                                              WMT-14                        WMT-16
                                    fr-en   en-fr   de-en   en-de       de-en   en-de
  NMT      Artetxe et al. (2018c)    15.6    15.1    10.2     6.6           -       -
           Lample et al. (2018a)     14.3    15.1       -       -        13.3     9.6
           Yang et al. (2018)        15.6    17.0       -       -        14.6    10.9
           Lample et al. (2018b)     24.2    25.1       -       -        21.0    17.2
  SMT      Artetxe et al. (2018b)    25.9    26.2    17.4    14.1        23.1    18.2
           Lample et al. (2018b)     27.2    28.1       -       -        22.9    17.9
           Marie and Fujita (2018)*     -       -       -       -        20.2    15.5
           Proposed system           28.4    30.1    20.1    15.8        25.4    19.7
             detok. SacreBLEU*       27.9    27.8    19.7    14.7        24.8    19.4
  SMT+NMT  Lample et al. (2018b)     27.7    27.6       -       -        25.2    20.2
           Marie and Fujita (2018)*     -       -       -       -        26.7    20.0
           Ren et al. (2019)         28.9    29.5    20.4    17.0        26.3    21.7
           Proposed system           33.5    36.2    27.0    22.5        34.4    26.9
             detok. SacreBLEU*       33.2    33.6    26.4    21.2        33.8    26.4

Table 1: Results of the proposed method in comparison to previous work (BLEU). Overall best results are in bold, the best ones in each group are underlined.
* Detokenized BLEU equivalent to the official mteval-v13a.pl script. The rest use tokenized BLEU with multi-bleu.perl (or similar).

A substantial part of this improvement comes from our more principled unsupervised SMT approach, which outperforms all previous SMT-based systems by around 2 BLEU points. Nevertheless, it is the NMT hybridization that brings the largest gains, improving the results of these initial SMT systems by 5-9 BLEU points.
As shown in Table 2, our absolute gains are considerably larger than those of previous hybridization methods, even if our initial SMT system is substantially better and thus more difficult to improve upon. This way, our initial SMT system is about 4-5 BLEU points above that of Marie and Fujita (2018), yet our absolute gain on top of it is around 2.5 BLEU points higher. When compared to Lample et al. (2018b), we obtain an absolute gain of 5-6 BLEU points in both French-English directions while they do not get any clear improvement, and we obtain an improvement of 7-9 BLEU points in both German-English directions, in contrast with the 2.3 BLEU points they obtain.

                                                    WMT-14                        WMT-16
                                           fr-en          en-fr          de-en          en-de
  Lample et al. (2018b)    Initial SMT      27.2           28.1           22.9           17.9
                           + NMT hybrid     27.7 (+0.5)    27.6 (-0.5)    25.2 (+2.3)    20.2 (+2.3)
  Marie and Fujita (2018)  Initial SMT         -              -           20.2           15.5
                           + NMT hybrid        -              -           26.7 (+6.5)    20.0 (+4.5)
  Proposed system          Initial SMT      28.4           30.1           25.4           19.7
                           + NMT hybrid     33.5 (+5.1)    36.2 (+6.1)    34.4 (+9.0)    26.9 (+7.2)

Table 2: NMT hybridization results for different unsupervised machine translation systems (BLEU).

More generally, it is interesting that pure SMT systems perform better than pure NMT systems, yet the best results are obtained by initializing an NMT system with an SMT system. This suggests that the rigid and modular architecture of SMT might be more suitable to find an initial alignment between the languages, but the final system should be ultimately based on NMT for optimal results.

5.2   Comparison with supervised systems

So as to put our results into perspective, Table 3 reports the results of different supervised systems in the same WMT 2014 test set. More concretely, we include the best results from the shared task itself, which reflect the state-of-the-art in machine translation back in 2014; those of Vaswani et al. (2017), who introduced the now predominant transformer architecture; and those of Edunov et al. (2018), who apply back-translation at a large scale and hold the current best results in the test set.

                                                     WMT-14
                                        fr-en    en-fr    de-en    en-de
  Unsupervised  Proposed system          33.5     36.2     27.0     22.5
                  detok. SacreBLEU*      33.2     33.6     26.4     21.2
  Supervised    WMT best*                35.0     35.8     29.0     20.6†
                Vaswani et al. (2017)       -     41.0        -     28.4
                Edunov et al. (2018)        -     45.6        -     35.0

Table 3: Results of the proposed method in comparison to different supervised systems (BLEU).
* Detokenized BLEU equivalent to the official mteval-v13a.pl script. The rest use tokenized BLEU with multi-bleu.perl (or similar).
† Results in the original test set from WMT 2014, which slightly differs from the full test set used in all subsequent work. Our proposed system obtains 22.4 BLEU points (21.1 detokenized) in that same subset.

As can be seen, our unsupervised system outperforms the WMT 2014 shared task winner in English-to-German, and is around 2 BLEU points behind it in the other translation directions. This shows that unsupervised machine translation is already competitive with the state-of-the-art in supervised machine translation in 2014. While the field of machine translation has undergone great progress in the last 5 years, and the gap between our unsupervised system and the current state-of-the-art in supervised machine translation is still large as reflected by the other results, this suggests that unsupervised machine translation can be a usable alternative in practical settings.

5.3   Qualitative results

Table 4 shows some translation examples from our proposed system in comparison to those reported by Artetxe et al. (2018b). We choose the exact same sentences reported by Artetxe et al. (2018b), which were randomly taken from newstest2014, so they should be representative of the general behavior of both systems.
Example 1
  Source: D'autres révélations ont fait état de documents divulgués par Snowden selon lesquels la NSA avait intercepté des données et des communications émanant du téléphone portable de la chancelière allemande Angela Merkel et de ceux de 34 autres chefs d'État.
  Reference: Other revelations cited documents leaked by Snowden that the NSA monitored German Chancellor Angela Merkel's cellphone and those of up to 34 other world leaders.
  Artetxe et al. (2018b): Other disclosures have reported documents disclosed by Snowden suggested the NSA had intercepted communications and data from the mobile phone of German Chancellor Angela Merkel and those of 32 other heads of state.
  Proposed system: Other revelations have pointed to documents disclosed by Snowden that the NSA had intercepted data and communications emanating from German Chancellor Angela Merkel's mobile phone and those of 34 other heads of state.

Example 2
  Source: La NHTSA n'a pas pu examiner la lettre d'information aux propriétaires en raison de l'arrêt de 16 jours des activités gouvernementales, ce qui a ralenti la croissance des ventes de véhicules en octobre.
  Reference: NHTSA could not review the owner notification letter due to the 16-day government shutdown, which tempered auto sales growth in October.
  Artetxe et al. (2018b): The NHTSA could not consider the letter of information to owners because of halting 16-day government activities, which slowed the growth in vehicle sales in October.
  Proposed system: NHTSA said it could not examine the letter of information to owners because of the 16-day halt in government operations, which slowed vehicle sales growth in October.

Example 3
  Source: Le M23 est né d'une mutinerie, en avril 2012, d'anciens rebelles, essentiellement tutsi, intégrés dans l'armée en 2009 après un accord de paix.
  Reference: The M23 was born of an April 2012 mutiny by former rebels, principally Tutsis who were integrated into the army in 2009 following a peace agreement.
  Artetxe et al. (2018b): M23 began as a mutiny in April 2012, former rebels, mainly Tutsi integrated into the national army in 2009 after a peace deal.
  Proposed system: The M23 was born into a mutiny in April 2012, of former rebels, mostly Tutsi, embedded in the army in 2009 after a peace deal.

Example 4
  Source: Tunks a déclaré au Sunday Telegraph de Sydney que toute la famille était «extrêmement préoccupée» du bien-être de sa fille et voulait qu'elle rentre en Australie.
  Reference: Tunks told Sydney's Sunday Telegraph the whole family was "extremely concerned" about his daughter's welfare and wanted her back in Australia.
  Artetxe et al. (2018b): Tunks told The Times of London from Sydney that the whole family was "extremely concerned" of the welfare of her daughter and wanted it to go in Australia.
  Proposed system: Tunks told the Sunday Telegraph in Sydney that the whole family was "extremely concerned" about her daughter's well-being and wanted her to go into Australia.

Table 4: Randomly chosen translation examples from French→English newstest2014 in comparison to those reported by Artetxe et al. (2018b).

While not perfect, our proposed system produces generally fluent translations that accurately capture the meaning of the original text. Just in line with our quantitative results, this suggests that unsupervised machine translation can be a usable alternative in practical settings.

Compared to Artetxe et al. (2018b), our translations are generally more fluent, which is not surprising given that they are produced by an NMT system rather than an SMT system. In addition to that, the system of Artetxe et al. (2018b) has some adequacy issues when translating named entities and numerals (e.g. 34 → 32, Sunday Telegraph → The Times of London), which we do not observe for our proposed system in these examples.

6   Conclusions and future work

In this paper, we identify several deficiencies in previous unsupervised SMT systems, and propose a more principled approach that addresses them by incorporating subword information, using a theoretically well founded unsupervised tuning method, and developing a joint refinement procedure. In addition to that, we use our improved SMT approach to initialize a dual NMT model that is further improved through on-the-fly back-translation. Our experiments show the effectiveness of our approach, as we improve the previous state-of-the-art in unsupervised machine translation by 5-7 BLEU points in French-English and German-English WMT 2014 and 2016.

In the future, we would like to explore learnable similarity functions like the one proposed by McCallum et al. (2005) to compute the character-level scores in our initial phrase-table. In addition to that, we would like to incorporate a language modeling loss during NMT training similar to He et al. (2016). Finally, we would like to adapt our approach to more relaxed scenarios with multiple languages and/or small parallel corpora.
Acknowledgments

This research was partially supported by the Spanish MINECO (UnsupNMT TIN2017-91692-EXP, cofunded by EU FEDER), the UPV/EHU (excellence research group), and the NVIDIA GPU grant program. Mikel Artetxe enjoys a doctoral grant from the Spanish MECD.

References

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2017. Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 451–462, Vancouver, Canada. Association for Computational Linguistics.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018a. A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 789–798. Association for Computational Linguistics.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018b. Unsupervised statistical machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3632–3642, Brussels, Belgium. Association for Computational Linguistics.

Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2018c. Unsupervised neural machine translation. In Proceedings of the 6th International Conference on Learning Representations (ICLR 2018).

Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018. Word translation without parallel data. In Proceedings of the 6th International Conference on Learning Representations (ICLR 2018).

Qing Dou and Kevin Knight. 2012. Large scale decipherment for out-of-domain machine translation. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 266–275, Jeju Island, Korea. Association for Computational Linguistics.

Qing Dou and Kevin Knight. 2013. Dependency-based decipherment for resource-limited machine translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1668–1676, Seattle, Washington, USA. Association for Computational Linguistics.

Qing Dou, Ashish Vaswani, Kevin Knight, and Chris Dyer. 2015. Unifying Bayesian inference and vector space models for improved decipherment. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 836–845, Beijing, China. Association for Computational Linguistics.

Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A simple, fast, and effective reparameterization of IBM Model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 644–648, Atlanta, Georgia. Association for Computational Linguistics.

Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding back-translation at scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 489–500, Brussels, Belgium. Association for Computational Linguistics.

Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, et al. 2018. Achieving human parity on automatic Chinese to English news translation. arXiv preprint arXiv:1803.05567.

Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tie-Yan Liu, and Wei-Ying Ma. 2016. Dual learning for machine translation. In Advances in Neural Information Processing Systems 29, pages 820–828.

Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. 2013. Scalable modified Kneser-Ney language model estimation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 690–696, Sofia, Bulgaria. Association for Computational Linguistics.

Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018a. Unsupervised machine translation using monolingual corpora only. In Proceedings of the 6th International Conference on Learning Representations (ICLR 2018).

Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018b. Phrase-based & neural unsupervised machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5039–5049, Brussels, Belgium. Association for Computational Linguistics.

Vladimir I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady, volume 10, pages 707–710.

Benjamin Marie and Atsushi Fujita. 2018. Unsupervised neural machine translation initialized by unsupervised statistical machine translation. arXiv preprint arXiv:1810.12703.
Andrew McCallum, Kedar Bellare, and Fernando Pereira. 2005. A conditional random field for discriminatively-trained finite-state string edit distance. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence, pages 388–395.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 160–167, Sapporo, Japan. Association for Computational Linguistics.

Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. 2018. Scaling neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 1–9, Brussels, Belgium. Association for Computational Linguistics.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.

Sujith Ravi and Kevin Knight. 2011. Deciphering foreign language. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 12–21, Portland, Oregon, USA. Association for Computational Linguistics.

Shuo Ren, Zhirui Zhang, Shujie Liu, Ming Zhou, and Shuai Ma. 2019. Unsupervised neural machine translation with SMT as posterior regularization. arXiv preprint arXiv:1901.04112.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96, Berlin, Germany. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010.

Zhen Yang, Wei Chen, Feng Wang, and Bo Xu. 2018. Unsupervised neural machine translation with weight sharing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 46–55. Association for Computational Linguistics.

Omar Zaidan. 2009. Z-MERT: A fully configurable open source tool for minimum error rate training of machine translation systems. The Prague Bulletin of Mathematical Linguistics, 91:79–88.

Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In The IEEE International Conference on Computer Vision (ICCV).