Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18)

     Neural Machine Translation with Key-Value Memory-Augmented Attention

    Fandong Meng, Zhaopeng Tu, Yong Cheng, Haiyang Wu, Junjie Zhai, Yuekui Yang, Di Wang
                                       Tencent AI Lab
        {fandongmeng,zptu,yongcheng,gavinwu,jasonzhai,yuekuiyang,diwang}@tencent.com

                          Abstract

   Although attention-based Neural Machine Translation (NMT) has achieved
remarkable progress in recent years, it still suffers from issues of repeating
and dropping translations. To alleviate these issues, we propose a novel
key-value memory-augmented attention model for NMT, called KVMemAtt.
Specifically, we maintain a timely updated key-memory to keep track of the
attention history and a fixed value-memory to store the representation of the
source sentence throughout the whole translation process. Via nontrivial
transformations and iterative interactions between the two memories, the
decoder focuses on more appropriate source word(s) for predicting the next
target word at each decoding step, and therefore can improve the adequacy of
translations. Experimental results on Chinese⇒English and WMT17 German⇔English
translation tasks demonstrate the superiority of the proposed model.

1   Introduction

The past several years have witnessed promising progress in Neural Machine
Translation (NMT) [Cho et al., 2014; Sutskever et al., 2014], in which the
attention model plays an increasingly important role [Bahdanau et al., 2015;
Luong et al., 2015; Vaswani et al., 2017]. Conventional attention-based NMT
encodes the source sentence as a sequence of vectors with a bi-directional
recurrent neural network (RNN) [Schuster and Paliwal, 1997], and then generates
a variable-length target sentence with another RNN and an attention mechanism.
The attention mechanism plays a crucial role in NMT, as it indicates which
source word(s) the decoder should focus on in order to predict the next target
word. However, conventional attention-based NMT has no mechanism to effectively
keep track of the attention history. As a result, the decoder tends to ignore
past attention information, which leads to the issues of repeating or dropping
translations [Tu et al., 2016]. For example, conventional attention-based NMT
may repeatedly translate some source words while mistakenly ignoring other
words.

   A number of recent efforts have explored ways to alleviate the inadequate
translation problem. For example, Tu et al. [2016] employ a coverage vector as
a lexical-level indicator of whether a source word has been translated or not.
Meng et al. [2016] and Zheng et al. [2018] take the idea one step further, and
directly model translated and untranslated source contents by operating on the
attention context (i.e., the partial source content being translated) instead
of on the attention probability (i.e., the chance that the corresponding source
word is translated). Specifically, Meng et al. [2016] capture translation
status with an interactive attention mechanism augmented with an NTM [Graves et
al., 2014] memory. Zheng et al. [2018] separate the modeling of translated
(PAST) and untranslated (FUTURE) source content from the decoder states by
introducing two additional decoder adaptive layers.

   Meng et al. [2016] propose a generic framework of memory-augmented
attention, which is independent of the specific architecture of the NMT model.
However, the original mechanism uses only a single memory to both represent the
source sentence and track the attention history. Such overloaded usage of
memory representations makes training the model difficult [Rocktäschel et al.,
2017]. In contrast, Zheng et al. [2018] try to ease the difficulty of
representation learning by separating the PAST and FUTURE functions from the
decoder states, but their approach is designed specifically for one particular
NMT architecture.

   In this work, we combine the advantages of both models by leveraging the
generic memory-augmented attention framework, while easing memory training by
maintaining separate representations for the two expected functions. Partially
inspired by [Miller et al., 2016], we split the memory into two parts: a
dynamic key-memory, updated along with the update-chain of the decoder state to
keep track of the attention history, and a fixed value-memory that stores the
representation of the source sentence throughout the whole translation process.
In each decoding step, we conduct multiple rounds of memory operations layer by
layer, which gives the decoder a chance to re-attend by considering the
"intermediate" attention results obtained in earlier rounds. This structure
allows the model to leverage possibly complex transformations and interactions
between 1) the key-value memory pair in the same layer, as well as 2) the key
(and value) memories across different layers.

   Experimental results on the Chinese⇒English translation task show that an
attention model augmented with a single-layer key-value memory improves both
translation and attention performance over not only a standard attention model,
but also over the existing NTM-augmented attention model [Meng et al., 2016].
Its multi-layer counterpart further improves model performance consistently. We
also validate our model on bidirectional German⇔English translation tasks,
which demonstrates the effectiveness and generalizability of our approach.

[Figure 1: an encoder-decoder NMT architecture in which the GRU-based decoder
states s_t read from and write to a memory H_t at each decoding step, on top of
a bi-directional encoder over the source words x_j.]

2   Background

Attention-based NMT   Given a source sentence x = {x_1, x_2, ..., x_n} and a
target sentence y = {y_1, y_2, ..., y_m}, NMT models the translation
probability word by word:

                  p(y|x) = ∏_{t=1}^{m} P(y_t | y_{<t}, x)                  (1)
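
To make the factorization in Eq. 1 concrete, the following is a minimal NumPy
sketch of how an RNNSearch-style decoder (as described above) scores a target
sentence word by word with soft attention. All names and the simplified state
update are illustrative assumptions, not the authors' implementation.

```python
# Sketch (not the authors' code) of the word-by-word factorization in Eq. 1 for
# an RNNSearch-style decoder: at each step the decoder attends over the encoder
# annotations h_1..h_n, builds a context vector, and emits a distribution over
# the target vocabulary.
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sentence_log_prob(H, y, params):
    """H: encoder annotations (n, d); y: list of target word ids (length m)."""
    W_a, U_a, v_a, W_s, W_o = params            # toy attention / state / output weights
    s = np.tanh(H[-1] @ W_s)                    # initial decoder state (toy choice)
    log_p = 0.0
    for y_t in y:
        scores = np.tanh(s @ W_a + H @ U_a) @ v_a   # soft-alignment scores e_{t,j}
        a = softmax(scores)                         # attention weights a_{t,j}
        c = a @ H                                   # context vector c_t
        s = np.tanh((s + c) @ W_s)                  # simplified state update (a GRU in the paper)
        p = softmax(s @ W_o)                        # P(y_t | y_<t, x)
        log_p += np.log(p[y_t] + 1e-12)
    return log_p

rng = np.random.default_rng(0)
n, d, V = 6, 8, 20
H = rng.normal(size=(n, d))
params = (rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=d),
          rng.normal(size=(d, d)), rng.normal(size=(d, V)))
print(sentence_log_prob(H, [3, 7, 1], params))
```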

Figure 2: The architecture of KVMemAtt-based NMT. The encoder (on the left) and
the decoder (on the right) are connected by the KVMemAtt (in the middle). In
particular, we show the KVMemAtt at decoding time step t with multiple rounds
of memory access.

   In round r, the decoder first uses the "query" state q_t to address the
KEY-MEMORY K_(t)^(r−1) and generate the "intermediate" attention vector ã_t^r:

                     ã_t^r = Address(q_t, K_(t)^(r−1))                     (7)

which is subsequently used as the guidance for reading from the value-memory V
to get the "intermediate" context representation c̃_t^r of the source sentence:

                          c̃_t^r = Read(ã_t^r, V)                          (8)

which, together with the "query" state q_t, is used to compute the
"intermediate" hidden state:

                          s̃_t^r = GRU(q_t, c̃_t^r)                         (9)

Finally, we use the "intermediate" hidden state to update K_(t)^(r−1), which
records the "intermediate" attention status and finishes one round of
operations:

                    K_(t)^(r) = Update(s̃_t^r, K_(t)^(r−1))                (10)

After the last round (R) of operations, we use {s̃_t^R, c̃_t^R, ã_t^R} as the
resulting states to compute the final prediction via Eq. 1. Then the KEY-MEMORY
K_(t)^(R) is passed on to the next decoding step t+1 as K_(t+1)^(0). The
details of the Address, Read and Update operations are described in the next
section.

   As seen, the KVMemAtt mechanism updates the KEY-MEMORY along with the
update-chain of the decoder state to keep track of the attention status, and
also maintains a fixed VALUE-MEMORY to store the representation of the source
sentence. At each decoding step, KVMemAtt generates the context representation
of the source sentence via nontrivial transformations between the KEY-MEMORY
and the VALUE-MEMORY, and records the attention status via interactions between
the two memories. This structure allows the model to leverage possibly complex
transformations and interactions between the two memories, and lets the decoder
choose a more appropriate source context for the word prediction at each step.
Clearly, KVMemAtt can subsume the coverage models [Tu et al., 2016; Mi et al.,
2016] and the interactive attention model [Meng et al., 2016] as special cases,
while being more generic and powerful, as empirically verified in the
experiment section.
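
The sketch below shows how Eqs. 7-10 compose into one decoding step with R
rounds. The address, read and update_key_memory callables stand for the
operations detailed in Section 3.1 (illustrative sketches of them follow
Eqs. 12, 13 and 16); the simplified gru_like cell and all names are assumptions
of this example rather than the authors' code.

```python
# Sketch of one KVMemAtt decoding step (Eqs. 7-10). The memory-access helpers
# are passed in as callables; the GRU of Eq. 9 is replaced by a single tanh
# blend purely to keep the example self-contained.
import numpy as np

def gru_like(query, context, W):
    # stand-in for the GRU cell of Eq. 9
    return np.tanh((query + context) @ W)

def kvmematt_step(query, K, V, W, R, address, read, update_key_memory):
    """query: decoder "query" state q_t, shape (d,)
    K: key-memory K_(t)^(0), shape (n, d); V: value-memory, shape (n, d)."""
    s_tilde = c_tilde = a_tilde = None
    for _ in range(R):                              # R rounds of memory access
        a_tilde = address(query, K)                 # Eq. 7: intermediate attention
        c_tilde = read(a_tilde, V)                  # Eq. 8: intermediate context
        s_tilde = gru_like(query, c_tilde, W)       # Eq. 9: intermediate hidden state
        K = update_key_memory(s_tilde, K)           # Eq. 10: record attention status
    # {s_tilde, c_tilde, a_tilde} feed the final word prediction (Eq. 1);
    # K becomes K_(t+1)^(0) at the next decoding step.
    return s_tilde, c_tilde, a_tilde, K
```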
3.1   Memory Access Operations
In this section, we detail the memory access operations from round r−1 to
round r at decoding time step t.

Key-Memory Addressing   Formally, K_(t)^(r) ∈ R^{n×d} is the KEY-MEMORY in
round r at decoding time step t before the decoder RNN state update, where n is
the number of memory slots and d is the dimension of the vector in each slot.
The addressed attention vector is given by

                     ã_t^r = Address(q_t, K_(t)^(r−1))                    (11)

where ã_t^r ∈ R^n specifies the normalized weights assigned to the slots in
K_(t)^(r−1), with the j-th slot being k_(t,j)^(r−1). We can use content-based
addressing to determine ã_t^r as described in [Graves et al., 2014], or (quite
similarly) use the soft-alignment as in Eq. 4. In this paper we adopt the
latter for convenience, and the j-th cell of ã_t^r is

              ã_{t,j}^r = exp(ẽ_{t,j}^r) / Σ_{i=1}^n exp(ẽ_{t,i}^r)       (12)

where ẽ_{t,j}^r = (v_a^r)^T tanh(W_a^r q_t + U_a^r k_(t,j)^(r−1)).
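
A minimal sketch of the soft-alignment addressing of Eqs. 11-12, assuming the
per-round parameters W_a^r, U_a^r and v_a^r are passed in explicitly; the
closure only exists to match the address(query, K) signature used in the
decoding-step sketch above.

```python
# Sketch of Key-Memory Addressing (Eqs. 11-12): score each slot of the
# key-memory against the query with an additive (soft-alignment) scorer, then
# normalize the scores with a softmax.
import numpy as np

def make_address(W_a, U_a, v_a):
    """Builds address(query, K) implementing the soft-alignment of Eqs. 11-12."""
    def address(query, K):
        scores = np.tanh(query @ W_a + K @ U_a) @ v_a   # e~_{t,j}^r, shape (n,)
        scores = scores - scores.max()                  # numerical stability
        w = np.exp(scores)
        return w / w.sum()                              # a~_t^r (Eq. 12)
    return address
```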
Value-Memory Reading   Formally, V ∈ R^{n×d} is the VALUE-MEMORY, where n is
the number of memory slots and d is the dimension of the vector in each slot.
Before the decoder state update at time t, the output of reading at round r is
c̃_t^r, given by

                          c̃_t^r = Σ_{j=1}^n ã_{t,j}^r v_j                 (13)

where ã_t^r ∈ R^n specifies the normalized weights assigned to the slots in V.
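
Reading (Eq. 13) is simply the attention-weighted sum of the value-memory
slots; a one-line sketch under the same naming assumptions:

```python
# Sketch of Value-Memory Reading (Eq. 13): a weighted sum of the fixed
# value-memory slots, with weights produced by the addressing step.
import numpy as np

def read(a_tilde, V):
    """a_tilde: attention weights over slots, shape (n,); V: value-memory (n, d)."""
    return a_tilde @ V      # c~_t^r = sum_j a~_{t,j}^r * v_j  (Eq. 13)
```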


Key-Memory Updating   Inspired by the attentive writing operations of neural
Turing machines [Graves et al., 2014], we define two types of operation for
updating the KEY-MEMORY: FORGET and ADD.

   FORGET determines the content to be removed from memory slots. More
specifically, the vector F_t^r ∈ R^d specifies the values to be forgotten or
removed on each dimension of the memory slots, and is assigned to each slot
through the normalized weights w_t^r. Formally, the ("intermediate") memory
after the FORGET operation is given by

      k̃_(t,i)^(r) = k_(t,i)^(r−1) (1 − w_{t,i}^r · F_t^r),  i = 1, 2, ..., n   (14)

where
   • F_t^r = σ(W_F^r s̃_t^r) is parameterized with W_F^r ∈ R^{d×d}, σ stands for
     the sigmoid activation function, and F_t^r ∈ R^d;
   • w_t^r ∈ R^n specifies the normalized weights assigned to the slots in
     K_(t)^(r), and w_{t,i}^r specifies the weight associated with the i-th
     slot. w_t^r is determined by

                     w_t^r = Address(s̃_t^r, K_(t)^(r−1))                  (15)

   ADD decides how much of the current information should be written to the
memory as the added content:

        k_(t,i)^(r) = k̃_(t,i)^(r) + w_{t,i}^r · A_t^r,  i = 1, 2, ..., n       (16)

where A_t^r = σ(W_A^r s̃_t^r) is parameterized with W_A^r ∈ R^{d×d}, and
A_t^r ∈ R^d. Clearly, with the FORGET and ADD operations, KVMemAtt can
potentially modify and add to the KEY-MEMORY more than just the history of
attention.
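
A sketch of the FORGET/ADD update of Eqs. 14-16, reusing an address(state, K)
callable for the write weights of Eq. 15. The sigmoid gates and element-wise
products follow the equations above; the parameter bundling is an assumption of
this example, not the authors' code.

```python
# Sketch of Key-Memory Updating (Eqs. 14-16): erase part of each slot with a
# FORGET vector, then write an ADD vector, both weighted by write weights
# obtained by addressing the key-memory with the intermediate state s~_t^r.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_update_key_memory(W_F, W_A, address):
    """Builds update_key_memory(s_tilde, K); address(state, K) gives w_t^r (Eq. 15)."""
    def update_key_memory(s_tilde, K):
        w = address(s_tilde, K)                    # Eq. 15: write weights, shape (n,)
        F = sigmoid(W_F @ s_tilde)                 # FORGET vector F_t^r, shape (d,)
        A = sigmoid(W_A @ s_tilde)                 # ADD vector A_t^r, shape (d,)
        K = K * (1.0 - w[:, None] * F[None, :])    # Eq. 14: erase per slot
        return K + w[:, None] * A[None, :]         # Eq. 16: write per slot
    return update_key_memory
```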
3.2   Training
EOS-attention Objective: The translation process finishes when the decoder
generates eos_trg (a special token that stands for the end of the target
sentence). Therefore, accurately generating eos_trg is crucial for producing a
correct translation. Intuitively, accurate attention to the last word of the
source sentence (i.e. eos_src) helps the decoder accurately predict eos_trg.
When predicting eos_trg (i.e. y_m), the decoder should focus highly on eos_src
(i.e. x_n); that is, the attention probability a_{m,n} should be close to 1.0.
When generating the other target words y_1, y_2, ..., y_{m−1}, the decoder
should not focus on eos_src too much. Therefore we define

                       ATT_EOS = Σ_{t=1}^m Att_eos(t, n)                   (17)

where

              Att_eos(t, n) = { a_{t,n},        t < m
                              { 1.0 − a_{t,n},  t = m                      (18)

ATT_EOS is incorporated into the overall training objective (Eq. 19).
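
The EOS-attention term of Eqs. 17-18 can be read directly off the attention
matrix, as in the sketch below; the t = m branch follows the prose above
(a_{m,n} should be close to 1.0), and how the term enters the overall loss
(Eq. 19) is only summarized here.

```python
# Sketch of the EOS-attention objective (Eqs. 17-18): penalize attention mass
# on the source <eos> (last source position) for every target step except the
# last one, where the attention on <eos> should instead be close to 1.
import numpy as np

def eos_attention_objective(A):
    """A: attention matrix of shape (m, n); A[t, j] = a_{t+1, j+1}."""
    eos_col = A[:, -1]                       # attention on eos_src at every step
    att_eos = eos_col[:-1].sum()             # t < m: a_{t,n} should be small
    att_eos += 1.0 - eos_col[-1]             # t = m: a_{m,n} should be close to 1
    return att_eos

# toy usage: 3 target words, 4 source words
A = np.array([[0.50, 0.30, 0.10, 0.10],
              [0.20, 0.50, 0.20, 0.10],
              [0.05, 0.05, 0.10, 0.80]])
print(eos_attention_objective(A))   # small value = attention behaves as desired
```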
4   Experiments
4.1   Setup
We carry out experiments on Chinese⇒English (Zh⇒En) and German⇔English (De⇔En)
translation tasks. For Zh⇒En, the training data consist of 1.25M sentence pairs
extracted from LDC corpora. We choose the NIST 2002 (MT02) dataset as our
validation set, and the NIST 2003-2006 (MT03-06) datasets as our test sets. For
De⇔En, we perform our experiments on the corpus provided by WMT17, which
contains 5.6M sentence pairs. We use newstest2016 as the development set and
newstest2017 as the test set. We measure translation quality with BLEU scores
[Papineni et al., 2002].

   In training the neural networks, we limit the source and target vocabularies
to the most frequent 30K words for both sides in the Zh⇒En task, covering
approximately 97.7% and 99.3% of the two corpora, respectively. For De⇔En,
sentences are encoded using byte-pair encoding [Sennrich et al., 2016], with a
shared source-target vocabulary of about 36,000 tokens. The parameters are
updated by mini-batch SGD (batch size 80) with the learning rate controlled by
AdaDelta [Zeiler, 2012] (ε = 1e−6 and ρ = 0.95). We limit the length of
training sentences to 80 words for Zh⇒En and 100 sub-words for De⇔En. The
dimension of the word embeddings and hidden layers is 512, and the beam size in
testing is 10. We apply dropout on the output layer to avoid over-fitting
[Hinton et al., 2012], with a dropout rate of 0.5. The hyper-parameter λ in
Eq. 19 is set to 1.0. The parameters of our KVMemAtt model (i.e., the encoder
and decoder, except for those related to KVMemAtt) are initialized with the
pre-trained baseline model.
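
For reference, the training configuration reported above can be collected into
a small settings dictionary; the keys and structure below are illustrative, not
a released configuration file.

```python
# Summary of the reported training hyper-parameters (Section 4.1); the keys and
# structure are illustrative, not taken from the authors' code.
config = {
    "zh_en": {"train_pairs": 1_250_000, "vocab_size": 30_000, "max_len": 80},
    "de_en": {"train_pairs": 5_600_000, "bpe_vocab": 36_000, "max_len": 100},
    "optimizer": {"type": "AdaDelta", "epsilon": 1e-6, "rho": 0.95, "batch_size": 80},
    "model": {"embedding_dim": 512, "hidden_dim": 512, "dropout_output": 0.5},
    "decoding": {"beam_size": 10},
    "eos_attention_lambda": 1.0,   # weight of the EOS-attention objective (Eq. 19)
}
```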
4.2   Results on Chinese-English
We compare our KVMemAtt with two strong baselines: 1) RNNSearch, our in-house
implementation of attention-based NMT as described in Section 2; and
2) RNNSearch+MemAtt, our implementation of interactive attention [Meng et al.,
2016] on top of our RNNSearch. Table 1 shows the translation performance.

Model Complexity   KVMemAtt brings in little parameter increase. Compared with
RNNSearch, KVMemAtt-R1 and KVMemAtt-R2 only increase the number of parameters
by 1.95% and 7.78%, respectively. Introducing KVMemAtt slows down training to a
certain extent (by 18.39%∼39.56%): on a single Tesla P40 GPU, the speed of the
RNNSearch model is 2773 target words per second, while the speed of the
proposed models is 1676∼2263 target words per second.

 SYSTEM                  ARCHITECTURE         # Para.   Speed   MT03     MT04     MT05     MT06     AVE.
                                  Existing end-to-end NMT systems
 [Tu et al., 2016]       COVERAGE                –        –     33.69    38.05    35.01    34.83    35.40
 [Meng et al., 2016]     MEMATT                  –        –     35.69    39.24    35.74    35.10    36.44
 [Wang et al., 2016]     MEMDEC                  –        –     36.16    39.81    35.91    35.98    36.95
 [Zhang et al., 2017]    DISTORTION              –        –     37.93    40.40    36.81    35.77    37.73
                                    Our end-to-end NMT systems
 this work               RNNSearch             53.98M   2773    36.02    38.65    35.64    35.80    36.53
                           + MemAtt            55.03M   2319   36.93*+  40.06*+  36.81*+  37.39*+   37.80
                           + KVMemAtt-R1       55.03M   2263   37.56*+  40.56*+  37.83*+  37.84*+   38.45
                             + AttEosObj       55.03M   1992   37.87*+  40.64+   38.01+   38.13*+   38.66
                           + KVMemAtt-R2       58.18M   1804   38.08+   40.73+   38.09+   38.80*+   38.93
                             + AttEosObj       58.18M   1676   38.40*+  41.10*+  38.73*+  39.08*+   39.33

Table 1: Case-insensitive BLEU scores (%) on the Chinese⇒English translation
task. "Speed" denotes the training speed (words/second). "R1" and "R2" denote
KVMemAtt with one and two rounds, respectively. "+AttEosObj" stands for adding
the EOS-attention objective. "*" and "+" indicate that the results are
significantly (p < 0.01) better than RNNSearch and RNNSearch+MemAtt,
respectively.

 SYSTEM                     ARCHITECTURE                                                   DE⇒EN   EN⇒DE
                                  Existing end-to-end NMT systems
 [Rikters et al., 2017]     cGRU + dropout + named entity forcing + synthetic data         29.00   22.70
 [Escolano et al., 2017]    Char2Char + rescoring with inverse model + synthetic data      28.10   21.20
 [Sennrich et al., 2017]    cGRU + synthetic data                                          32.00   26.10
 [Tu et al., 2016]          RNNSearch + COVERAGE                                           28.70   23.60
 [Zheng et al., 2018]       RNNSearch + PAST-FUTURE-LAYERS                                 29.70   24.30
                                    Our end-to-end NMT systems
 this work                  RNNSearch                                                      29.33   23.83
                              + MemAtt                                                     30.13   24.62
                              + KVMemAtt-R2 + AttEosObj                                    30.98   25.39

Table 4: Case-sensitive BLEU scores (%) on the German⇔English translation task.
"Synthetic data" denotes an additional 10M monolingual sentences, which are not
used in this work.

We can see that, compared with the baseline system, our approach reduces the
percentage of source words that are under-translated from 13.1% to 9.7%, and
the percentage that are over-translated from 2.7% to 1.3%. The main reason is
that our KVMemAtt can keep track of the attention status and generate a more
appropriate source context for predicting the next target word at each decoding
step.

4.3   Results on German-English
We also evaluate our model on the WMT17 benchmarks for the bidirectional
German⇔English translation tasks, as listed in Table 4. Our baseline achieves
even higher BLEU scores than the state-of-the-art NMT systems of WMT17 that do
not use additional synthetic data.² Our proposed model consistently outperforms
the two strong baselines (i.e., the standard and memory-augmented attention
models) on both De⇒En and En⇒De translation tasks. These results demonstrate
that our model works well across different language pairs.

   ² [Sennrich et al., 2017] obtain better BLEU scores than our model, since
they use large-scale synthetic data (about 10M sentences). It may be unfair to
compare our model to theirs directly.

5   Related Work
Our work is inspired by key-value memory networks [Miller et al., 2016],
originally proposed for question answering, which have been successfully
applied to machine translation [Gu et al., 2017; Gehring et al., 2017; Vaswani
et al., 2017; Tu et al., 2018]. In these works, both the key-memory and the
value-memory are fixed during translation. Different from these works, we
update the KEY-MEMORY along with the update-chain of the decoder state via
attentive writing operations (e.g. FORGET and ADD).

   Our work is related to recent studies that focus on designing better
attention models [Luong et al., 2015; Cohn et al., 2016; Feng et al., 2016; Tu
et al., 2016; Mi et al., 2016; Zhang et al., 2017]. [Luong et al., 2015]
proposed a global attention model that attends to all source words and a local
attention model that looks at a subset of source words. [Cohn et al., 2016]
extended attention-based NMT to include structural biases from word-based
alignment models. [Feng et al., 2016] added implicit distortion and fertility
models to attention-based NMT. [Zhang et al., 2017] incorporated distortion
knowledge into attention-based NMT. [Tu et al., 2016; Mi et al., 2016] proposed
coverage mechanisms to encourage the decoder to consider more untranslated
source words during translation. These works are different from our KVMemAtt,
since we use a rather generic key-value memory-augmented framework with memory
access operations (i.e. address, read and update).

   Our work is also related to recent efforts on attaching a memory to neural
networks [Graves et al., 2014] and on exploiting memory [Tang et al., 2016;
Wang et al., 2016; Feng et al., 2017; Meng et al., 2016; Wang et al., 2017]
during translation. [Tang et al., 2016] exploited a phrase memory stored in
symbolic form for NMT. [Wang et al., 2016] extended the NMT decoder by
maintaining an external memory, which is accessed through reading and writing
operations. [Feng et al., 2017] proposed a neural-symbolic architecture, which
exploits a memory to provide knowledge for infrequently used words. Our work
differs in that we augment attention with a specially designed interactive
key-value memory, which allows the model to leverage possibly complex
transformations and interactions between the two memories via single or
multiple rounds of memory access in each decoding step.

6   Conclusion and Future Work
We propose an effective KVMemAtt model for NMT, which maintains a timely
updated key-memory to track the attention history and a fixed value-memory to
store the representation of the source sentence during translation. Via
nontrivial transformations and iterative interactions between the two memories,
our KVMemAtt can focus on a more appropriate source context for predicting the
next target word at each decoding step. Additionally, to further enhance the
attention, we propose a simple yet effective attention-oriented objective in a
weakly supervised manner. Our empirical study on Chinese⇒English,
German⇒English and English⇒German translation tasks shows that KVMemAtt can
significantly improve the performance of NMT.

   For future work, we will explore more rounds of memory access with more
powerful operations on the key-value memories to further enhance the attention.
Another interesting direction is to apply the proposed approach to the
Transformer [Vaswani et al., 2017], in which the attention model plays an even
more important role.


References
[Bahdanau et al., 2015] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio.
   Neural machine translation by jointly learning to align and translate. In
   ICLR, 2015.
[Cho et al., 2014] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi
   Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations
   using RNN encoder-decoder for statistical machine translation. In EMNLP,
   2014.
[Cohn et al., 2016] Trevor Cohn, Cong Duy Vu Hoang, Ekaterina Vymolova,
   Kaisheng Yao, Chris Dyer, and Gholamreza Haffari. Incorporating structural
   alignment biases into an attentional neural translation model. In NAACL,
   2016.
[Escolano et al., 2017] Carlos Escolano, Marta R. Costa-jussà, and José A. R.
   Fonollosa. The TALP-UPC neural machine translation system for
   German/Finnish-English using the inverse direction model in rescoring. In
   WMT, 2017.
[Feng et al., 2016] Shi Feng, Shujie Liu, Mu Li, and Ming Zhou. Implicit
   distortion and fertility models for attention-based encoder-decoder NMT
   model. In COLING, 2016.
[Feng et al., 2017] Yang Feng, Shiyue Zhang, Andi Zhang, Dong Wang, and Andrew
   Abel. Memory-augmented neural machine translation. arXiv, 2017.
[Gehring et al., 2017] Jonas Gehring, Michael Auli, David Grangier, Denis
   Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning.
   arXiv, 2017.
[Graves et al., 2014] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing
   machines. arXiv, 2014.
[Gu et al., 2017] Jiatao Gu, Yong Wang, Kyunghyun Cho, and Victor OK Li. Search
   engine guided non-parametric neural machine translation. arXiv, 2017.
[Hinton et al., 2012] Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky,
   Ilya Sutskever, and Ruslan R Salakhutdinov. Improving neural networks by
   preventing co-adaptation of feature detectors. arXiv, 2012.
[Liu and Sun, 2015] Yang Liu and Maosong Sun. Contrastive unsupervised word
   alignment with non-local features. In AAAI, 2015.
[Luong et al., 2015] Thang Luong, Hieu Pham, and Christopher D. Manning.
   Effective approaches to attention-based neural machine translation. In
   EMNLP, 2015.
[Meng et al., 2016] Fandong Meng, Zhengdong Lu, Hang Li, and Qun Liu.
   Interactive attention for neural machine translation. In COLING, 2016.
[Mi et al., 2016] Haitao Mi, Baskaran Sankaran, Zhiguo Wang, and Abe
   Ittycheriah. Coverage embedding models for neural machine translation. In
   EMNLP, 2016.
[Miller et al., 2016] Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein
   Karimi, Antoine Bordes, and Jason Weston. Key-value memory networks for
   directly reading documents. In EMNLP, 2016.
[Och and Ney, 2003] Franz Josef Och and Hermann Ney. A systematic comparison of
   various statistical alignment models. CL, 2003.
[Papineni et al., 2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing
   Zhu. BLEU: a method for automatic evaluation of machine translation. In ACL,
   2002.
[Rikters et al., 2017] Matīss Rikters, Chantal Amrhein, Maksym Del, and Mark
   Fishel. C-3MA: Tartu-Riga-Zurich translation systems for WMT17. In WMT,
   2017.
[Rocktäschel et al., 2017] Tim Rocktäschel, Johannes Welbl, and Sebastian
   Riedel. Frustratingly short attention spans in neural language modeling. In
   ICLR, 2017.
[Schuster and Paliwal, 1997] Mike Schuster and Kuldip K Paliwal. Bidirectional
   recurrent neural networks. TSP, 1997.
[Sennrich et al., 2016] Rico Sennrich, Barry Haddow, and Alexandra Birch.
   Neural machine translation of rare words with subword units. In ACL, 2016.
[Sennrich et al., 2017] Rico Sennrich, Alexandra Birch, Anna Currey, Ulrich
   Germann, Barry Haddow, Kenneth Heafield, Antonio Valerio Miceli Barone, and
   Philip Williams. The University of Edinburgh's neural MT systems for WMT17.
   arXiv, 2017.
[Sutskever et al., 2014] Ilya Sutskever, Oriol Vinyals, and Quoc VV Le.
   Sequence to sequence learning with neural networks. In NIPS, 2014.
[Tang et al., 2016] Yaohua Tang, Fandong Meng, Zhengdong Lu, Hang Li, and
   Philip L. H. Yu. Neural machine translation with external phrase memory.
   arXiv, 2016.
[Tu et al., 2016] Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang
   Li. Modeling coverage for neural machine translation. In ACL, 2016.
[Tu et al., 2018] Zhaopeng Tu, Yang Liu, Shuming Shi, and Tong Zhang. Learning
   to remember translation history with a continuous cache. TACL, 2018.
[Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
   Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin.
   Attention is all you need. In NIPS, 2017.
[Wang et al., 2016] Mingxuan Wang, Zhengdong Lu, Hang Li, and Qun Liu.
   Memory-enhanced decoder for neural machine translation. In EMNLP, 2016.
[Wang et al., 2017] Xing Wang, Zhengdong Lu, Zhaopeng Tu, Hang Li, Deyi Xiong,
   and Min Zhang. Neural machine translation advised by statistical machine
   translation. In AAAI, 2017.
[Zeiler, 2012] Matthew D Zeiler. AdaDelta: an adaptive learning rate method.
   arXiv, 2012.
[Zhang et al., 2017] Jinchao Zhang, Mingxuan Wang, Qun Liu, and Jie Zhou.
   Incorporating word reordering knowledge into attention-based neural machine
   translation. In ACL, 2017.
[Zheng et al., 2018] Zaixiang Zheng, Hao Zhou, Shujian Huang, Lili Mou, Dai
   Xinyu, Jiajun Chen, and Zhaopeng Tu. Modeling past and future for neural
   machine translation. TACL, 2018.
