Towards Robust Neural Machine Translation
        Yong Cheng*, Zhaopeng Tu*, Fandong Meng*, Junjie Zhai* and Yang Liu†
                              *Tencent AI Lab, China
        †State Key Laboratory of Intelligent Technology and Systems,
        Beijing National Research Center for Information Science and Technology,
        Department of Computer Science and Technology, Tsinghua University, Beijing, China
        Beijing Advanced Innovation Center for Language Resources
                          chengyong3001@gmail.com
                {zptu, fandongmeng, jasonzhai}@tencent.com
                       liuyang2011@tsinghua.edu.cn

                       Abstract

    Small perturbations in the input can severely distort intermediate representations
    and thus impact translation quality of neural machine translation (NMT) models. In
    this paper, we propose to improve the robustness of NMT models with adversarial
    stability training. The basic idea is to make both the encoder and decoder in NMT
    models robust against input perturbations by enabling them to behave similarly for
    the original input and its perturbed counterpart. Experimental results on
    Chinese-English, English-German and English-French translation tasks show that our
    approaches can not only achieve significant improvements over strong NMT systems
    but also improve the robustness of NMT models.

    Input     tamen bupa kunnan zuochu weiqi AI.
    Output    They are not afraid of difficulties to make Go AI.
    Input     tamen buwei kunnan zuochu weiqi AI.
    Output    They are not afraid to make Go AI.

Table 1: The non-robustness problem of neural machine translation. Replacing a Chinese
word with its synonym (i.e., "bupa" → "buwei") leads to significant erroneous changes in
the English translation. Both "bupa" and "buwei" can be translated to the English phrase
"be not afraid of."

1   Introduction

Neural machine translation (NMT) models have advanced the state of the art by building a
single neural network that can better learn representations (Cho et al., 2014; Sutskever
et al., 2014). The neural network consists of two components: an encoder network that
encodes the input sentence into a sequence of distributed representations, based on which
a decoder network generates the translation with an attention model (Bahdanau et al.,
2015; Luong et al., 2015). A variety of NMT models derived from this encoder-decoder
framework have further improved the performance of machine translation systems (Gehring
et al., 2017; Vaswani et al., 2017). NMT is capable of generalizing better to unseen text
by exploiting word similarities in embeddings and capturing long-distance reordering by
conditioning on larger contexts in a continuous way.
   However, studies reveal that very small changes to the input can fool state-of-the-art
neural networks with high probability (Goodfellow et al., 2015; Szegedy et al., 2014).
Belinkov and Bisk (2018) confirm this finding by pointing out that NMT models are very
brittle and easily falter when presented with noisy input. In NMT, due to the introduction
of RNN and attention, each contextual word can influence the model prediction in a global
context, which is analogous to the "butterfly effect." As shown in Table 1, although we
only replace a source word with its synonym, the generated translation has been completely
distorted. We investigate severe variations of translations caused by small input
perturbations by replacing one word in each sentence of a test set with its synonym. We
observe that 69.74% of translations have changed and the BLEU score between the
translations of the original inputs and the translations of the perturbed inputs is only
79.01, suggesting that NMT models are very sensitive to small perturbations in the input.
The vulnerability and instability of NMT models limit their applicability to a broader
range of tasks, which require robust performance on noisy inputs.

For example, simultaneous translation systems use automatic speech recognition (ASR) to
transcribe input speech into a sequence of hypothesized words, which are subsequently fed
to a translation system. In this pipeline, ASR errors are presented as sentences with
noisy perturbations (the same pronunciation but incorrect words), which is a significant
challenge for current NMT models. Moreover, instability makes NMT models sensitive to
misspellings and typos in text translation.
   In this paper, we address this challenge with adversarial stability training for
neural machine translation. [...] system (Gehring et al., 2017). Related experimental
analyses validate that our training approach can improve the robustness of NMT models.

2   Background

NMT is an end-to-end framework which directly optimizes the translation probability of a
target sentence y = y_1, ..., y_N given its corresponding source sentence x = x_1, ..., x_M:

    P(y|x; θ) = ∏_{n=1}^{N} P(y_n | y_{<n}, x; θ)                                  (1)
[Figure 1: The architecture of NMT with adversarial stability training. The dark solid
arrow lines represent the forward-pass information flow for the input sentence x, while
the red dashed arrow lines are for the noisy input sentence x', which is transformed from
x by adding small perturbations. The encoder produces representations Hx and Hx', which
are fed to the decoder and to the discriminator; the three losses are Linv(x, x'),
Ltrue(x, y) and Lnoisy(x', y).]

3   Approach

3.1   Overview

The goal of this work is to propose a general approach to make NMT models more robust to
input perturbations. Our basic idea is to maintain the consistency of behaviors through
the NMT model for the source sentence x and its perturbed counterpart x'. As
aforementioned, the NMT model contains two procedures for projecting a source sentence x
to its target sentence y: the encoder is responsible for encoding x as a sequence of
representations Hx, while the decoder outputs y with Hx as input. We aim at learning a
perturbation-invariant encoder and decoder.
   Figure 1 illustrates the architecture of our approach. Given a source sentence x, we
construct a set of perturbed sentences N(x), in which each sentence x' is constructed by
adding small perturbations to x. We require that x' is a subtle variation of x and that
they have similar semantics. Given the input pair (x, x'), we have two expectations:
(1) the encoded representation Hx' should be close to Hx; and (2) given Hx', the decoder
should be able to generate the robust output y. To this end, we introduce two additional
objectives to improve the robustness of the encoder and decoder:

   • Linv(x, x') to encourage the encoder to output similar intermediate representations
     Hx and Hx' for x and x', to achieve an invariant encoder, which benefits outputting
     the same translations. We cast this objective in the adversarial learning framework.

   • Lnoisy(x', y) to guide the decoder to generate output y given the noisy input x',
     which is modeled as −log P(y|x'). It can also be defined as the KL divergence
     between P(y|x) and P(y|x'), which indicates using P(y|x) to teach P(y|x').

As seen, the two introduced objectives aim to improve the robustness of the NMT model,
which should be free of high variance in target outputs caused by small perturbations in
inputs. It is also natural to introduce the original training objective L(x, y) on x and
y, which can guarantee good translation performance while keeping the stability of the
NMT model.
   Formally, given a training corpus S, the adversarial stability training objective is

   J(θ) = Σ_{<x,y>∈S} [ Ltrue(x, y; θenc, θdec)
                        + α Σ_{x'∈N(x)} Linv(x, x'; θenc, θdis)
                        + β Σ_{x'∈N(x)} Lnoisy(x', y; θenc, θdec) ]                (4)

where Ltrue(x, y) and Lnoisy(x', y) are calculated using Equation 3, and Linv(x, x') is
the adversarial loss to be described in Section 3.3. α and β control the balance between
the original translation task and the stability of the NMT model. θ = {θenc, θdec, θdis}
are the trainable parameters of the encoder, decoder, and the newly introduced
discriminator used in adversarial learning. As seen, the parameters of the encoder θenc
and decoder θdec are trained to minimize both the translation loss Ltrue(x, y) and the
stability losses Lnoisy(x', y) and Linv(x, x').
   Since Lnoisy(x', y) evaluates the translation loss on the perturbed neighbour x' and
its corresponding target sentence y, it means that we augment the training data by adding
perturbed neighbours, which can potentially improve the translation performance. In this
way, our approach not only makes the output of NMT models more robust, but also improves
the performance on the original translation task.
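As a rough illustration of how the three terms in Eq. (4) combine for a single training
pair, here is a schematic Python sketch. The callables encode, decode_nll and discriminate
are hypothetical stand-ins for the encoder, the decoder's negative log-likelihood, and the
discriminator of Section 3.3; this is a sketch, not the authors' implementation.

```python
import math

def stability_loss(x, x_prime, y, encode, decode_nll, discriminate,
                   alpha=1.0, beta=1.0):
    """Schematic per-example version of the objective in Eq. (4):
    L_true + alpha * L_inv + beta * L_noisy.

    encode(sentence)  -> hidden representation H of the sentence
    decode_nll(H, y)  -> -log P(y | H), the translation loss
    discriminate(H)   -> probability in (0, 1) that H encodes the original input
    (all three are hypothetical stand-ins, not the paper's actual interfaces)
    """
    h_x, h_xp = encode(x), encode(x_prime)

    l_true = decode_nll(h_x, y)             # L_true(x, y)   = -log P(y | x)
    l_noisy = decode_nll(h_xp, y)           # L_noisy(x', y) = -log P(y | x')
    # L_inv(x, x') from Eq. (7); during optimization the encoder is updated
    # against this term via gradient reversal (see Section 3.4).
    l_inv = -math.log(discriminate(h_x)) - math.log(1.0 - discriminate(h_xp))

    return l_true + alpha * l_inv + beta * l_noisy
```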

In the following sections, we will first describe how to construct perturbed inputs with
different strategies to fulfill different goals (Section 3.2), followed by the proposed
adversarial learning mechanism for the perturbation-invariant encoder (Section 3.3). We
conclude this section with the training strategy (Section 3.4).

3.2   Constructing Perturbed Inputs

At each training step, we need to generate a perturbed neighbour set N(x) for each source
sentence x for adversarial stability training. In this paper, we propose two strategies to
construct the perturbed inputs at different levels of representation.
   The first approach generates perturbed neighbours at the lexical level. Given an input
sentence x, we randomly sample some word positions to be modified. Then we replace the
words at these positions with other words in the vocabulary according to the following
distribution:

   P(x|x_i) = exp{cos(E[x_i], E[x])} / Σ_{x∈Vx\x_i} exp{cos(E[x_i], E[x])}         (5)

where E[x_i] is the word embedding for word x_i, Vx\x_i is the source vocabulary excluding
the word x_i, and cos(E[x_i], E[x]) measures the similarity between words x_i and x. Thus
we can change a word to another word with similar semantics.
   One potential problem of the above strategy is that it is hard to enumerate all
possible positions and perturbation types when generating perturbed neighbours. Therefore,
we propose a more general approach that modifies the sentence at the feature level. Given
a sentence, we can obtain the word embedding for each word, and we add Gaussian noise to a
word embedding to simulate possible types of perturbations. That is,

   E[x'_i] = E[x_i] + ε,    ε ∼ N(0, σ²I)                                          (6)

where the vector ε is sampled from a Gaussian distribution with variance σ², and σ is a
hyper-parameter. We simply add Gaussian noise to all word embeddings in x.
   The proposed scheme is a general framework in which one can freely define the
strategies used to construct perturbed inputs; we just present two possible examples here.
The first strategy is potentially useful when the training data contains noisy words,
while the latter is a more general strategy to improve the robustness of common NMT
models. In practice, one can design specific strategies for particular tasks. For example,
we can replace correct words with their homonyms (same pronunciation but different
meanings) to improve NMT models for simultaneous translation systems.
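A small NumPy sketch of the two perturbation strategies, under the assumption that a
sentence is given as a list of vocabulary indices and that emb is the source embedding
matrix; the function names and defaults are illustrative rather than the authors' code
(the experiments later replace max(0.2|x|, 1) words and use σ = 0.01).

```python
import numpy as np

def lexical_perturbation(sentence_ids, emb, n_replace, rng=None):
    """Lexical-level perturbation (Eq. 5): sample positions and replace each chosen
    word with a word drawn from a softmax over cosine similarities of the embeddings,
    excluding the word itself (V_x \ x_i)."""
    rng = rng if rng is not None else np.random.default_rng()
    ids = list(sentence_ids)
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    positions = rng.choice(len(ids), size=min(n_replace, len(ids)), replace=False)
    for i in positions:
        scores = unit @ unit[ids[i]]      # cos(E[x_i], E[x]) for every vocabulary word
        scores[ids[i]] = -np.inf          # exclude x_i itself
        probs = np.exp(scores - scores[np.isfinite(scores)].max())
        probs /= probs.sum()
        ids[i] = int(rng.choice(len(probs), p=probs))
    return ids

def feature_perturbation(emb_seq, sigma=0.01, rng=None):
    """Feature-level perturbation (Eq. 6): add Gaussian noise N(0, sigma^2 I)
    to every word embedding of the sentence (emb_seq has shape (len, dim))."""
    rng = rng if rng is not None else np.random.default_rng()
    return emb_seq + rng.normal(0.0, sigma, size=emb_seq.shape)
```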
3.3   Adversarial Learning for the Perturbation-invariant Encoder

The goal of the perturbation-invariant encoder is to make the representations produced by
the encoder indistinguishable when it is fed a correct sentence x and its perturbed
counterpart x', which is directly beneficial to the output robustness of the decoder. We
cast the problem in the adversarial learning framework (Goodfellow et al., 2014). The
encoder serves as the generator G, which defines the policy that generates a sequence of
hidden representations Hx given an input sentence x. We introduce an additional
discriminator D to distinguish the representation of the perturbed input, Hx', from that
of the original input, Hx. The goal of the generator G (i.e., the encoder) is to produce
similar representations for x and x' that can fool the discriminator, while the
discriminator D tries to correctly distinguish the two representations.
   Formally, the adversarial learning objective is

   Linv(x, x'; θenc, θdis) = E_{x∼S}[−log D(G(x))]
                             + E_{x'∼N(x)}[−log(1 − D(G(x')))]                     (7)

The discriminator outputs a classification score given an input representation, and tries
to push D(G(x)) towards 1 and D(G(x')) towards 0. The objective encourages the encoder to
output similar representations for x and x', so that the discriminator fails to
distinguish them.
   The training procedure can be regarded as a min-max two-player game. The encoder
parameters θenc are trained to maximize the loss function to fool the discriminator, while
the discriminator parameters θdis are optimized to minimize this loss to improve its
discriminating ability. For efficiency, we update both the encoder and the discriminator
simultaneously at each iteration, rather than using the periodic training strategy that is
commonly used in adversarial learning. Lamb et al. (2016) propose a similar idea, using
Professor Forcing to make the behaviors of RNNs indistinguishable between training and
sampling.
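The paper adopts a CNN text classifier (Kim, 2014) as the discriminator (see Section 4.1).
The following PyTorch sketch shows one plausible shape for such a discriminator over the
encoder states; the filter windows 3/4/5 and the ReLU follow the setup described later,
while the filter count, pooling and output layer are assumptions, not the authors' code.
The score it returns plays the role of D(·) in Eq. (7).

```python
import torch
import torch.nn as nn

class ConvDiscriminator(nn.Module):
    """Sketch of a CNN discriminator over encoder states H (batch, seq_len, dim):
    convolutions with several window sizes, max-pooling over time, sigmoid output."""

    def __init__(self, hidden_dim, num_filters=64, windows=(3, 4, 5)):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(hidden_dim, num_filters, kernel_size=w) for w in windows]
        )
        self.out = nn.Linear(num_filters * len(windows), 1)

    def forward(self, h):                    # h: (batch, seq_len, hidden_dim)
        h = h.transpose(1, 2)                # Conv1d expects (batch, channels, time)
        feats = [torch.relu(conv(h)).max(dim=2).values for conv in self.convs]
        score = torch.sigmoid(self.out(torch.cat(feats, dim=1)))
        return score.squeeze(1)              # D(H) in (0, 1), one score per sentence
```

With such a D, Linv for a pair (Hx, Hx') is simply −log D(Hx) − log(1 − D(Hx')); the
discriminator is trained to decrease this quantity and the encoder to increase it.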

3.4   Training

As shown in Figure 1, our training objective involves three sets of model parameters for
the three modules. We use mini-batch stochastic gradient descent to optimize our model. In
the forward pass, besides a mini-batch of x and y, we also construct a mini-batch
consisting of the perturbed neighbour x' and y. We propagate the information to calculate
the three loss functions according to the arrows in the figure. Then, gradients are
collected to update the three sets of model parameters. Except for the gradients of Linv
with respect to θenc, which are multiplied by −1, all gradients are back-propagated
normally. Note that we update θdis and θenc simultaneously for training efficiency.
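One common way to realise "multiply the gradients of Linv with respect to θenc by −1 while
back-propagating everything else normally" is a gradient-reversal operation placed between
the encoder states and the discriminator. The PyTorch sketch below is an assumption about
how this could be implemented, not the authors' code.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates the gradient in the backward pass,
    so the encoder receives -1 times the gradient of L_inv while the
    discriminator itself is updated with ordinary gradients."""

    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output.neg()

# usage sketch: pass the (original or perturbed) encoder states through the
# reversal layer before scoring them with the discriminator, e.g.
#   d_score = discriminator(GradReverse.apply(encoder_states))
```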
4   Experiments

4.1   Setup

We evaluated our adversarial stability training on translation tasks for several language
pairs, and report the 4-gram BLEU (Papineni et al., 2002) score as calculated by the
multi-bleu.perl script.

Chinese-English  We used the LDC corpus consisting of 1.25M sentence pairs with 27.9M
Chinese words and 34.5M English words. We selected the best model using the NIST 2006 set
as the validation set (for hyper-parameter optimization and model selection). The NIST
2002, 2003, 2004, 2005, and 2008 datasets are used as test sets.

English-German  We used the WMT 14 corpus containing 4.5M sentence pairs with 118M English
words and 111M German words. The validation set is newstest2013, and the test set is
newstest2014.

English-French  We used the IWSLT corpus, which contains 0.22M sentence pairs with 4.03M
English words and 4.12M French words. The IWSLT corpus is very dissimilar from the NIST
and WMT corpora: it is collected from TED talks and inclined to spoken language, so we use
it to verify our approaches on non-normative text. The IWSLT 14 test set is taken as the
validation set and the IWSLT 15 test set is used as the test set.

   For English-German and English-French, we tokenize the English, German and French text
using the tokenize.perl script. We follow Sennrich et al. (2016b) to split words into
subword units. The numbers of merge operations in byte pair encoding (BPE) are set to 30K,
40K and 30K respectively for Chinese-English, English-German, and English-French. We
report the case-sensitive tokenized BLEU score for English-German and English-French and
the case-insensitive tokenized BLEU score for Chinese-English.
   Our baseline system is an in-house NMT system. Following Bahdanau et al. (2015), we
implement an RNN-based NMT model in which both the encoder and decoder are two-layer RNNs
with residual connections between layers (He et al., 2016b). The gating mechanism of the
RNNs is the gated recurrent unit (GRU) (Cho et al., 2014). We apply layer normalization
(Ba et al., 2016) and dropout (Hinton et al., 2012) to the hidden states of the GRUs.
Dropout is also added to the source and target word embeddings. We share the same matrix
between the target word embeddings and the pre-softmax linear transformation (Vaswani et
al., 2017). We update the model parameters using Adam (Kingma and Ba, 2015). The learning
rate is initially set to 0.05 and varies according to the formula in Vaswani et al.
(2017).
   Our adversarial stability training initializes the model with the parameters trained by
maximum likelihood estimation (MLE). We denote adversarial stability training based on
lexical-level perturbations and feature-level perturbations as ASTlexical and ASTfeature,
respectively. We only sample one perturbed neighbour x' ∈ N(x) for training efficiency.
For the discriminator used in Linv, we adopt the CNN discriminator proposed by Kim (2014)
to address the variable-length problem of the sequences generated by the encoder. In the
CNN discriminator, the filter windows are set to 3, 4 and 5, and rectified linear units
are applied after the convolution operations. We tune the hyper-parameters on the
validation set through a grid search and find that the optimal values of both α and β are
1.0. The standard deviation σ of the Gaussian noise in Equation (6) is set to 0.01. The
number of words replaced in the sentence x for lexical-level perturbations is taken as
max(0.2|x|, 1), where |x| is the length of x. The default beam size for decoding is 10.

4.2   Translation Results

4.2.1   NIST Chinese-English Translation

 System                Training      MT06    MT02    MT03    MT04    MT05    MT08
 Shen et al. (2016)    MRT           37.34   40.36   40.93   41.37   38.81   29.23
 Wang et al. (2017)    MLE           37.29   –       39.35   41.15   38.07   –
 Zhang et al. (2018)   MLE           38.38   –       40.02   42.32   38.84   –
                       MLE           41.38   43.52   41.50   43.64   41.58   31.60
 this work             ASTlexical    43.57   44.82   42.95   45.05   43.45   34.85
                       ASTfeature    44.44   46.10   44.07   45.61   44.06   34.94

                Table 2: Case-insensitive BLEU scores on Chinese-English translation.

            System                             Architecture                   Training     BLEU
            Shen et al. (2016)                 Gated RNN with 1 layer         MRT          20.45
            Luong et al. (2015)                LSTM with 4 layers             MLE          20.90
            Kalchbrenner et al. (2017)         ByteNet with 30 layers         MLE          23.75
            Wang et al. (2017)                 DeepLAU with 4 layers          MLE          23.80
            Wu et al. (2016)                   LSTM with 8 layers             RL           24.60
            Gehring et al. (2017)              CNN with 15 layers             MLE          25.16
            Vaswani et al. (2017)              Self-attention with 6 layers   MLE          28.40
                                                                              MLE          24.06
            this work                          Gated RNN with 2 layers        ASTlexical   25.17
                                                                              ASTfeature   25.26

            Table 3: Case-sensitive BLEU scores on WMT 14 English-German translation.

         Training       tst2014     tst2015
         MLE            36.92       36.90
         ASTlexical     37.35       37.03
         ASTfeature     38.03       37.64

Table 4: Case-sensitive BLEU scores on IWSLT English-French translation.

Table 2 shows the results on Chinese-English translation. Our strong baseline system
significantly outperforms previously reported results on the Chinese-English NIST datasets
obtained with RNN-based NMT. Shen et al. (2016) propose minimum risk training (MRT) for
NMT, which directly optimizes model parameters with respect to BLEU scores. Wang et al.
(2017) address the issue of severe gradient diffusion with linear associative units (LAU);
their system is deep, with an encoder of 4 layers and a decoder of 4 layers. Zhang et al.
(2018) propose to exploit both left-to-right and right-to-left decoding strategies for NMT
to capture bidirectional dependencies. Compared with them, our NMT system trained by MLE
outperforms their best models by around 3 BLEU points. We hope that the strong baseline
systems used in this work make the evaluation convincing.
   We find that introducing adversarial stability training into NMT brings substantial
improvements over previous work (up to +3.16 BLEU points over Shen et al. (2016), up to
+3.51 BLEU points over Wang et al. (2017) and up to +2.74 BLEU points over Zhang et al.
(2018)) and over our system trained with MLE across all the datasets. Compared with our
baseline system, ASTlexical achieves a +1.75 BLEU improvement on average. ASTfeature
performs better still, obtaining +2.59 BLEU points on average and up to +3.34 BLEU points
on NIST08.

4.2.2   WMT 14 English-German Translation

In Table 3, we list existing NMT systems as comparisons. All these systems use the same
WMT 14 English-German corpus. Except for Shen et al. (2016) and Wu et al. (2016), which
respectively adopt MRT and reinforcement learning (RL), all systems use MLE as the
training criterion. All the systems except for Shen et al. (2016) are deep NMT models with
no less than four layers. Google's neural machine translation (GNMT) (Wu et al., 2016)
represents a strong RNN-based NMT system. Compared with the other RNN-based NMT systems
except for GNMT, our baseline system with two layers achieves better performance.

 Synthetic Type   Training      0 Op.   1 Op.   2 Op.   3 Op.   4 Op.   5 Op.
                  MLE           41.38   38.86   37.23   35.97   34.61   32.96
 Swap             ASTlexical    43.57   41.18   39.88   37.95   37.02   36.16
                  ASTfeature    44.44   42.08   40.20   38.67   36.89   35.81
                  MLE           41.38   37.21   31.40   27.43   23.94   21.03
 Replacement      ASTlexical    43.57   40.53   37.59   35.19   32.56   30.42
                  ASTfeature    44.44   40.04   35.00   30.54   27.42   24.57
                  MLE           41.38   38.45   36.15   33.28   31.17   28.65
 Deletion         ASTlexical    43.57   41.89   38.56   36.14   34.09   31.77
                  ASTfeature    44.44   41.75   39.06   36.16   33.49   30.90

Table 5: Translation results of synthetic perturbations on the validation set in Chinese-English translation.
"1 Op." denotes that we conduct one operation (swap, replacement or deletion) on the original sentence.

 Source             zhongguo dianzi yinhang yewu guanli xingui jiangyu sanyue yiri qi shixing
 Reference          china's new management rules for e-banking operations to take effect on march 1
 MLE                china's electronic bank rules to be implemented on march 1
 ASTlexical         new rules for business administration of china's electronic banking industry
                    will come into effect on march 1 .
 ASTfeature         new rules for business management of china's electronic banking industry to
                    come into effect on march 1

 Perturbed Source   zhongfang dianzi yinhang yewu guanli xingui jiangyu sanyue yiri qi shixing
 MLE                china to implement new regulations on business management
 ASTlexical         the new regulations for the business administrations of the chinese electronics
                    bank will come into effect on march 1 .
 ASTfeature         new rules for business management of china's electronic banking industry to
                    come into effect on march 1

Table 6: Example translations of a source sentence and its perturbed counterpart obtained
by replacing the Chinese word "zhongguo" with its synonym "zhongfang."

When training our NMT system with ASTlexical, a significant improvement (+1.11 BLEU
points) can be observed, and ASTfeature obtains slightly better performance still. Our NMT
system outperforms the state-of-the-art RNN-based NMT system, GNMT, by +0.66 BLEU points
and performs comparably with Gehring et al. (2017), which is based on a CNN with 15
layers. Given that our approach can be applied to any NMT system, we expect that the
adversarial stability training mechanism can further improve performance upon more
advanced NMT architectures. We leave this for future work.

4.2.3   IWSLT English-French Translation

Table 4 shows the results on IWSLT English-French translation. Compared with our strong
baseline system trained by MLE, our models consistently improve translation performance on
all datasets. ASTfeature achieves significant improvements on tst2015, while ASTlexical
obtains comparable results. These results demonstrate that our approach maintains good
performance on non-normative text.

4.3   Results on Synthetic Perturbed Data

In order to investigate the ability of our training approaches to deal with perturbations,
we experiment with three types of synthetic perturbations (a simple sketch of these
operations follows the list):

   • Swap: We randomly choose N positions in a sentence and then swap the chosen words
     with their right neighbours.

   • Replacement: We randomly replace sampled words in the sentence with other words.

   • Deletion: We randomly delete N words from each sentence in the dataset.
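A simple sketch of the three operations, assuming a sentence is given as a list of tokens;
the sampling details (e.g., keeping at least one word after deletion) are our assumptions,
not the authors' scripts.

```python
import random

def swap(words, n, rng=random):
    """Swap n randomly chosen words with their right neighbours."""
    words = list(words)
    for _ in range(n):
        if len(words) < 2:
            break
        i = rng.randrange(len(words) - 1)
        words[i], words[i + 1] = words[i + 1], words[i]
    return words

def replace(words, n, vocab, rng=random):
    """Replace n randomly sampled positions with random vocabulary words."""
    words = list(words)
    for i in rng.sample(range(len(words)), min(n, len(words))):
        words[i] = rng.choice(vocab)
    return words

def delete(words, n, rng=random):
    """Delete n randomly chosen words, keeping at least one word."""
    words = list(words)
    for _ in range(min(n, len(words) - 1)):
        del words[rng.randrange(len(words))]
    return words
```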

As shown in Table 5, our training approaches ASTlexical and ASTfeature consistently
outperform MLE against perturbations for all numbers of operations. This means that our
approaches have the capability of resisting perturbations. As the number of operations
increases, the performance of MLE drops quickly. Although the performance of our
approaches also drops, they consistently surpass MLE. For ASTlexical, with 0 operations
the difference is +2.19 (43.57 vs. 41.38) for all synthetic types, but the differences are
enlarged to +3.20, +9.39, and +3.12 respectively for the three types with 5 operations.
   For the Swap and Deletion types, ASTlexical and ASTfeature perform comparably after
more than four operations. Interestingly, ASTlexical performs significantly better than
both MLE and ASTfeature after more than one operation of the Replacement type. This is
because ASTlexical trains the model specifically on perturbation data constructed by
replacing words, which agrees with the Replacement type. Overall, ASTlexical performs
better than ASTfeature against perturbations after multiple operations. We speculate that
this is because the perturbation method of ASTlexical and the synthetic perturbations are
both discrete, and thus more consistent with each other. Table 6 shows example
translations of a Chinese sentence and its perturbed counterpart.
   These findings indicate that we can construct specific perturbations for a particular
task. For example, in simultaneous translation, an automatic speech recognition system
usually generates wrong words with the same pronunciation as the correct words, which
dramatically affects the quality of the machine translation system. Therefore, we can
design specific perturbations aimed at this task.

4.4   Analysis

4.4.1   Ablation Study

Our training objective function, Eq. (4), contains three loss functions. We perform an
ablation study on Chinese-English translation to understand the importance of these loss
functions, choosing ASTlexical as an example. As Table 7 shows, if we remove Ladv, the
translation performance decreases by 0.64 BLEU points. However, when Lnoisy is excluded
from the training objective, the result is a significant drop of 1.66 BLEU points.
Surprisingly, using only Lnoisy is able to lead to an increase of 0.88 BLEU points.

     Ltrue   Lnoisy   Ladv    BLEU
       √       ×       ×      41.38
       √       ×       √      41.91
       ×       √       ×      42.20
       √       √       ×      42.93
       √       √       √      43.57

Table 7: Ablation study of adversarial stability training ASTlexical on Chinese-English
translation. "√" means the loss function is included in the training objective, while "×"
means it is not.

4.4.2   BLEU Scores over Iterations

[Figure 2: BLEU scores of ASTlexical over iterations on the Chinese-English validation set
(curves for ASTlexical and ASTfeature; axes: BLEU vs. iterations ×10³).]

[Figure 3: Learning curves of the three loss functions Ltrue, Linv and Lnoisy over
iterations on the Chinese-English validation set (axes: cost vs. iterations ×10³).]

Figure 2 shows the changes in BLEU score over iterations for ASTlexical and ASTfeature,
respectively. They behave nearly identically. Initialized with the model trained by MLE,
their performance first drops rapidly and then starts to go up quickly.
Compared with the starting point, the maximal drop reaches up to about 7.0 BLEU points,
and the curves basically present a state of oscillation. We believe that introducing
random perturbations and adversarial learning makes the training less stable than MLE.

4.4.3   Learning Curves of Loss Functions

Figure 3 shows the learning curves of the three loss functions Ltrue, Linv and Lnoisy. We
find that the losses do not decrease steadily. Similar to Figure 2, there are still
oscillations in the learning curves, although they do not change as sharply. We find that
Linv converges to around 0.68 after about 100K iterations, which indicates that the
discriminator outputs a probability of 0.5 for both positive and negative samples and
cannot distinguish them. Thus the encoder behaves nearly identically for x and its
perturbed neighbour x'.

5   Related Work

Our work is inspired by two lines of research: (1) adversarial learning and (2) data
augmentation.

Adversarial Learning   Generative Adversarial Networks (GANs) (Goodfellow et al., 2014)
and their derivatives have been widely applied in computer vision (Radford et al., 2015;
Salimans et al., 2016) and natural language processing (Li et al., 2017; Yang et al.,
2018). Previous work has constructed adversarial examples to attack trained networks and
to make networks resist them, which has been shown to improve the robustness of networks
(Goodfellow et al., 2015; Miyato et al., 2016; Zheng et al., 2016). Belinkov and Bisk
(2018) introduce adversarial examples into the training data of character-based NMT
models. In contrast to theirs, adversarial stability training aims to stabilize both the
encoder and decoder of NMT models. We adopt adversarial learning to learn the
perturbation-invariant encoder.

Data Augmentation   Data augmentation can also improve the robustness of NMT models. In
NMT, a number of works augment the training data with monolingual corpora (Sennrich et
al., 2016a; Cheng et al., 2016; He et al., 2016a; Zhang and Zong, 2016). They all leverage
complex models, such as inverse NMT models, to generate translation equivalents for
monolingual corpora, and then augment the parallel corpora with these pseudo corpora to
improve NMT models. Some authors have recently endeavored to achieve zero-shot NMT by
transferring knowledge from bilingual corpora of other language pairs (Chen et al., 2017;
Zheng et al., 2017; Cheng et al., 2017) or from monolingual corpora (Lample et al., 2018;
Artetxe et al., 2018). Our work differs significantly from theirs: we do not resort to
complicated models to generate perturbed data and do not depend on extra monolingual or
bilingual corpora. Our method is more convenient and easier to implement, and we focus
more on improving the robustness of NMT models.

6   Conclusion

We have proposed adversarial stability training to improve the robustness of NMT models.
The basic idea is to make both the encoder and decoder robust to input perturbations by
enabling them to behave similarly for the original input and its perturbed counterpart. We
propose two approaches to construct perturbed data, which is used to adversarially train
the encoder and stabilize the decoder. Experiments on Chinese-English, English-German and
English-French translation tasks show that the proposed approach can improve both the
robustness and the translation performance.
   As our training framework is not limited to specific perturbation types, it would be
interesting to evaluate our approach on natural noise found in practical applications,
such as homonyms in simultaneous translation systems. It is also necessary to further
validate our approach on more advanced NMT architectures, such as CNN-based NMT (Gehring
et al., 2017) and the Transformer (Vaswani et al., 2017).

Acknowledgments

We thank the anonymous reviewers for their insightful comments and suggestions. We also
thank Xiaoling Li for analyzing experimental results and providing valuable examples. Yang
Liu is supported by the National Key R&D Program of China (No. 2017YFB0202204), National
Natural Science Foundation of China (No. 61761166008, No. 61522204), Beijing Advanced
Innovation Center for Language Resources, and the NExT++ project supported by the National
Research Foundation, Prime Minister's Office, Singapore under its IRC@Singapore Funding
Initiative.

References

Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2018. Unsupervised neural
  machine translation. In Proceedings of ICLR.

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv
  preprint arXiv:1607.06450.

Dzmitry Bahdanau, KyungHyun Cho, and Yoshua Bengio. 2015. Neural machine translation by
  jointly learning to align and translate. In Proceedings of ICLR.

Yonatan Belinkov and Yonatan Bisk. 2018. Synthetic and natural noise both break neural
  machine translation. In Proceedings of ICLR.

Yun Chen, Yang Liu, Yong Cheng, and Victor O.K. Li. 2017. A teacher-student framework for
  zero-resource neural machine translation. In Proceedings of ACL.

Yong Cheng, Wei Xu, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2016.
  Semi-supervised learning for neural machine translation. In Proceedings of ACL.

Yong Cheng, Qian Yang, Yang Liu, Maosong Sun, and Wei Xu. 2017. Joint training for
  pivot-based neural machine translation. In Proceedings of IJCAI.

Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares,
  Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN
  encoder-decoder for statistical machine translation. In Proceedings of EMNLP.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. 2017.
  Convolutional sequence to sequence learning. In Proceedings of ICML.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil
  Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In
  Proceedings of NIPS.

Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. 2015. Explaining and harnessing
  adversarial examples. In Proceedings of ICLR.

Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tie-Yan Liu, and Wei-Ying Ma. 2016a.
  Dual learning for machine translation. In Proceedings of NIPS.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016b. Deep residual learning for
  image recognition. In Proceedings of CVPR.

Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R
  Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature
  detectors. arXiv preprint arXiv:1207.0580.

Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and
  Koray Kavukcuoglu. 2017. Neural machine translation in linear time. In Proceedings of
  ICML.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings
  of EMNLP.

Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In
  Proceedings of ICLR.

Alex M Lamb, Anirudh Goyal, Ying Zhang, Saizheng Zhang, Aaron C Courville, and Yoshua
  Bengio. 2016. Professor forcing: A new algorithm for training recurrent networks. In
  Proceedings of NIPS.

Guillaume Lample, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018. Unsupervised machine
  translation using monolingual corpora only. In Proceedings of ICLR.

Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, and Dan Jurafsky. 2017.
  Adversarial learning for neural dialogue generation. In Proceedings of EMNLP.

Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to
  attention-based neural machine translation. In Proceedings of EMNLP.

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu.
  2018. Towards deep learning models resistant to adversarial attacks. In Proceedings of
  ICLR.

Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, Ken Nakae, and Shin Ishii. 2016.
  Distributional smoothing with virtual adversarial training. In Proceedings of ICLR.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for
  automatic evaluation of machine translation. In Proceedings of ACL.

Alec Radford, Luke Metz, and Soumith Chintala. 2015. Unsupervised representation learning
  with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen.
  2016. Improved techniques for training GANs. In Proceedings of NIPS.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Improving neural machine
  translation models with monolingual data. In Proceedings of ACL.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of
  rare words with subword units. In Proceedings of ACL.

Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2016.
  Minimum risk training for neural machine translation. In Proceedings of ACL.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with
  neural networks. In Proceedings of NIPS.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian
  Goodfellow, and Rob Fergus. 2014. Intriguing properties of neural networks. In
  Proceedings of ICML.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
  Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of
  NIPS.

Mingxuan Wang, Zhengdong Lu, Jie Zhou, and Qun Liu. 2017. Deep neural machine translation
  with linear associative unit. In Proceedings of ACL.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey,
  Maxim Krikun, et al. 2016. Google's neural machine translation system: Bridging the gap
  between human and machine translation. arXiv preprint arXiv:1609.08144.

Z. Yang, W. Chen, F. Wang, and B. Xu. 2018. Improving neural machine translation with
  conditional sequence generative adversarial nets. In Proceedings of NAACL.

Jiajun Zhang and Chengqing Zong. 2016. Exploiting source-side monolingual data in neural
  machine translation. In Proceedings of EMNLP.

Xiangwen Zhang, Jinsong Su, Yue Qin, Yang Liu, Rongrong Ji, and Hongji Wang. 2018.
  Asynchronous bidirectional decoding for neural machine translation. In Proceedings of
  AAAI.

Hao Zheng, Yong Cheng, and Yang Liu. 2017. Maximum expected likelihood estimation for
  zero-resource neural machine translation. In Proceedings of IJCAI.

Stephan Zheng, Yang Song, Thomas Leung, and Ian Goodfellow. 2016. Improving the robustness
  of deep neural networks via stability training. In Proceedings of CVPR.