Modeling Coverage for Non-Autoregressive Neural Machine Translation

Yong Shan, Yang Feng*, Chenze Shao
Key Laboratory of Intelligent Information Processing,
Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS), Beijing, China
University of Chinese Academy of Sciences, Beijing, China
shanyong18s@ict.ac.cn, fengyang@ict.ac.cn, shaochenze18z@ict.ac.cn

* Yang Feng is the corresponding author.

   Abstract—Non-Autoregressive Neural Machine Translation (NAT) has achieved significant inference speedup by generating all tokens simultaneously. Despite its high efficiency, NAT usually suffers from two kinds of translation errors: over-translation (e.g. repeated tokens) and under-translation (e.g. missing translations), which eventually limit the translation quality. In this paper, we argue that these issues of NAT can be addressed through coverage modeling, which has proved to be useful in autoregressive decoding. We propose a novel Coverage-NAT that models the coverage information directly through a token-level coverage iterative refinement mechanism and a sentence-level coverage agreement, which respectively remind the model whether a source token has been translated and improve the semantic consistency between the translation and the source. Experimental results on WMT14 En↔De and WMT16 En↔Ro translation tasks show that our method can alleviate those errors and achieve strong improvements over the baseline system.

                                   TABLE I
   A GERMAN-ENGLISH EXAMPLE OF over-translation AND under-translation ERRORS IN NAT. "SRC" AND "REF" MEAN SOURCE AND REFERENCE, RESPECTIVELY.

   Src:   Das technische Personal hat mir ebenfalls viel gegeben.
   Ref:   The technical staff has also brought me a lot.
   NAT:   The technical technical staff has has (also) brought me a lot.

                              I. INTRODUCTION

   Neural Machine Translation (NMT) has achieved impressive performance over recent years [1]–[5]. NMT models are typically built on the encoder-decoder framework, where the encoder encodes the source sentence into distributed representations and the decoder generates the target sentence from those representations in an autoregressive manner: the decoding at step t is conditioned on the predictions of past steps. This autoregressive translation manner (AT) leads to high inference latency and restricts NMT applications.
   Recent studies try to address this issue by replacing AT with non-autoregressive translation (NAT), which generates all tokens simultaneously [6]. However, the lack of sequential dependency between target tokens in NAT makes it difficult for the model to capture the highly multimodal distribution of the target translation. This phenomenon is termed the multimodality problem [6], and it usually causes two kinds of translation errors during inference: over-translation, where the same token is generated repeatedly at consecutive time steps, and under-translation, where the semantics of several source tokens are not fully translated [7]. Table I shows a German-English example, where "technical" and "has" are translated repeatedly while "also" is not translated at all.
   Over-translation and under-translation errors also exist in AT models. There, they are usually addressed by explicitly modeling the coverage information step by step [8], [9], where "coverage" indicates whether a source token has been translated and the decoding process is completed when all source tokens are "covered". Despite their success in AT, existing NAT methods targeting these two errors [7], [10], [11] have not explored coverage modeling. It is difficult for NAT to model coverage because of its non-autoregressive nature: step t cannot know whether a source token has been translated by other steps, since all target tokens are generated simultaneously during decoding.
   In this paper, we propose a novel Coverage-NAT to address this challenge by directly modeling coverage information in NAT. Our model works at both the token level and the sentence level. At the token level, we propose a token-level coverage iterative refinement mechanism (TCIR), which models the token-level coverage information with an iterative coverage layer and refines the source exploitation iteratively. Specifically, we replace the top layer of the decoder with the iterative coverage layer and compute the coverage vector in parallel. In each iteration, the coverage vector tracks the attention history of the previous iteration and then adjusts the attention of the current iteration, which encourages the model to concentrate more on the untranslated parts and hence reduces over-translation and under-translation errors. In contrast to the step-wise manner of [8], TCIR models the token-level coverage information in an iteration-wise manner, which does not hurt the non-autoregressive generation. At the sentence level, we propose a sentence-level coverage agreement (SCA) as a regularization term. Inspired by the intuition that a good translation should "cover" the source semantics at the sentence level, SCA first computes the sentence representations of the translation and the source sentence, and then constrains the semantic agreement between them. Thus, our model improves the sentence-level coverage of the translation as well.
   We evaluate our model on four translation tasks (WMT14 En↔De, WMT16 En↔Ro). Experimental results show that our method outperforms the baseline system by up to +2.75 BLEU with a competitive speedup (about 10×), and further achieves near-Transformer performance on the WMT16 dataset while being 5.22× faster with the rescoring technique. The main contributions of our method are as follows:
   • To alleviate the over-translation and under-translation errors, we introduce coverage modeling into NAT.
   • We propose to model the coverage information at both the token level and the sentence level, through iterative refinement and semantic agreement respectively; both are effective and achieve strong improvements.

                              II. BACKGROUND

   Recently, neural machine translation has achieved impressive improvements with autoregressive decoding. Despite its high performance, AT suffers from long inference latency. NAT [6] was proposed to address this problem by parallelizing the decoding process.
   The NAT architecture can be formulated within the encoder-decoder framework [2]. As in AT models, the NAT encoder takes the embeddings of the source tokens as inputs and outputs the contextual source representations E_enc, which are further used to compute the target length T and the inter-attention on the decoder side.
Multi-Head Attention: The attention mechanism used in Transformer and NAT is multi-head attention (Attn), which can generally be formulated as the weighted sum of value vectors V using query vectors Q and key vectors K:

                 A = softmax(QK^T / √d_model)
                 Attn(Q, K, V) = AV                                  (1)

where d_model is the dimension of the hidden states. There are three kinds of multi-head attention in NAT: multi-head self-attention, multi-head inter-attention and multi-head positional attention. Refer to [6] for details.
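The scaled dot-product attention of Equation (1) can be written compactly in PyTorch. The sketch below is illustrative only (a single head, no masking or dropout); the function name and tensor shapes are our own choices, not the paper's code.

```python
import torch

def scaled_dot_product_attention(Q, K, V):
    # Q: (T_q, d_model); K, V: (T_k, d_model) -- single head, no masking.
    d_model = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / (d_model ** 0.5)  # pre-softmax term of Equation (1)
    A = torch.softmax(scores, dim=-1)                    # attention weights A
    return A @ V, A                                      # Attn(Q, K, V) = AV

# Toy usage: 4 decoder positions attending over 6 encoder states.
Q = torch.randn(4, 512)
K = torch.randn(6, 512)
V = torch.randn(6, 512)
out, A = scaled_dot_product_attention(Q, K, V)           # out: (4, 512), A: (4, 6)
```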
Encoder: A stack of L identical layers. Each layer consists of a multi-head self-attention and a position-wise feed-forward network (FFN). The FFN consists of two linear transformations with a ReLU activation [5]:

                 FFN(x) = max(0, xW_1 + b_1)W_2 + b_2                (2)

Note that we omit layer normalization, dropout and residual connections for simplicity. Besides, a length predictor is used to predict the target length.
Decoder: A stack of L identical layers. Each layer consists of a multi-head self-attention, a multi-head inter-attention, a multi-head positional attention, and a position-wise feed-forward network.
   Given an input sentence in the source language X = {x_1, x_2, ..., x_n}, NAT models generate a sentence in the target language Y = {y_1, y_2, ..., y_T}, where n and T are the lengths of the source and target sentence, respectively. The translation probability is defined as:

                 P(Y|X, θ) = ∏_{t=1}^{T} p(y_t | X, θ)               (3)

The cross-entropy loss is applied to minimize the negative log-likelihood:

                 L_mle(θ) = − ∑_{t=1}^{T} log p(y_t | X, θ)          (4)

During training, the target length T is usually set to the reference length. During inference, T is determined by the target length predictor, and the translation is then obtained by taking the token with the maximum likelihood at each step:

                 ŷ_t = argmax_{y_t} p(y_t | X, θ)                    (5)
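Because the factorization in Equation (3) treats target positions as conditionally independent, inference reduces to one argmax per position, all taken in parallel, as in Equation (5). A minimal sketch, assuming the decoder exposes a tensor of per-position logits (names below are illustrative, not the paper's API):

```python
import torch

def nat_greedy_decode(decoder_logits):
    """decoder_logits: (T, vocab) -- one forward pass yields all positions at once."""
    log_probs = torch.log_softmax(decoder_logits, dim=-1)  # log p(y_t | X) for every t in parallel
    return log_probs.argmax(dim=-1)                        # Equation (5): independent argmax per position

def nat_cross_entropy(decoder_logits, reference):
    """Negative log-likelihood of Equation (4); reference: (T,) token ids."""
    log_probs = torch.log_softmax(decoder_logits, dim=-1)
    return -log_probs.gather(1, reference.unsqueeze(1)).sum()

# Toy usage with random logits for T=5 target positions and a 32k vocabulary.
logits = torch.randn(5, 32000)
tokens = nat_greedy_decode(logits)        # (5,) predicted token ids
loss = nat_cross_entropy(logits, tokens)  # scalar L_mle for this (degenerate) reference
```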
                           III. COVERAGE-NAT

   In this section, we describe the details of Coverage-NAT. We first briefly introduce the NAT architecture in Section III-A. Then, we illustrate the proposed token-level coverage iterative refinement (TCIR) and sentence-level coverage agreement (SCA) in Sections III-B and III-C, respectively. Figure 1 shows the architecture of Coverage-NAT.

[Figure 1: architecture diagram. Left: the encoder stack and the training procedures (pre-training and fine-tuning phases, target length predictor). Right: the decoder, whose top layer is the iterative coverage layer run for K iterations, and the sentence-level coverage agreement module.]
Fig. 1. The architecture of Coverage-NAT. We propose the token-level coverage iterative refinement mechanism and the sentence-level coverage agreement to model the coverage information. At the token level, we replace the top layer of the decoder with an iterative coverage layer, which models the token-level coverage information and then refines the source exploitation in an iteration-wise manner. At the sentence level, we maximize the source-semantics coverage of the translation by constraining the sentence-level semantic agreement between the translation and the source sentence. We use a two-phase optimization to train our model.

A. NAT Architecture

   The NAT architecture we use is similar to [6], with the following differences in the decoder: (1) instead of copying source embeddings according to fertility or uniform mapping, we feed a sequence of "<unk>" tokens to the NAT decoder directly; (2) we remove the positional attention in the decoder layers. This design simplifies the decoder but works well in our experiments.
   In Coverage-NAT, the bottom L − 1 layers of the decoder are identical. As the top layer, we design a novel iterative coverage layer, which refines the source exploitation by iteratively modeling token-level coverage.

B. Token-Level Coverage Iterative Refinement

   Token-level coverage information has been applied to phrase-based SMT [12] and AT [8], where the decoder maintains a coverage vector to indicate whether a source token has been translated and then pays more attention to the untranslated parts. Benefiting from the autoregressive nature, AT models can easily model the coverage information step by step; generally, this is a step-wise refinement of the source exploitation with the assistance of the coverage vector. However, it encounters difficulties in non-autoregressive decoding: since all target tokens are generated simultaneously, step t cannot know whether a source token has been translated by other steps during decoding.
   The key to solving this problem for NAT lies in two questions: (1) how to compute the coverage vector under the non-autoregressive setting, and (2) how to use the coverage vector to refine the source exploitation under the non-autoregressive setting. To address them, we propose a novel token-level coverage iterative refinement mechanism. Our basic idea is to replace the top layer of the decoder with an iterative coverage layer, where we model the coverage information to refine the source exploitation in an iteration-wise manner. The iterative coverage layer also contains a multi-head self-attention, a multi-head inter-attention and an FFN.
   For question (1), we maintain an attention precedence, which means that earlier steps have priority to attend to source tokens and later steps should not scramble for those already "covered" parts. We manually calculate the coverage vector C^k on the basis of the inter-attention weights of the previous iteration. At step t, we compute the accumulation of past steps' attention on x_i as C^k_{t,i} to represent the coverage of x_i:

                 C^k_{t,i} = min(∑_{t'=0}^{t−1} A^{k−1}_{t',i}, 1)            (6)

where k is the iteration number and A^{k−1}_{t',i} is the inter-attention weight from step t' to x_i in the previous iteration. We use the min function to ensure that no coverage exceeds 1.
   For question (2), we use C^k as a bias on the inter-attention so that step t pays more attention to source tokens left untranslated by past steps in the previous iteration. A full iteration is given by:

                 Ĥ^k = Attn(H^{k−1}, H^{k−1}, H^{k−1})                        (7)
                 B^k = 1 − C^k                                                (8)
                 A^k = softmax(Ĥ^k E_enc^T / √d_model + λ × B^k)              (9)
                 H^k = FFN(A^k E_enc)                                         (10)

where H^k is the hidden states of iteration k, λ ∈ R^1 is trainable and initialized to 1, and × denotes scalar multiplication. In Equation (7), we compute the self-attention and store the output as Ĥ^k. In Equations (8)-(9), we first compute the difference between 1 and C^k so that source tokens with lower coverage obtain a higher bias, and then integrate the bias B^k into the inter-attention weights through the trainable scaling factor λ. Finally, we output the hidden states of iteration k in Equation (10). In this way, each step is discouraged from attending to source tokens heavily attended by past steps in the previous iteration, and pushed to concentrate on source tokens attended less. The source exploitation is thus refined iteratively with token-level coverage.
   We use the hidden states and the inter-attention weights of the (L−1)-th layer as H^0 and A^0, respectively. After K iterations, we use H^K as the output of the decoder.
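A compact sketch of one TCIR pass over Equations (6)-(10). It follows the single-head, unbatched notation above; the real layer is multi-head and also keeps the usual residual connections and layer normalization, which are omitted here. All class and variable names are illustrative, not taken from the paper's implementation.

```python
import torch
import torch.nn as nn

class IterativeCoverageLayer(nn.Module):
    """Illustrative single-head sketch of the iterative coverage layer, Eqs. (6)-(10)."""
    def __init__(self, d_model=512, d_hidden=2048):
        super().__init__()
        self.d_model = d_model
        self.lam = nn.Parameter(torch.tensor(1.0))  # trainable scaling factor lambda, initialized to 1
        self.ffn = nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                                 nn.Linear(d_hidden, d_model))

    def coverage(self, A_prev):
        # Equation (6): exclusive cumulative attention over past steps, clamped to 1.
        cum = A_prev.cumsum(dim=0) - A_prev          # sum over t' = 0 .. t-1 for each source token
        return cum.clamp(max=1.0)                    # C^k, shape (T, n)

    def forward(self, H, A, E_enc, K=5):
        for _ in range(K):
            # Equation (7): plain self-attention over the previous iteration's states.
            H_hat = torch.softmax(H @ H.transpose(0, 1) / self.d_model ** 0.5, dim=-1) @ H
            B = 1.0 - self.coverage(A)                                               # Equation (8)
            scores = H_hat @ E_enc.transpose(0, 1) / self.d_model ** 0.5 + self.lam * B  # Equation (9)
            A = torch.softmax(scores, dim=-1)
            H = self.ffn(A @ E_enc)                                                  # Equation (10)
        return H, A

# Toy usage: T=4 target steps, n=6 source tokens; H^0/A^0 come from the (L-1)-th decoder layer.
layer = IterativeCoverageLayer()
H0, A0, E_enc = torch.randn(4, 512), torch.softmax(torch.randn(4, 6), -1), torch.randn(6, 512)
H_K, A_K = layer(H0, A0, E_enc, K=5)
```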
C. Sentence-Level Coverage Agreement

   In AT models, the sentence-level agreement between the source side and the target side has been exploited by Yang et al. [13], who constrain the agreement between the source sentence and the reference sentence. However, this can only enhance the representations of the source side and the target side; it does not improve the semantic consistency between the source sentence and the translation.
   Differently, we consider the sentence-level agreement from a coverage perspective. Intuitively, a translation "covers" the source semantics at the sentence level if the representation of the translation is as close as possible to that of the source sentence. Thus, we propose to constrain the sentence-level agreement between the source sentence and the translation. For the source sentence, we compute the sentence representation by treating the whole sentence as a bag-of-words. Specifically, we first map the source embeddings into the target latent space with a linear transformation, and then use mean pooling to obtain the sentence representation Ē_src:

                 Ē_src = Mean(ReLU(E_src W_s))                       (11)

where E_src = {E_src,1, ..., E_src,n} are the word embeddings of the source sentence, W_s ∈ R^{d_model×d_model} is the parameter of the linear transformation and ReLU is the activation function. For the translation, we need to calculate the expectation of the hypothesis at each position, which takes all possible translation results into consideration. Although it is inconvenient for AT to calculate this expectation via multinomial sampling [14], we can do it easily in NAT thanks to the token-independence assumption:

                 E_hyp,t = p_t W_e                                   (12)

where p_t ∈ R^{d_vocab} is the probability vector of the hypothesis at step t and W_e ∈ R^{d_vocab×d_model} is the target-side word embedding matrix. Then, we calculate the sentence representation of the translation by mean pooling:

                 Ē_hyp = Mean({E_hyp,1, ..., E_hyp,T})               (13)

Finally, we compute the L2 distance between the source sentence and the translation as the sentence-level coverage agreement. Note that we use a scaling factor 1/√d_model to stabilize the training:

                 L_coverage = L2(Ē_src, Ē_hyp) / √d_model            (14)
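The sentence-level coverage agreement of Equations (11)-(14) can be sketched as follows. W_s and the target embedding matrix W_e are assumed to be available as plain tensors, and the code is an unbatched simplification rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

def sca_loss(E_src, W_s, probs, W_e, d_model=512):
    """
    E_src: (n, d_model)       source word embeddings
    W_s:   (d_model, d_model) linear map into the target latent space
    probs: (T, vocab)         per-position output distributions p_t of the NAT decoder
    W_e:   (vocab, d_model)   target-side word embedding matrix
    """
    E_src_bar = F.relu(E_src @ W_s).mean(dim=0)  # Equation (11): bag-of-words source vector
    E_hyp = probs @ W_e                          # Equation (12): expected embedding at each step
    E_hyp_bar = E_hyp.mean(dim=0)                # Equation (13): mean pooling over positions
    return torch.dist(E_src_bar, E_hyp_bar, p=2) / d_model ** 0.5  # Equation (14)

# Toy usage: n=6 source tokens, T=5 target positions, 32k vocabulary.
loss = sca_loss(torch.randn(6, 512), torch.randn(512, 512),
                torch.softmax(torch.randn(5, 32000), -1), torch.randn(32000, 512))
```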
D. Two-Phase Optimization

   In our approach, we first pre-train the proposed model to convergence with the basic loss functions, and then fine-tune it with the SCA and the basic loss functions.
Pre-training: During the pre-training phase, we optimize the basic loss functions for NAT:

                 L = L_mle + αL_length                               (15)

where L_length is the loss of the target length predictor [6] and α is a hyper-parameter controlling the weight of L_length.
Fine-tuning: During the fine-tuning phase, we optimize the SCA and the basic loss functions jointly:

                 L = L_mle + αL_length + βL_coverage                 (16)

where β is a hyper-parameter controlling the weight of L_coverage. We have also tried training the model with all loss functions jointly from scratch; the result is worse than fine-tuning. We believe this is because the SCA interferes with the optimization of translation at the beginning of training.
                        IV. EXPERIMENTAL SETUP

Datasets: We evaluate the effectiveness of our proposed method on the WMT14 En↔De (4.5M pairs) and WMT16 En↔Ro (610k pairs) datasets. For WMT14 En↔De, we employ newstest2013 and newstest2014 as the development and test sets. For WMT16 En↔Ro, we employ newsdev2016 and newstest2016 as the development and test sets. We tokenize all sentences into subword units [18] with 32k merge operations and share the vocabulary between the source and target sides.
Metrics: We evaluate model performance with case-sensitive BLEU [19] for the En-De datasets and case-insensitive BLEU for the En-Ro datasets. Latency is computed as the average per-sentence decoding time (ms) on the full test set of WMT14 En→De without mini-batching, measured on 1 Nvidia RTX 2080Ti GPU.
Baselines: We take the Transformer model [5] as our AT baseline as well as the teacher model. We choose the following NAT baselines: (1) NAT-FT, which equips vanilla NAT with fertility fine-tuning [6]; (2) NAT-Base, the NAT implementation in Fairseq (https://github.com/pytorch/fairseq), which removes the multi-head positional attention and feeds the decoder a sequence of "<unk>" tokens instead of copying the source embeddings; (3) NAT-IR, which iteratively refines the translation multiple times [15]; (4) Reinforce-NAT, which introduces a sequence-level objective using reinforcement learning [10]; (5) NAT-REG, which proposes two regularization terms to alleviate translation errors [7]; (6) BoN-NAT, which minimizes the bag-of-ngrams difference between the translation and the reference [11]; (7) Hint-NAT, which leverages hints from AT models to enhance NAT [16]; (8) FlowSeq-large, which integrates generative flow into NAT [17].
Model Configurations: We implement our approach on NAT-Base and closely follow the settings of [6], [15]. For all experiments, we use the base configuration (d_model=512, d_hidden=2048, n_layer=6, n_head=8). In the main experiments, we set the number of iterations during both training (K_train) and decoding (K_dec) to 5, α to 0.1 and β to 0.5.
Training: We train the AT teacher on the original corpus, construct the distillation corpus with sequence-level knowledge distillation [20], and train the NAT models on the distillation corpus. We choose Adam [21] as the optimizer. Each mini-batch contains 64k tokens. During pre-training, we stop training after 400k steps for WMT14 En↔De and 180k steps for WMT16 En↔Ro. The learning rate warms up to 5e-4 in the first 10k steps and then decays under the inverse square-root schedule. During fine-tuning, we fix the learning rate to 1e-5 and stop training after 1k steps. We run all experiments on 4 Nvidia RTX 2080Ti GPUs and choose the best checkpoint by BLEU.
Inference: During inference, we employ length parallel decoding (LPD), which generates several candidates in parallel and re-ranks them with the AT teacher [6]. During post-processing, we remove tokens generated repeatedly.
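The repeated-token post-processing mentioned above amounts to collapsing consecutive duplicates in the decoded sequence. A minimal sketch; the function name and its use of plain Python lists are our own, since the paper does not specify the implementation:

```python
def collapse_repeats(tokens):
    """Drop tokens that merely repeat their immediate predecessor, e.g.
    ["the", "technical", "technical", "staff"] -> ["the", "technical", "staff"]."""
    out = []
    for tok in tokens:
        if not out or tok != out[-1]:
            out.append(tok)
    return out
```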

                      V. RESULTS AND ANALYSIS

                              TABLE II
   THE BLEU (%) AND THE SPEEDUP ON THE TEST SETS OF THE WMT14 EN↔DE AND WMT16 EN↔RO TASKS.

   Models                            WMT14    WMT14    WMT16    WMT16    Latency      Speedup
                                     En→De    De→En    En→Ro    Ro→En
   Transformer [5]                   27.19    31.25    32.80    32.61    276.88 ms    1.00×
   Existing NAT Implementations
   NAT-FT [6]                        17.69    21.47    27.29    29.06    /            15.60×
   NAT-FT (rescoring 10) [6]         18.66    22.41    29.02    30.76    /            7.68×
   NAT-Base                          18.60    22.70    28.78    29.00    19.36 ms     14.30×
   NAT-IR (i_dec = 5) [15]           20.26    23.86    28.86    29.72    /            3.11×
   Reinforce-NAT [10]                19.15    22.52    27.09    27.93    /            10.73×
   NAT-REG [7]                       20.65    24.77    /        /        /            27.60×
   BoN-NAT [11]                      20.90    24.61    28.31    29.29    /            10.77×
   Hint-NAT [16]                     21.11    25.24    /        /        /            30.20×
   FlowSeq-large [17]                23.72    28.39    29.73    30.72    /            /
   Our NAT Implementations
   Coverage-NAT                      21.35    25.04    30.05    30.33    27.58 ms     10.04×
   Coverage-NAT (rescoring 9)        23.76    27.84    31.93    32.32    53.08 ms     5.22×

A. Main Results

   Table II shows our main results on the WMT14 En↔De and WMT16 En↔Ro datasets. Coverage-NAT achieves strong BLEU improvements over NAT-Base: +2.75, +2.34, +1.27 and +1.33 BLEU on WMT14 En→De, WMT14 De→En, WMT16 En→Ro and WMT16 Ro→En, respectively. Moreover, Coverage-NAT surpasses most NAT baselines in BLEU while keeping a competitive speedup over Transformer (10.04×). It is worth mentioning that Coverage-NAT achieves larger improvements on the WMT14 datasets than on the WMT16 datasets. The reason may be that it is harder for NAT to align to the long source sentences that are more frequent in WMT14, and Coverage-NAT explicitly models the coverage information, which therefore pays off more there. With LPD over 9 candidates, Coverage-NAT gains +5.10, +5.43, +2.91 and +1.56 BLEU over NAT-FT (rescoring 10) on WMT14 En→De, WMT14 De→En, WMT16 En→Ro and WMT16 Ro→En, respectively, which is near the performance of Transformer while being 5.22× faster on the WMT16 En↔Ro tasks.

B. Ablation Study

                              TABLE III
   ABLATION STUDY ON THE WMT16 EN→RO DEVELOPMENT SET. WE SHOW THE EFFECTS OF DIFFERENT NUMBERS OF TRAINING ITERATIONS (K_train) AND DIFFERENT SCA WEIGHTS (β).

   Model           K_train    β      BLEU (%)         Latency (ms)
   NAT-Base        /          /      28.96            20.51
   Coverage-NAT    1          /      28.77 (-0.19)    20.82
                   3          /      29.20 (+0.24)    24.84
                   5          /      29.33 (+0.37)    28.34
                   10         /      29.45 (+0.49)    40.80
                   15         /      29.61 (+0.65)    48.90
                   5          0      29.59 (+0.63)    28.34
                   5          1      30.03 (+1.07)    28.34
                   5          0.5    30.16 (+1.20)    28.34

   As Table III shows, we conduct an ablation study on the WMT16 En→Ro development set to examine the effectiveness of each module in Coverage-NAT. First, we evaluate the effect of the number of training iterations (K_train); in the ablation study, K_train and K_dec are the same. As K_train increases, the BLEU first drops slightly and then increases continuously. This demonstrates that TCIR needs enough iterations to learn effective source exploitation with the help of token-level coverage information. Because of the trade-off between performance and latency, we set K_train to 5 in our main experiments, although K_train = 15 brings more gains. Second, we evaluate the effect of different SCA weights. When we set β to 0, 1 and 0.5, the BLEU increases by +0.63, +1.07 and +1.20, respectively. This shows that a moderate β is important for SCA to be effective.
C. Effects of Iterations During Decoding

[Figure 2: BLEU and latency curves of Coverage-NAT (w/o SCA) and NAT-Base as functions of K_dec.]
Fig. 2. The BLEU and the latency with different numbers of decoding iterations (K_dec) on the WMT14 En→De development set. The results are from Coverage-NAT (w/o SCA, K_train = 5). Our model reaches the maximum BLEU when K_dec = K_train, and achieves almost the maximum improvement when K_dec = 4, which indicates that it can work well with fewer iterations and lower latency.

   As Figure 2 shows, we use Coverage-NAT (w/o SCA, K_train = 5) to decode the WMT14 En→De development set with different numbers of decoding iterations (K_dec). As K_dec grows, the BLEU increases and reaches its maximum near K_train, while the latency increases continuously. The reason the performance does not keep increasing may be that our model is trained to provide the best coverage at K_train, and this consistency is broken when K_dec moves away from K_train. Our model surpasses NAT-Base when K_dec = 3 and achieves almost the maximum BLEU when K_dec = 4, which indicates that it can work well with fewer iterations and lower latency than K_train during decoding.

D. Analysis of Repeated Tokens

   We study the ratio of repeated tokens removed during post-processing to see how our methods improve the translation quality. We split the WMT14 En→De development set equally into a short part and a long part according to sentence length, translate both with the NAT models, and measure the ratio of repeated tokens. As shown in Table IV, NAT-Base suffers from over-translation errors, especially on long sentences, which is in line with [11]. As the sentence length grows, it becomes harder for NAT to align to the source, which leads to more translation errors and eventually damages the translation quality. In Coverage-NAT, however, the ratio of repeated tokens drops markedly, by 2.16% and 4.53% on short and long sentences, which shows that our model brings larger improvements as sentences become longer. Moreover, we remove the SCA from Coverage-NAT and compare it with NAT-Base: this still yields decreases of 0.65% and 1.84% in repeated tokens on short and long sentences, respectively. The results prove that both token-level and sentence-level coverage can reduce over-translation errors, and that sentence-level coverage contributes more than token-level coverage.

                              TABLE IV
   THE RATIO OF REPEATED TOKENS (%) FOR TRANSLATIONS ON THE WMT14 EN→DE DEVELOPMENT SET. over-translation ERRORS GROW AS THE SENTENCE BECOMES LONGER, AND THE SCA ACHIEVES HIGHER IMPROVEMENTS THAN THE TCIR.

   Model             Short           Long             All
   NAT-Base          5.47            12.05            10.23
   Coverage-NAT      3.31 (-2.16)    7.52 (-4.53)     6.36 (-3.87)
     – SCA           4.82 (-0.65)    10.21 (-1.84)    8.72 (-1.51)

E. Subjective Evaluation of Under-Translation

   We conduct a subjective evaluation to validate our improvements on under-translation errors. For each sentence, two human evaluators are asked to select, from the translations of three anonymous systems, the one with the fewest under-translation errors. We evaluate a total of 300 sentences sampled randomly from the WMT14 De→En test set. We then calculate the under-translation metric as the frequency with which each system is selected; the higher the metric, the fewer the under-translation errors. Table V shows the results. Note that there are some tied selections, so the three scores do not sum to 1. The results show that Coverage-NAT produces fewer under-translation errors than NAT-Base, and that both token-level and sentence-level coverage modeling are effective in reducing them.

                              TABLE V
   SUBJECTIVE EVALUATION OF under-translation ERRORS. THE HIGHER THE METRIC, THE FEWER THE ERRORS.

   Model                        Under-Translation Metric
   NAT-Base                     58.9%
   Coverage-NAT (w/o SCA)       67.1%
   Coverage-NAT                 70.6%

F. Effects of Source Length

[Figure 3: BLEU of Transformer, Coverage-NAT, Coverage-NAT (w/o SCA) and NAT-Base across source-length buckets.]
Fig. 3. The BLEU with different source lengths on the WMT14 En→De development set. Compared to NAT-Base, Coverage-NAT is closer to Transformer, with a slighter decline in [10,50) and a greater increase in [50,80).

   To evaluate the effect of sentence length, we split the WMT14 En→De development set into different buckets according to the source length and compare the BLEU performance of our model with the other baselines. Figure 3 shows the results. As the sentence length grows, the performance of NAT-Base first drops continuously, then rises slightly, and finally declines heavily on very long sentences. Transformer, in contrast, keeps a high performance until the sentence length becomes very long, which indicates that translating very long sentences is genuinely difficult. Compared to NAT-Base, the curve of Coverage-NAT is closer to that of Transformer, with a slighter decline in [10,50) and a greater increase in [50,80). The result of Coverage-NAT (w/o SCA) lies between Coverage-NAT and NAT-Base. This demonstrates that our model generates better translations as sentences become longer and that both token-level and sentence-level coverage modeling are effective at improving translation quality, in line with Table IV.
                              TABLE VI
   TRANSLATION EXAMPLES FROM THE WMT14 DE→EN TEST SET. WE USE UNDERLINES TO MARK over-translation ERRORS AND WAVY LINES TO MARK under-translation ERRORS. NOTE THAT THE TEXT IN BRACKETS IS ADDED TO INDICATE THE MISSING PARTS RATHER THAN THE EXACT TRANSLATION. OUR MODEL CAN ALLEVIATE THOSE ERRORS AND GENERATE BETTER TRANSLATIONS.

   Source                   Edis erklärte, Coulsons Aktivitäten bei dieser Story folgten demselben Muster wie bei anderen
                            wichtigen Persönlichkeiten, etwa dem früheren Innenminister David Blunkett.
   Reference                Mr Edis said Coulson's involvement in the story followed the same pattern as with other
                            important men, such as former home secretary David Blunkett.
   Transformer              Edis explained that Coulson's activities in this story followed the same pattern as those of other
                            important personalities, such as former Interior Minister David Blunkett.
   NAT-Base                 Edis explained that Coulson's activities in this story followed the same pattern as other other
                            (important) person personalities such as former former (Inter)ior Minister David Blunkett.
   Coverage-NAT (w/o SCA)   Edis explained that Coulson's activities in this story followed the same pattern as other other
                            important personalities, such as former Interior Minister David Blunkett.
   Coverage-NAT             Edis explained that Coulson's activities in this story followed the same pattern as with other
                            important personalities, such as former Interior Minister David Blunkett.

G. Case Study

   We present several translation examples from the WMT14 De→En test set in Table VI, including the source, the reference and the translations produced by Transformer, NAT-Base, Coverage-NAT (w/o SCA) and Coverage-NAT. As the table shows, NAT-Base suffers from severe over-translation (e.g. "other other") and under-translation (e.g. the missing "important") errors. Our Coverage-NAT alleviates these errors and generates higher-quality translations. The differences between Coverage-NAT (w/o SCA) and Coverage-NAT further demonstrate the effectiveness of sentence-level coverage.

                         VI. RELATED WORK

   Gu et al. [6] introduce non-autoregressive neural machine translation to accelerate inference. Recent studies have tried to bridge the gap between non-autoregressive and autoregressive models. Lee et al. [15] refine the translation iteratively, feeding the decoder outputs back as inputs in the next iteration. Ghazvininejad et al. [22] propose a mask-predict method that generates translations over several decoding iterations. The approaches of [23], [24] are based on iterative refinement, too. These methods refine the translation by decoding iteratively, whereas our TCIR refines the source exploitation by iteratively modeling the coverage within an individual layer. Besides, Libovický and Helcl [25] introduce a connectionist temporal classification loss to guide the NAT model. Shao et al. [10], [11] train NAT with sequence-level objectives and a bag-of-ngrams loss. Wang et al. [7] propose NAT-REG with two auxiliary regularization terms to alleviate over-translation and under-translation errors. Guo et al. [26] enhance the input of the decoder with target-side information. Some works introduce latent variables to guide generation [17], [23], [27], [28]. Ran et al. [29] and Bao et al. [30] consider reordering information in decoding. Li et al. [16] use AT models to guide NAT training. Wang et al. [31] and Ran et al. [32] further propose semi-autoregressive methods.
   To alleviate over-translation and under-translation errors, coverage modeling is commonly used in SMT [12] and NMT [8], [9], [33]–[35]. Tu et al. [8] and Mi et al. [9] model the coverage in a step-wise manner and employ it as an additional source feature, which cannot be applied to NAT directly. In contrast, our TCIR models the coverage in an iteration-wise manner and employs it as attention biases, which does not hurt the non-autoregressive generation. Li et al. [33] introduce a coverage-based feature into NMT. Zheng et al. [34], [35] explicitly model the translated and untranslated contents.
   Agreement-based learning in NMT has been explored in [36]–[39]. Yang et al. [13] propose a sentence-level agreement between the source and the reference, while our SCA considers the sentence-level agreement between the source and the translation. Besides, Tu et al. [40] propose to reconstruct the source from the translation, which improves translation adequacy but cannot model the sentence-level coverage directly.

                         VII. CONCLUSION

   In this paper, we propose a novel Coverage-NAT, which explicitly models the coverage information at both the token level and the sentence level to alleviate the over-translation and under-translation errors in NAT. Experimental results show that our model achieves strong BLEU improvements with a competitive speedup over the baseline system. The analysis further proves that it can effectively reduce translation errors.
REFERENCES

[1] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder–decoder for statistical machine translation," in EMNLP, 2014, pp. 1724–1734.
[2] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.
[3] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in 3rd International Conference on Learning Representations (ICLR 2015), 2015.
[4] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey et al., "Google's neural machine translation system: Bridging the gap between human and machine translation," arXiv preprint arXiv:1609.08144, 2016.
[5] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[6] J. Gu, J. Bradbury, C. Xiong, V. O. Li, and R. Socher, "Non-autoregressive neural machine translation," in 6th International Conference on Learning Representations (ICLR 2018), 2018.
[7] Y. Wang, F. Tian, D. He, T. Qin, C. Zhai, and T.-Y. Liu, "Non-autoregressive machine translation with auxiliary regularization," in AAAI, vol. 33, 2019, pp. 5377–5384.
[8] Z. Tu, Z. Lu, Y. Liu, X. Liu, and H. Li, "Modeling coverage for neural machine translation," in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 76–85.
[9] H. Mi, B. Sankaran, Z. Wang, and A. Ittycheriah, "Coverage embedding models for neural machine translation," in EMNLP, 2016, pp. 955–960.
[10] C. Shao, Y. Feng, J. Zhang, F. Meng, X. Chen, and J. Zhou, "Retrieving sequential information for non-autoregressive neural machine translation," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 3013–3024.
[11] C. Shao, J. Zhang, Y. Feng, F. Meng, and J. Zhou, "Minimizing the bag-of-ngrams difference for non-autoregressive neural machine translation," in Proceedings of the AAAI Conference on Artificial Intelligence, 2020.
[12] P. Koehn, F. J. Och, and D. Marcu, "Statistical phrase-based translation," in Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology – Volume 1, 2003, pp. 48–54.
[13] M. Yang, R. Wang, K. Chen, M. Utiyama, E. Sumita, M. Zhang, and T. Zhao, "Sentence-level agreement for neural machine translation," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 3076–3082.
[14] S. Chatterjee and N. Cancedda, "Minimum error rate training by sampling the translation lattice," in EMNLP, 2010, pp. 606–615.
[15] J. Lee, E. Mansimov, and K. Cho, "Deterministic non-autoregressive neural sequence modeling by iterative refinement," in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 1173–1182.
[16] Z. Li, Z. Lin, D. He, F. Tian, T. Qin, L. Wang, and T.-Y. Liu, "Hint-based training for non-autoregressive machine translation," in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 5712–5717.
[17] X. Ma, C. Zhou, X. Li, G. Neubig, and E. Hovy, "FlowSeq: Non-autoregressive conditional sequence generation with generative flow," in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 4273–4283.
[18] R. Sennrich, B. Haddow, and A. Birch, "Neural machine translation of rare words with subword units," in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 1715–1725.
[19] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "Bleu: a method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311–318.
[20] Y. Kim and A. M. Rush, "Sequence-level knowledge distillation," in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016, pp. 1317–1327.
[21] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in ICLR, 2015.
[22] M. Ghazvininejad, O. Levy, Y. Liu, and L. Zettlemoyer, "Mask-predict: Parallel decoding of conditional masked language models," in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 6114–6123.
[23] R. Shu, J. Lee, H. Nakayama, and K. Cho, "Latent-variable non-autoregressive neural machine translation with deterministic inference using a delta posterior," in Proceedings of the AAAI Conference on Artificial Intelligence, 2019.
[24] J. Gu, C. Wang, and J. Zhao, "Levenshtein transformer," in Advances in Neural Information Processing Systems, 2019, pp. 11179–11189.
[25] J. Libovický and J. Helcl, "End-to-end non-autoregressive neural machine translation with connectionist temporal classification," in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 3016–3021.
[26] J. Guo, X. Tan, D. He, T. Qin, L. Xu, and T.-Y. Liu, "Non-autoregressive neural machine translation with enhanced decoder input," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 3723–3730.
[27] L. Kaiser, S. Bengio, A. Roy, A. Vaswani, N. Parmar, J. Uszkoreit, and N. Shazeer, "Fast decoding in sequence models using discrete latent variables," in ICML, 2018.
[28] N. Akoury, K. Krishna, and M. Iyyer, "Syntactically supervised transformers for faster neural machine translation," in ACL, 2019, pp. 1269–1281.
[29] Q. Ran, Y. Lin, P. Li, and J. Zhou, "Guiding non-autoregressive neural machine translation decoding with reordering information," in AAAI, 2021.
[30] Y. Bao, H. Zhou, J. Feng, M. Wang, S. Huang, J. Chen, and L. Li, "Non-autoregressive transformer by position learning," arXiv preprint arXiv:1911.10677, 2019.
[31] C. Wang, J. Zhang, and H. Chen, "Semi-autoregressive neural machine translation," in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 479–488.
[32] Q. Ran, Y. Lin, P. Li, and J. Zhou, "Learning to recover from multi-modality errors for non-autoregressive neural machine translation," in ACL, 2020.
[33] Y. Li, T. Xiao, Y. Li, Q. Wang, C. Xu, and J. Zhu, "A simple and effective approach to coverage-aware neural machine translation," in ACL, 2018, pp. 292–297.
[34] Z. Zheng, H. Zhou, S. Huang, L. Mou, X. Dai, J. Chen, and Z. Tu, "Modeling past and future for neural machine translation," Transactions of the Association for Computational Linguistics, vol. 6, pp. 145–157, 2018.
[35] Z. Zheng, S. Huang, Z. Tu, X. Dai, and J. Chen, "Dynamic past and future for neural machine translation," in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 930–940.
[36] L. Liu, M. Utiyama, A. Finch, and E. Sumita, "Agreement on target-bidirectional neural machine translation," in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 411–416.
[37] Y. Cheng, S. Shen, Z. He, W. He, H. Wu, M. Sun, and Y. Liu, "Agreement-based joint training for bidirectional attention-based neural machine translation," in Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI'16), 2016, pp. 2761–2767.
[38] Z. Zhang, S. Wu, S. Liu, M. Li, M. Zhou, and T. Xu, "Regularizing neural machine translation by target-bidirectional agreement," in AAAI, vol. 33, 2019, pp. 443–450.
[39] M. Al-Shedivat and A. Parikh, "Consistency by agreement in zero-shot neural machine translation," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 1184–1197.
[40] Z. Tu, Y. Liu, L. Shang, X. Liu, and H. Li, "Neural machine translation with reconstruction," in Thirty-First AAAI Conference on Artificial Intelligence, 2017.