Re-translation versus Streaming for Simultaneous Translation

Page created by Jamie Gordon
 
CONTINUE READING
Re-translation versus Streaming for Simultaneous Translation

                                                    Naveen Arivazhagan∗, Colin Cherry∗, Wolfgang Macherey and George Foster
                                                                                Google Research
                                                         {navari,colincherry,wmach,fosterg}@google.com

                                                                  Abstract                       however, the prohibition against revising output is
                                                                                                 overly stringent.
                                             There has been great progress in improving             The ability to revise previous partial translations
arXiv:2004.03643v2 [cs.CL] 14 Apr 2020

                                             streaming machine translation, a simultane-
                                                                                                 makes simply re-translating each successive source
                                             ous paradigm where the system appends to
                                             a growing hypothesis as more source con-            prefix a viable strategy. Compared to streaming
                                             tent becomes available. We study a related          models, re-translation has the advantage of low la-
                                             problem in which revisions to the hypothe-          tency, since it always attempts a translation of the
                                             sis beyond strictly appending words are per-        complete source prefix, and high final-translation
                                             mitted. This is suitable for applications such      quality, since it is not restricted to preserving pre-
                                             as live captioning an audio feed. In this           vious output. It has the disadvantages of higher
                                             setting, we compare custom streaming ap-            computational cost, and a high revision rate, vis-
                                             proaches to re-translation, a straightforward
                                                                                                 ible as textual instability in an online translation
                                             strategy where each new source token triggers
                                             a distinct translation from scratch. We find re-    display. When revisions are an option, it is unclear
                                             translation to be as good or better than state-     whether one should prefer a specialized streaming
                                             of-the-art streaming systems, even when op-         model, or a re-translation strategy.
                                             erating under constraints that allow very few          In light of this, we make the following contri-
                                             revisions. We attribute much of this success        butions: (1) We evaluate three techniques to im-
                                             to a previously proposed data-augmentation
                                                                                                 prove re-translation that have not previously been
                                             technique that adds prefix-pairs to the training
                                             data, which alongside wait-k inference forms a
                                                                                                 combined. (2) We provide the first empirical com-
                                             strong baseline for streaming translation. We       parison of re-translation and streaming models,
                                             also highlight re-translation’s ability to wrap     demonstrating that re-translation operating in a
                                             arbitrarily powerful MT systems with an ex-         very low-revision regime can match or beat the
                                             periment showing large improvements from an         quality-latency trade-offs of streaming models. (3)
                                             upgrade to its base model.                          We test a 0-revision configuration of re-translation,
                                                                                                 and show that it is surprisingly competitive, due to
                                         1   Introduction                                        the effectiveness of data augmentation with prefix
                                         In simultaneous machine translation, the goal is        pairs.
                                         to translate an incoming stream of source words
                                         with as low latency as possible. A typical applica-     2   Related Work
                                         tion is speech translation, where target words must
                                                                                                 Cho and Esipova (2016) propose the first streaming
                                         be appended to existing output with no possibil-
                                                                                                 techniques for NMT, using heuristic agents based
                                         ity for revision. The corresponding task, which
                                                                                                 on model scores, while Gu et al. (2017) extend their
                                         we refer to as streaming translation, has received
                                                                                                 work with agents learned using reinforcement learn-
                                         considerable recent attention, generating custom
                                                                                                 ing. Ma et al. (2019a) recently broke new ground
                                         approaches designed to maximize quality and min-
                                                                                                 by integrating their read-write agent directly into
                                         imize latency (Cho and Esipova, 2016; Gu et al.,
                                                                                                 NMT training. Similar to Dalvi et al. (2018), they
                                         2017; Dalvi et al., 2018; Ma et al., 2019a). For
                                                                                                 employ a simple agent that first reads k source to-
                                         certain related applications such as live captioning,
                                                                                                 kens, and then proceeds to alternate between writes
                                             ∗
                                                 Equal contributions                             and reads until the source sentence has finished.
This agent is easily integrated into NMT training,       3.2   Latency
which allows the NMT engine to learn to anticipate       Latency is the amount of time the target listener
occasionally-missing source context. We employ           spends waiting for their translation. Most latency
their wait-k training as a baseline, and use their       metrics are based on a delay vector g, where gj
wait-k inference to improve re-translation. Our sec-     reports how many source tokens were read be-
ond and strongest streaming baseline is the MILk         fore writing the j th target token (Cho and Esipova,
approach of Arivazhagan et al. (2019b), who im-          2016). This delay is trivial to determine for stream-
prove upon wait-k training with a hierarchical at-       ing systems, but to address the scenario where tar-
tention that can adapt how it will wait based on the     get content can change, we introduce the notion of
current context. Both wait-k training and MILk           content delay.
attention provide hyper-parameters to control their         We take the pessimistic view that content in flux
quality-latency trade-offs: k for wait-k, and latency    is useless; for example, in Table 1, the 4th target
weight for MILk.                                         token first appears in step 4, but only becomes
   Re-translation was originally investigated by         useful in step 7, when it shifts from be to slow.
Niehues et al. (2016, 2018), and more recently ex-       Therefore, we calculate delay with respect to when
tended by Arivazhagan et al. (2019a), who propose        a token finalizes. Let oi,j be the j th token of the ith
an evaluation framework suitable for re-translation,     output in a PTL; 1 ≤ i ≤ I and 1 ≤ j ≤ J. For
and use it to assess inference-time re-translation       each position j in the final output, we define gj as:
strategies for speech translation. We adopt their
inference-time heuristics to stabilize re-translation,    gj = min s.t. oi0 ,j 0 = oI,j 0 ∀i0 ≥ i and ∀j 0 ≤ j
                                                                     i
and extend them with prefix training from Niehues
et al. (2018). Where they experiment on TED talks,       that is, the number of source tokens read before
compare only to vanilla re-translation and use pro-      the prefix ending in j took on its final value. The
prietary NMT, we follow recent work on streaming         Content Delay row in Table 1 shows delays for
by using WMT training and test data, and provide         our running example. Note that this is identical
a novel comparison to streaming approaches.              to standard delay for streaming systems, which
                                                         always have stable prefixes.
3     Metrics                                               With this refined g, we can make several latency
                                                         metrics content-aware, including average propor-
We adapt the evaluation framework from Arivazha-         tion (Cho and Esipova, 2016), consecutive wait (Gu
gan et al. (2019a), which includes metrics for la-       et al., 2017), average lagging (Ma et al., 2019a),
tency, stability, and quality. Where they measure la-    and differentiable average lagging (Arivazhagan
tency with a temporal lag, we adopt an established       et al., 2019b). We opt for differentiable average lag-
token lag that does not rely on machine speed.           ging (DAL) because of its interpretability and be-
   Our evaluation is built around a prefix transla-      cause it sidesteps some problems with average lag-
tion list (PTL), which can be generated for any          ging (Cherry and Foster, 2019). It can be thought
streaming or re-translation system. For each token       of as the average number of source tokens a system
in the source sentence (after merging subwords),         lags behind a perfectly simultaneous translator:
this list stores the tokenized system output. Table 1                                  J
                                                                                     1 X 0 j−1
shows an example. We use I for the final number                              DAL =       gj −
of source tokens, and J for the final number of                                      J        γ
                                                                                       j=1
target tokens.
                                                         where γ = J/I accounts for the source and target
3.1    Quality                                           having different lengths, and g 0 adjusts g to incor-
                                                         porate a minimal time cost of γ1 for each token:
Translation quality is measured by calculating                               
BLEU (Papineni et al., 2002) on the final output of                              gj                j=1
                                                               gj0       =                 0     1
each PTL; that is, standard corpus-level BLEU on                                 max gj , gj−1 + γ j > 1
complete translations. We make no attempt to di-
rectly measure the quality of intermediate outputs;      3.3   Stability
instead, their quality is captured indirectly through    Following Niehues et al. (2016, 2018) and Ari-
final output quality and stability.                      vazhagan et al. (2019a), we measure stability with
Source              Output                                                                          Erasure
      1: Neue             New                                                                                -
      2: Arzneimittel     New Medicines                                                                      0
      3: könnten         New Medicines                                                                      0
      4: Lungen-          New    drugs            may     be     lung                                        1
      5: und              New    drugs           could    be     lung     and                                3
      6: Eierstockkrebs   New    drugs            may     be     lung     and     ovarian     cancer         4
      7: verlangsamen     New    drugs            may    slow    lung     and     ovarian     cancer         5
      Content Delay        1       4               6       7       7       7         7          7

Table 1: An example prefix translation list for the tokenized German sentence, “Neue Arzneimittel könnten Lungen-
und Eierstockkrebs verlangsamen”, with reference, “New drugs may slow lung , ovarian cancer”.

erasure, which measures the length of the suffix          4.2    Inference-time Heuristics
that is deleted to produce the next revision. Let oi      To improve stability, Arivazhagan et al. (2019a)
be the ith output of a PTL. The normalized erasure        propose a combination of biased search and de-
(NE) for PTL is defined as:                               layed predictions. Biased search encourages the
                 I                                        system to respect its previous predictions by modi-
               1X
        NE =       |oi−1 | − |LCP(oi , oi−1 )|            fying search to interpolate between the distribution
               J
                 i=2                                      from the NMT model (with weight 1 − β) and the
where the | · | operator returns the length of a token    one-hot distribution formed by the system’s trans-
sequence, and LCP calculates the longest common           lation of the previous prefix (with weight β).
prefix of two sequences. Table 1 shows pointwise             To delay predictions until more source context
erasures for each output; its NE would be 13/8 =          is available, we adopt Ma et al. (2019a)’s wait-k
1.625, interpretable as the number of intermediate        approach at inference time to truncate the target to
tokens deleted for each final token.                      a maximum length determined by the source length
                                                          less a constant k. To avoid confusion with Ma et al.
4     Re-translation Methods                              (2019a)’s wait-k training, we refer to wait-k used
To evaluate re-translation, we build up the source        for re-translation as wait-k inference.1
sentence one token at a time, translating each re-
sulting source prefix from scratch to construct the       5     Experiments
PTL for evaluation.                                       We use standard WMT14 English-to-French (EnFr;
                                                          36.3M sentences) and WMT15 German-to-English
4.1     Prefix Training
                                                          (DeEn; 4.5M sentences) data. For EnFr, we use
Standard models trained on full sentences are un-         newstest 2012+2013 for development, and newstest
likely to perform well when applied to prefixes. We       2014 for test. For DeEn, we validate on newstest
alleviate this problem by generating prefix pairs         2013 and report results on newstest 2015. We use
from our parallel training corpus, and subsequently       BPE (Sennrich et al., 2016) on the training data
training on a 1:1 mix of full-sentence and prefix         to construct a 32K-type vocabulary that is shared
pairs (Niehues et al., 2018; Dalvi et al., 2018). Fol-    between the source and target languages.
lowing Niehues et al. (2018), we augment our train-
ing data with prefix pairs created by selecting a         5.1    Models
source prefix length uniformly at random, then se-        Our streaming and re-translation models are im-
lecting a target length either proportionally accord-     plemented in Lingvo (Shen et al., 2019), sharing
ing to sentence length, or based on self-contained        architecture and hyper-parameters wherever possi-
word alignments. In preliminary experiments, we           ble. Our RNMT+ architecture (Chen et al., 2018)
confirmed a finding by Niehues et al. (2018) that         consists of a 6 layer LSTM encoder and an 8 layer
word-alignment-based prefix selection is no better
                                                             1
than proportional selection, so we report results              When wait-k truncation is combined with beam search, its
                                                          behavior is similar to that of Zheng et al. (2019b): sequences
only for the proportional method. An example of           are scored accounting for “future” tokens that will not be
proportional prefix training is given in Table 2.         shown to the user.
Source    Die Führungskräfte der Republikaner rechtfertigen ihre Politik mit der
      Full
                         Notwendigkeit , den Wahlbetrug zu bekämpfen [15 tokens]
               Target    Republican leaders justified their policy by the need to combat electoral
                         fraud [12 tokens]

               Source    Die Führungskräfte der Republikaner rechtfertigen [5 tokens]
      Prefix
               Target    Republican leaders justified their [4 tokens]

Table 2: An example of proportional prefix training. Each example in the minibatch has a 50% chance to be
truncated, in which case, we truncate its source and target to a randomly-selected fraction of their original lengths,
1/3 in this example. No effort is made to ensure that the two halves of the prefix pair are semantically equivalent.

LSTM decoder with additive attention (Bahdanau               by normalized erasure (NE in § 3.3), to negligible
et al., 2014). Both encoder and decoder LSTMs                levels (Arivazhagan et al., 2019a). But how does re-
have 512 hidden units, apply per-gate layer nor-             translation compare to competing approaches? To
malization (Ba et al., 2016), and use residual skip          answer this, we compare the quality-latency trade-
connections after the second layer.                          offs achieved by re-translation in a low-revision
   The models are regularized using a dropout of             regime to those of our streaming baselines.
0.2 and label smoothing of 0.1 (Szegedy et al.,                 First, we need a clear definition of low-revision
2016). Models are optimized using 32-way data                re-translation. By manual inspection on the DeEn
parallelism with Google Cloud’s TPUv3, using                 development set, we observe that systems with an
Adam (Kingma and Ba, 2015) with the learning                 NE of 0.2 or lower display many different latency-
rate schedule described in Chen et al. (2018) and a          quality trade-offs. But is NE stable across evalu-
batch size of 4,096 sentence-pairs. Checkpoints for          ation sets? When we compare development and
the base models are selected based on development            test NE for all 50 non-zero-erasure combinations
perplexity.                                                  of β and k, the average absolute difference is 0.005
Streaming We train several wait-k training and               for DeEn, and 0.004 for EnFr, indicating that de-
MILk models to obtain a range of quality-latency             velopment NE is very predictive of test NE. This
trade-offs. Five wait-k training models are trained          gives us an operational definition of low-revision
with sub-word level waits of 2, 4, 6, 8, and 10. Five        re-translation as any configuration with a dev NE
MILk models are trained with latency weights of              < 0.2, allowing on average less than 1 token to be
0.2, 0.3, 0.4, 0.5 and 0.75. All streaming models            revised for every 5 tokens in the system output.
use unidirectional encoders and greedy search.                  Since we need to vary both β and k for our re-
Re-translation We test two NMT architectures                 translation systems, we plot BLEU versus DAL
with re-translation: a Base system with unidirec-            curves by finding the Pareto frontier on the dev set,
tional encoding and greedy search, designed for              and then projecting to the test set. To ensure a fair
fair comparisons to our streaming baselines above;           comparison to our baselines, we test only the Base
and a more powerful Bidi+Beam system using bidi-             system here. As an ablation, we include a variant
rectional encoding and beam search of size 20,               that does not use proportional prefixes, and instead
designed to test the impact of an improved base              trains only on full sentences.
model. Training data is augmented through the                  Figure 1 shows our results. Re-translation is
proportional prefix training method unless stated           nicely separated from wait-k, and intertwined with
otherwise (§ 4.1). Beam-search bias β is varied in          the adaptive MILk. In fact, it is noticeably better
the range 0.0 to 1.0 in increments of 0.2. When             than MILk at several latency levels for EnFr. Since
wait-k inference is enabled, k is varied in 1, 2, 4,        re-translation is not adaptive, this indicates that
6, 8, 10, 15, 20, 30. Note that we do not need to           being able to make a small number of revisions is
re-train to test different values of β or k.                quite advantageous for finding good quality-latency
                                                            trade-offs. On the other hand, the ablation curve,
5.2    Translation with few revisions                       “Re-trans NE < 0.2 No Prefix” is much worse, in-
Biased search and wait-k inference used together            dicating that proportional prefix training is very
can reduce re-translation’s revisions, as measured          valuable in this setting. We probe its value further
Figure 1: BLEU vs lag (DAL) curves for translation with low erasure for DeEn (left) and EnFr (right).

in the next experiment.                                           ized training, and confirming our earlier observa-
                                                                  tions on the surprising effectiveness of prefix train-
5.3    Translation with no revisions                              ing. Finally, we see that even without revisions,
Motivated by the strong performance of re-                        re-translation is very close to MILk, suggesting
translation with few revisions, we now evaluate                   that this combination of prefix training and wait-k
it with no revisions, by setting β to 1, which guar-              inference is an extremely strong baseline, even for
antees NE = 0. Since β is locked at 1, we can build               a 0-revision regime.
a curve by varying k from 2 to 10 in increments
of 2. In this setting, re-translation becomes equiv-              5.4    Extendability of re-translation
alent to wait-k inference without wait-k training,                Re-translation’s primary strengths lie in its ability
which is studied as an ablation to wait-k training                to revise and its ability to apply to any MT sys-
by Ma et al. (2019a).2 However, where they tested                 tem. With some effort, streaming systems can be
wait-k inference on a system with full-sentence                   fitted with enhancements such as bidirectional en-
training, we do so for a system with proportional                 coding (Ma et al., 2019a),3 beam search (Zheng
prefix training (§ 4.1). As before, we compare to                 et al., 2019b) and multihead attention (Ma et al.,
our streaming baselines, test only our Base system,               2019b). Conversely, re-translation can wrap any
and include a no-prefix ablation corresponding to                 auto-regressive NMT system and immediately ben-
full-sentence training.                                           efit from its improvements. Furthermore, re-
   Results are shown in Figure 2. First, re-                      translation’s latency-quality trade-off can be ma-
translation outperforms wait-k training at almost                 nipulated without retraining the base system. It is
all latency levels. This is startling, because each               not the only solution to have these properties; most
wait-k training point is trained specifically for its             policies that are not trained jointly with NMT can
k, while the re-translation points reflect a single               make the same claims (Cho and Esipova, 2016; Gu
training run, reconfigured for different latencies by             et al., 2017; Zheng et al., 2019a). We conduct an
adjusting k at test time. We suspect that this im-                experiment to demonstrate the value of this flexibil-
provement stems from prefix-training introducing                  ity, by comparing our Base system to the upgraded
stochasticity to the amount of source context used                Bidi+Beam. We carry out this test with few revi-
to predict target words, making the model more ro-                sions (NE < 0.2) and without revisions (NE = 0),
bust. Second, without prefix training, re-translation             projecting Pareto curves from dev to test where
is consistently below wait-k training, confirming                 necessary. The results are shown in Figure 3.
earlier experiments by Ma et al. (2019a) on the                       Comparing the few-revision (NE < 0.2) curves,
ineffectiveness of wait-k inference without special-              we see large improvements, some more than 2
   2
     Re-translation with beam search and with β = 1 is similar    BLEU points, from using better models. Look-
to wait-k inference with speculative beam search (Zheng et al.,
                                                                     3
2019b), due to effective look-ahead from implementing wait-k          Training a streaming model with bidirectional encoding
with truncation. However, we only evaluate greedy search in       comes at a quadratic (as opposed to linear) memory cost in the
this comparison, where their equivalence is exact.                source sequence length, necessitating much smaller batches.
Figure 2: BLEU vs lag (DAL) curves for translation with no erasure, for DeEn (left) and EnFr (right).

 Figure 3: BLEU vs lag (DAL) curves for re-translation with model extensions for DeEn (left) and EnFr (right).

ing at the no-revision (NE = 0) curves, we see that         absolute time to translate a given source prefix re-
this configuration also benefits from modeling im-          mains small enough to be imperceptible. As in all
provements, but for DeEn, the deltas are noticeably         simultaneous systems, the largest source of latency
smaller than those of the few-revision curves.              is waiting for new source content to arrive.5

5.5      On computational complexity                        6    Conclusion
Re-translation is conceptually simple and easy to           We have presented the first comparison of re-
implement, but also incurs an increase in asymp-            translation and streaming strategies for simultane-
totic time complexity. If the base model can trans-         ous translation. We have shown re-translation with
late a sentence in time O(x), then re-translation           low levels of erasure (NE < 0.2) to be as good or
takes O(nx) where n is the source sentence length,          better than the state of the art in streaming transla-
due to repeated translation of the growing prefix.          tion. Also, re-translation easily embraces arbitrary
However, for many settings, this increase in com-           improvements to NMT, which we have highlighted
plexity can be easily ignored.
   We are not concerned with the total time to trans-       NVIDIA TITAN RTX, according to MLPerf v0.5 Inference
                                                            Closed NMT Stream. Retrieved from www.mlperf.org Novem-
late a sentence, but instead with the latency be-           ber 6th, 2019, entry Inf-0.5-27. MLPerf name and logo are
tween a new source word being uttered and its               trademarks. See www.mlperf.org for more information.
                                                                5
translation’s appearance on the screen. Modern                    We acknowledge the utility of low computational com-
                                                            plexity, both in reducing the cumulative burden alongside
accelerators can translate a complete sentence in           other sources of latency, such as network delays, and in re-
the range of 100 milliseconds,4 meaning that the            ducing power consumption for mobile or green computing.
                                                            Finding a lower-cost solution that preserves the flexibility of
   4
       24.43 milliseconds per sentence is achievable on a   re-translation is a worthy goal for future research.
with large gains from an upgraded base model.               Language Technologies, Volume 2 (Short Papers),
   In our setting, re-translation with no erasure re-       pages 493–499. Association for Computational Lin-
                                                            guistics.
duces to wait-k inference, which we have shown to
be much more effective than previously reported,          Jiatao Gu, Graham Neubig, Kyunghyun Cho, and Vic-
so long as the underlying NMT system’s training              tor O.K. Li. 2017. Learning to translate in real-time
data has been augmented with prefix pairs. Due to            with neural machine translation. In Proceedings of
                                                             the 15th Conference of the European Chapter of the
its simplicity and its effectiveness, we suggest it as       Association for Computational Linguistics: Volume
a strong baseline for future research on simultane-          1, Long Papers, pages 1053–1062. Association for
ous translation.                                             Computational Linguistics.

                                                          Diederik P. Kingma and Jimmy Ba. 2015. Adam: A
                                                            method for stochastic optimization. In International
References                                                  Conference on Learning Representations.
Naveen Arivazhagan, Colin Cherry, Te I, Wolfgang
  Macherey, Pallavi Baljekar, and George Foster.          Mingbo Ma, Liang Huang, Hao Xiong, Renjie Zheng,
  2019a. Re-translation strategies for long form, si-       Kaibo Liu, Baigong Zheng, Chuanqiang Zhang,
  multaneous, spoken language translation. CoRR,            Zhongjun He, Hairong Liu, Xing Li, Hua Wu, and
  abs/1912.03393.                                           Haifeng Wang. 2019a. STACL: Simultaneous trans-
                                                            lation with implicit anticipation and controllable la-
Naveen Arivazhagan, Colin Cherry, Wolfgang                  tency using prefix-to-prefix framework. In Proceed-
  Macherey, Chung-Cheng Chiu, Semih Yavuz,                  ings of the 57th Annual Meeting of the Association
  Ruoming Pang, Wei Li, and Colin Raffel. 2019b.            for Computational Linguistics, pages 3025–3036,
  Monotonic infinite lookback attention for simulta-        Florence, Italy. Association for Computational Lin-
  neous machine translation. In Proceedings of the          guistics.
  57th Annual Meeting of the Association for Com-
  putational Linguistics, pages 1313–1323, Florence,      Xutai Ma, Juan Pino, James Cross, Liezl Puzon, and
  Italy. Association for Computational Linguistics.         Jiatao Gu. 2019b. Monotonic multihead attention.
                                                            CoRR, abs/1909.12406.
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hin-
   ton. 2016. Layer Normalization. arXiv e-prints,        Jan Niehues, Thai Son Nguyen, Eunah Cho, Thanh-Le
   page arXiv:1607.06450.                                    Ha, Kevin Kilgour, Markus Müller, Matthias Sper-
                                                             ber, Sebastian Stüker, and Alex Waibel. 2016. Dy-
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben-             namic transcription for low-latency speech transla-
  gio. 2014. Neural machine translation by jointly           tion. In Interspeech, pages 2513–2517.
  learning to align and translate. arXiv preprint
  arXiv:1409.0473.                                        Jan Niehues, Ngoc-Quan Pham, Thanh-Le Ha,
                                                             Matthias Sperber, and Alex Waibel. 2018. Low-
Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin                latency neural speech translation. In Proc. Inter-
  Johnson, Wolfgang Macherey, George Foster, Llion           speech 2018, pages 1293–1297.
  Jones, Mike Schuster, Noam Shazeer, Niki Parmar,
  Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser,         Kishore Papineni, Salim Roukos, Todd Ward, and Wei-
  Zhifeng Chen, Yonghui Wu, and Macduff Hughes.             Jing Zhu. 2002. Bleu: a method for automatic eval-
  2018. The best of both worlds: Combining recent           uation of machine translation. In Proceedings of the
  advances in neural machine translation. In Proceed-       40th Annual Meeting of the Association for Compu-
  ings of the 56th Annual Meeting of the Association        tational Linguistics.
  for Computational Linguistics (Volume 1: Long Pa-
  pers), pages 76–86. Association for Computational       Rico Sennrich, Barry Haddow, and Alexandra Birch.
  Linguistics.                                              2016. Neural machine translation of rare words with
                                                            subword units. In Proceedings of the 54th Annual
Colin Cherry and George Foster. 2019. Thinking slow         Meeting of the Association for Computational Lin-
  about latency evaluation for simultaneous machine         guistics (Volume 1: Long Papers), pages 1715–1725.
  translation. CoRR, abs/1906.00048.                        Association for Computational Linguistics.

Kyunghyun Cho and Masha Esipova. 2016. Can neu-           Jonathan Shen, Patrick Nguyen, Yonghui Wu, et al.
  ral machine translation do simultaneous translation?      2019. Lingvo: a modular and scalable frame-
  CoRR, abs/1606.02012.                                     work for sequence-to-sequence modeling. CoRR,
                                                            abs/1902.08295.
Fahim Dalvi, Nadir Durrani, Hassan Sajjad, and
  Stephan Vogel. 2018. Incremental decoding and           Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe,
  training methods for simultaneous translation in neu-     Jonathon Shlens, and Zbigniew Wojna. 2016. Re-
  ral machine translation. In Proceedings of the 2018       thinking the inception architecture for computer vi-
  Conference of the North American Chapter of the           sion. 2016 IEEE Conference on Computer Vision
  Association for Computational Linguistics: Human          and Pattern Recognition (CVPR), pages 2818–2826.
Baigong Zheng, Renjie Zheng, Mingbo Ma, and Liang
  Huang. 2019a. Simpler and faster learning of
  adaptive policies for simultaneous translation. In
  Proceedings of the 2019 Conference on Empirical
  Methods in Natural Language Processing and the
  9th International Joint Conference on Natural Lan-
  guage Processing (EMNLP-IJCNLP), pages 1349–
  1354, Hong Kong, China. Association for Computa-
  tional Linguistics.
Renjie Zheng, Mingbo Ma, Baigong Zheng, and Liang
  Huang. 2019b. Speculative beam search for simulta-
  neous translation. In Proceedings of the 2019 Con-
  ference on Empirical Methods in Natural Language
  Processing and the 9th International Joint Confer-
  ence on Natural Language Processing (EMNLP-
  IJCNLP), pages 1395–1402, Hong Kong, China. As-
  sociation for Computational Linguistics.
You can also read