Dynamic Position Encoding for Transformers

Joyce Zheng∗
Huawei Noah’s Ark Lab
jy6zheng@uwaterloo.ca

Mehdi Rezagholizadeh
Huawei Noah’s Ark Lab
mehdi.rezagholizadeh@huawei.com

Peyman Passban†
Amazon
passban.peyman@gmail.com

∗ Work done while Joyce Zheng was an intern at Huawei.
† Work done while Peyman Passban was at Huawei.

arXiv:2204.08142v1 [cs.CL] 18 Apr 2022

Abstract

Recurrent models dominated the field of neural machine translation (NMT) for the past few years. Transformers (Vaswani et al., 2017) have radically changed this by proposing a novel architecture that relies on a feed-forward backbone and a self-attention mechanism. Although Transformers are powerful, they can fail to properly encode sequential/positional information due to their non-recurrent nature. To solve this problem, position embeddings are defined exclusively for each time step to enrich word information. However, such embeddings are fixed after training, regardless of the task and the word-ordering system of the source or target language.

In this paper, we propose a novel architecture with new position embeddings that depend on the input text, addressing this shortcoming by taking the order of target words into consideration. Instead of using predefined position embeddings, our solution generates new embeddings to refine each word’s position information. Since we do not dictate the positions of source tokens but learn them in an end-to-end fashion, we refer to our method as dynamic position encoding (DPE). We evaluated the impact of our model on multiple datasets, translating from English into German, French, and Italian, and observed meaningful improvements over the original Transformer.

1   Introduction

In statistical machine translation (SMT), the general task of translation consists of reducing the input sentence into smaller units (also known as statistical phrases), selecting an optimal translation for each unit, and placing them in the correct order (Koehn, 2009). The last step, referred to as the word reordering problem, is a great source of complexity and importance, and is handled with a variety of statistical as well as string- and tree-based solutions (Bisazza and Federico, 2016).

As evident in SMT, the structure and position of words within a sentence are crucial for accurate translation. The importance of such information can also be explored in NMT and for Transformers. Previous literature, such as Chen et al. (2020), demonstrated that source input sentences enriched with target order information can improve translation quality in neural models. They showed that position encoding seems to play a key role in translation, and this motivated us to explore this area further.

Since Transformers have a non-recurrent architecture, they can face problems when encoding sequential data. As a result, they require an explicit summation of the input embeddings with position encodings to provide information about the order of each word. However, this approach falsely assumes that the correct position of each word is always its original position in the source sentence. This interpretation might be true when considering only the source side, whereas we know from SMT (Bisazza and Federico, 2016; Cui et al., 2016) that arranging input words with respect to the order of their target peers can lead to better results.

In this work, we explore injecting target position information alongside the source words to enhance Transformers’ translation quality. We first examine the accuracy and efficiency of a two-pass Transformer (2PT), which consists of a pipeline connecting two Transformers, where the first Transformer reorders the source sentence and the second one translates the reordered sentences. Although this approach incorporates order information from the target language, it lacks end-to-end training and requires more resources than a typical Transformer. Accordingly, we introduce a better alternative, which effectively learns the reordered positions in an end-to-end fashion and uses this
information with the original source sequence to boost translation accuracy. We refer to this alternative as Dynamic Position Encoding (DPE).

Our contribution in this work is threefold:

   • First, we demonstrate that providing source-side representations with target position information improves translation quality in Transformers.

   • Second, we propose a novel architecture, DPE, that efficiently learns reordered positions in an end-to-end fashion and incorporates this information into the encoding process.

   • Finally, using a preliminary two-pass design, we show the importance of end-to-end learning in this problem.

2   Background

2.1   Pre-Reordering in Machine Translation

In standard SMT, pre-reordering is a well-known technique. Usually, the source sentence is reordered using heuristics such that it follows the word order of the target language. Figure 1 illustrates this concept with an imaginary example.

Figure 1: The order of source words before (left-hand side) and after (right-hand side) pre-reordering. Si and Tj show the i-th source and j-th target words, respectively.

As the figure shows, the original alignments between source (Si) and target (Tj) words are used to define a new order for the source sentence. With the new order, the translation engine does not need to learn the relation between the source and target ordering systems, as it directly translates from one position on the source side to the same position on the target side. Clearly, this can significantly reduce the complexity of the translation task. The work of Wang et al. (2007) provides a good example of a system with pre-reordering, in which the authors studied this technique for the English–Chinese pair.

The concept of reordering does not necessarily need to be tackled prior to translation; in Koehn et al. (2007), generated sequences are reviewed by a classifier after translation to correct the position of words that are placed in the wrong order. The entire reordering process can also be embedded into the decoding process (Feng et al., 2013).

2.2   Tackling the Order Problem in NMT

Pre-reordering and position encoding are also common in NMT and have been explored by different researchers. Du and Way (2017) studied whether recurrent neural models can benefit from pre-reordering. Their findings show that these models might not require any order adjustments because the network itself is powerful enough to learn such mappings. Kawara et al. (2020), unlike the previous work, studied the same problem and reported promising observations on the usefulness of pre-reordering. They used a transduction-grammar-based method with recursive neural networks and showed how impactful pre-reordering can be.

Liu et al. (2020) followed a different approach and proposed to model position encoding as a continuous dynamical system through a neural ODE. Ke et al. (2020) investigated improving positional encoding by untying the relationship between words and positions. They suggest that there is no strong correlation between words and absolute positions, so they remove this noisy correlation. This form of separation has its own advantages, but by removing the relationship between words and positions from the translation process, they lose valuable semantic information about the source and target sides.

Shaw et al. (2018a) explored a relative position encoding method in Transformers by having the self-attention mechanism consider the distance between source words. Garg et al. (2019) incorporated target position information via multitasking, where a translation loss is combined with an alignment loss that supervises one decoder head to learn position information. Chen et al. (2020) changed the Transformer architecture to incorporate order information at each layer. They select a reordered target word position from the output of each layer and inject it into the next layer. This is the closest work to ours, so we consider it our main baseline.

3   Methodology

3.1   Two-Pass Translation for Pre-Reordering

Our goal is to boost translation quality in Transformers by injecting target order information into
the source-side encoding process. We propose dynamic position encoding (DPE) to achieve this, but before that, we discuss a preliminary two-pass Transformer (2PT) architecture to demonstrate the impact of order information in Transformers.

The main difference between 2PT and DPE is that DPE is an end-to-end, differentiable solution, whereas 2PT is a pipeline connecting two different Transformers. Both work towards leveraging order information to improve translation, but in different fashions. The 2PT architecture is illustrated in Figure 2.

Figure 2: The two-pass Transformer architecture. The input sequence is first re-ordered to a new form which is less complex for the translation Transformer. Then, the final translation is carried out by decoding a target sequence using the re-ordered input sequence.

2PT has two different Transformers. The first one is used for reordering rather than translation. It takes source sentences and generates a reordered version of them; e.g., referring back to Figure 1, if the input to the first Transformer is [S1, S2, S3, S4], the expected output should be [S1, S4, S3, S2]. In order to train such a reordering model, we created a new corpus using FastAlign (Dyer et al., 2013).1

FastAlign is an unsupervised word aligner that processes source and target sentences together and provides word-level alignments. It is usable at training time but not at inference time, because it requires access to both sides and we only have the source side at test time. As a solution, we create a training set using the alignments and use it to train the first Transformer in 2PT. Figure 3 shows the input and output format of FastAlign and how it helps generate training samples.

1 https://github.com/clab/fast_align

As Figure 3 demonstrates, given a pair of English–German sentences, word alignments are generated. In order to process the alignments, we designed rules to handle the different cases (a code sketch of these rules follows the example below):

   • One-to-Many Alignments: We only consider the first target position in one-to-many alignments (see the figure).

   • Many-to-One Alignments: Multiple source words are reordered together (as one unit, maintaining their relative positions with each other) using the position of the corresponding target word.

   • No Alignment: Words that do not have any alignment are skipped; we do not change their position.

   • Finally, we ensure that no source word is aligned with a position beyond the source sentence length.

Considering these rules and what FastAlign generates, the example input sentence “mr hän@@ sch represented you on this occasion .” (in Figure 3) is reordered to “mr hän sch you this represented .”. The @@ marks are auxiliary symbols added in between subword units during preprocessing; see Section 4 for more information.
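For illustration, a minimal Python sketch of this reordering step is given below. It assumes the alignments have already been parsed from FastAlign's output (pairs "i-j", meaning source word i aligns to target word j) into integer tuples; how unaligned words are slotted in is our assumption rather than something the rules above fully pin down.

    def reorder_source(src_tokens, alignments):
        """Reorder source tokens by their aligned target positions.

        src_tokens: list of source words.
        alignments: list of (src_idx, tgt_idx) pairs, e.g. parsed from
                    FastAlign's "0-0 1-2 2-1 ..." output format.
        """
        n = len(src_tokens)

        # One-to-many: keep only the first (smallest) target position per source word.
        first_tgt = {}
        for s, t in sorted(alignments):
            first_tgt.setdefault(s, t)

        keys = []
        for i in range(n):
            if i in first_tgt:
                # Never point beyond the source sentence length.
                keys.append(min(first_tgt[i], n - 1))
            else:
                # Unaligned words keep their original index (our assumption).
                keys.append(i)

        # A stable sort keeps many-to-one groups and unaligned words in their
        # original relative order while moving them to the target-side position.
        new_order = sorted(range(n), key=lambda i: keys[i])
        return [src_tokens[i] for i in new_order]

Pairs of original and reordered sentences produced this way form the training corpus for the first Transformer in 2PT and, later, the supervising positions used by DPE.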
Using the reordered sentences, the first Transformer in 2PT is trained to reorder the original source sentences, and the second Transformer, which is responsible for translation, receives reordered source sentences and maps them into their target translations. Despite the different data formats and inputs/outputs, 2PT is still a pipeline that translates a source language into a target one, through internal modifications that are hidden from the user.
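At inference time, 2PT is simply the composition of the two models; the sketch below only illustrates that control flow, and the translate method is a hypothetical interface rather than an API defined in this paper.

    def two_pass_translate(src_sentence, reorder_model, translate_model):
        """2PT inference sketch: the first Transformer rewrites the source into
        a target-like order, the second translates the reordered source."""
        reordered = reorder_model.translate(src_sentence)   # En -> En_reordered
        return translate_model.translate(reordered)         # En_reordered -> De

Because the two models are trained separately, any reordering mistake made by the first Transformer propagates directly into the second one; this lack of end-to-end training is exactly what motivates DPE.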
3.2   Dynamic Position Encoding

Unlike 2PT, the dynamic position encoding (DPE) method takes advantage of end-to-end training, while the source side still learns target reordering position information. It boosts the input of an ordinary Transformer’s encoder with target position information but leaves the architecture itself untouched, as illustrated in Figure 4.

The input to DPE is a source word embedding (wi) summed with a sinusoidal position encoding (pi), i.e. wi ⊕ pi. We refer to these (summed) embeddings as enriched embeddings. Sinusoidal position encoding is part of the original design of Transformers and we assume the reader is familiar with this concept; for more details, see Vaswani et al. (2017).
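For completeness, a minimal sketch of building the enriched embeddings wi ⊕ pi with the standard sinusoidal encoding follows; PyTorch and the tensor shapes are our choices for illustration, assuming an even model dimension.

    import math
    import torch

    def sinusoidal_encoding(max_len, d_model):
        """Standard sinusoidal position encodings, shape (max_len, d_model)."""
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2, dtype=torch.float)
            * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        return pe

    def enrich(word_embeddings):
        """Enriched embeddings w_i (+) p_i for a batch of shape (batch, len, d)."""
        _, seq_len, d_model = word_embeddings.shape
        pe = sinusoidal_encoding(seq_len, d_model).to(word_embeddings.device)
        return word_embeddings + pe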
Figure 3: FastAlign-based reordering using a sample sentence from our English–German dataset. Using word alignments, we generate a new, reordered form for each target sequence; we then use the pairs of source and new target sequences to train the first Transformer of 2PT.

Figure 4: The figure on the left-hand side shows the original Transformer’s architecture and the figure on the right is our proposed architecture. E1 is the first encoder layer of the ordinary Transformer and DP1 is the first layer of the DPE network.

DPE is an additional neural network placed between the enriched embeddings and the first encoder layer of the (translation) Transformer. In other words, the DPE network takes the output of the Transformer’s embedding layer at one end, and its final layer feeds into the first encoder layer of the Transformer. Thus, the DPE network can be trained jointly with the Transformer using the original parallel sentence pairs.

DPE processes enriched embeddings and generates a new form of them, represented as ri in this paper, i.e. DPE(wi ⊕ pi) = ri. DPE-generated embeddings are intended to preserve target-side order information about each word. In the original Transformer, the position of wi is dictated by adding pi, but the original position of the word is not always the best one for translation; ri is defined to address this problem. If wi appears in the i-th position but j is its best position with respect to the target language, ri is supposed to learn information about the j-th position and mimic pj. Accordingly, the combination of pi and ri should provide wi with the pre-reordering information it requires to improve translation.

In our design, DPE consists of two Transformer layers. We determined this number through an empirical study to find a reasonable balance between translation quality and resource consumption. These two layers are connected to an auxiliary loss function to ensure that the output of DPE is what we need for re-ordering.
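As a rough illustration of where DPE sits in the architecture, the sketch below implements it as a small stack of Transformer encoder layers applied to the enriched embeddings. Only the two-layer depth and the placement come from our design; the PyTorch modules and the Transformer Base sizes used as defaults are illustrative assumptions.

    import torch.nn as nn

    class DynamicPositionEncoder(nn.Module):
        """DPE sketch: two Transformer layers that map enriched embeddings
        (w_i + p_i) to refined embeddings r_i, which then feed the first
        encoder layer of the translation Transformer."""

        def __init__(self, d_model=512, n_heads=8, d_ff=2048, n_layers=2):
            super().__init__()
            layer = nn.TransformerEncoderLayer(
                d_model=d_model, nhead=n_heads,
                dim_feedforward=d_ff, batch_first=True)
            self.dpe = nn.TransformerEncoder(layer, num_layers=n_layers)

        def forward(self, enriched):        # (batch, src_len, d_model)
            # r_i = DPE(w_i (+) p_i); trained to mimic the sinusoidal encoding
            # of each word's target-side position p_j (see Equation 1 below).
            return self.dpe(enriched)

Following Figure 4, the output of the second DPE layer (DP2) is what the first encoder layer (E1) of the translation Transformer consumes.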
This additional loss measures the mean squared error between the embeddings produced by DPE (ri) and the supervising positions (PEi) defined by the FastAlign alignments. This learning process is formulated in Equation 1:

   Lorder = (1 / |S|) Σ_{i=1}^{|S|} MSE(PEi, ri)        (1)

where S is a source sequence and MSE() is the mean-squared-error function. The supervising position PEi is obtained by taking the target position (associated with wi) defined by the FastAlign procedure described in Section 3.1.

To clarify how Lorder works, we use the aforementioned scenario as an example. We assumed that the correct position for wi according to the FastAlign alignments is j, so in our setting PEi = pj and we thus compute MSE(pj, ri). Through this technique, we encourage the DPE network to learn pre-reordering in an end-to-end fashion and provide wi with position refinement information.

The total loss for training the entire model combines the auxiliary reordering loss Lorder with the standard Transformer translation loss Ltranslation, as in Equation 2:

   Ltotal = (1 − λ) × Ltranslation + λ × Lorder        (2)

where λ is a hyper-parameter that represents the weight of the reordering loss. λ was determined by minimizing the total loss on the development set during training.
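A compact PyTorch sketch of Equations 1 and 2 is shown below; the cross-entropy form of Ltranslation, the tensor shapes, and the padding handling are assumptions made for illustration, and target_pe holds the sinusoidal encodings pj of the FastAlign-derived target positions for each source word.

    import torch.nn.functional as F

    def dpe_total_loss(logits, target_ids, r, target_pe, lam=0.5, pad_id=0):
        """L_total = (1 - lambda) * L_translation + lambda * L_order (Eq. 2).

        logits:     decoder outputs, (batch, tgt_len, vocab)
        target_ids: reference target tokens, (batch, tgt_len)
        r:          DPE outputs r_i, (batch, src_len, d_model)
        target_pe:  supervising encodings PE_i = p_j, same shape as r
        """
        l_translation = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            target_ids.reshape(-1),
            ignore_index=pad_id)

        # Equation 1: MSE between r_i and PE_i, averaged over the sequence
        # (element-wise mean over positions and embedding dimensions).
        l_order = F.mse_loss(r, target_pe)

        return (1.0 - lam) * l_translation + lam * l_order

With λ = 0.5, the value that worked best on the WMT-14 development data (see Table 4), both terms contribute equally to the total loss.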
4   Experimental Study

4.1   Dataset

To train and evaluate our models, we used the IWSLT-14 collection (Cettolo et al., 2012) and the WMT-14 dataset.2 Our datasets are commonly used in the field, which makes our results easily reproducible. Our code is also publicly available to help other researchers further investigate this topic.3 The IWSLT-14 collection is used to study the impact of our model on the English–German (En–De), English–French (En–Fr), and English–Italian (En–It) pairs. We also report results on the WMT dataset, which provides a larger training corpus. Since the quality of NMT models varies in proportion to the corpus size, these experiments provide more information to better understand our model.

2 http://statmt.org/wmt14/translation-task.html
3 https://github.com/jy6zheng/DynamicPositionEncodingModule

To prepare the data, sequences were lower-cased, normalized, and tokenized using the scripts provided by the Moses toolkit4 (Koehn et al., 2007), and decomposed into subwords via Byte-Pair Encoding (BPE) (Sennrich et al., 2016). The vocabulary sizes extracted for the IWSLT and WMT datasets are 32K and 40K, respectively. For the En–De pair of WMT-14, newstest2013 is used as the development set and newstest2014 as our test set. For the IWSLT experiments, our test and development sets are as suggested by Zhu et al. (2020). Table 1 provides the statistics of our datasets.

4 https://github.com/moses-smt/mosesdecoder

   Data                    Train    Dev    Test
   WMT-14 (En→De)          4.45M    3k     3k
   IWSLT-14 (En→De)        160k     7k     6k
   IWSLT-14 (En→Fr)        168k     7k     4k
   IWSLT-14 (En→It)        167k     7k     6k

Table 1: Statistics of the datasets used in our experiments. Train, Dev, and Test stand for the training, development, and test sets, respectively.

4.2   Experimental Setup

In the interest of fair comparisons, we used the same setup as Chen et al. (2020) to build our baseline for the WMT-14 En–De experiments. This baseline setting was also used for our DPE model and DPE-related experiments. Our models were trained on 8 × V100 GPUs. Since our models rely on the Transformer backbone, all hyper-parameters related to the main Transformer architecture, such as embedding dimensions and the number of attention heads, were set to the default values proposed for Transformer Base in Vaswani et al. (2017). Refer to the original work for detailed information.

For the IWSLT experiments, we used a lighter architecture since the datasets are smaller than WMT: the hidden dimension was 256 for all encoder and decoder layers, and a dimension of 1024 was used for the inner feed-forward network layer. There were 2 encoder and 2 decoder layers, and 2 attention heads. We found this setting through an empirical study to maximize the performance of our IWSLT models.

For the WMT-14 En–De experiments, similar to Chen et al. (2020), we trained the model for 300K updates and used a single model obtained from
averaging the last 5 checkpoints. The model was validated at intervals of 2K updates on the development set, and the decoding beam size was 5. In the IWSLT-14 cases, we trained the models for 15,000 updates and used a single model obtained from averaging the last 5 checkpoints, validated at intervals of 1,000 updates.
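Checkpoint averaging is a standard trick; a minimal sketch, assuming each checkpoint file stores a plain PyTorch state_dict, is:

    import torch

    def average_checkpoints(paths):
        """Average parameter tensors across the last N saved checkpoints."""
        avg = None
        for path in paths:
            state = torch.load(path, map_location="cpu")
            if avg is None:
                avg = {k: v.clone().float() for k, v in state.items()}
            else:
                for k, v in state.items():
                    avg[k] += v.float()
        return {k: v / len(paths) for k, v in avg.items()}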
We evaluated all our models with detokenized BLEU (Papineni et al., 2002).
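As an illustration only (the exact scoring script is not specified above), detokenized BLEU can be computed with sacrebleu:

    import sacrebleu

    hypotheses = ["the cat sat on the mat ."]   # detokenized system outputs
    references = ["the cat sat on the mat ."]   # one reference per hypothesis
    bleu = sacrebleu.corpus_bleu(hypotheses, [references])
    print(f"BLEU = {bleu.score:.2f}")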
4.3   2PT Experiments

Results related to the two-pass architecture are summarized in Table 2. The reordering Transformer (Row 1) works with the source sentences and reorders them with respect to the order of the target language. This is a monolingual translation task with a BLEU score of 35.21. This is a relatively low score for a monolingual setting and indicates how complicated the reordering problem can be; even dedicating a whole Transformer to it does not fully solve it. This finding also indicates that NMT engines can benefit from an auxiliary module to handle order complexities. It is usually assumed that the translation engine should perform in an end-to-end fashion, where it deals with reordering, translation, and other complexities via a single model at the same time. However, if we can separate these sub-tasks systematically and tackle them individually, we might be able to improve the overall quality.

In Row 3, we consume the information generated previously (in Row 1) and show how a translation model performs when it is fed with reordered sentences. The BLEU score for this task is 21.96, which is significantly lower than the baseline (Row 2). Order information was supposed to increase the overall performance, but we see a degradation. This is because the first Transformer is unable to detect the correct order (due to the difficulty of this task). In Row 4, we feed the same translation engine with higher-quality order information and BLEU rises to 31.82. We cannot use FastAlign at test time, but this experiment shows that our hypothesis on the usefulness of order information seems to be correct. Motivated by this, we designed DPE to better leverage order information; those results are reported in the next section.

4.4   DPE Experiments

Results related to DPE are reported in Table 3. According to the reported scores, DPE leads to a +0.81 improvement in BLEU compared to Transformer Base. To ensure that we evaluated DPE in a fair setup, we re-implemented Transformer Base in our own environment. This eliminates the impact of other factors and ensures that the gain is due to the design of the DPE module itself. We also compare our model with models discussed in the related literature, such as that of Shaw et al. (2018b), the reordering embeddings of Chen et al. (2019), and the more recent explicit reordering embeddings of Chen et al. (2020). Our model achieves the best score, and we believe it is due to the direct use of order information.

For the DPE architecture, we decided to have two layers (DP1 and DP2), as this produced the best BLEU scores on the development sets without imposing significant training overhead. One important hyper-parameter that directly affects DPE’s performance is the position loss weight (λ). We ran an ablation study on the development set to adjust λ; Table 4 summarizes our findings. The best λ value in our setting is 0.5. This value provides an acceptable balance between translation accuracy and pre-reordering costs during training.

The design of our Transformer (Transformer Base + DPE) might raise the concern that incorporating pre-reordering information or defining an auxiliary loss is not necessary. One might suggest that if we use the same amount of resources to increase the number of Transformer Base’s encoder parameters, we should obtain competitive or even better results than the DPE-enhanced Transformer. To address this concern, we designed another experiment where we increase the number of parameters/layers in the Transformer Base encoder to match the number of our model’s parameters. Results related to this experiment are shown in Table 5.

The comparison between DPE and the extended versions of Transformer Base, namely 8E (8 encoder layers) and 10E (10 encoder layers), demonstrates that the increase in BLEU is due to the position information provided by DPE rather than the additional parameters of the DPE layers. In 8E, we provided the same number of additional parameters as the DPE module adds, but observed less gain in translation quality than with DPE. In 10E, we even doubled the number of additional parameters to surpass the number of parameters that DPE uses, and yet the DPE extension with 8 encoding layers (two
for pre-reordering and six from the original translation encoder) is still superior. This reinforces the idea that our DPE module improves translation accuracy by injecting position information alongside the encoder input.

        Model                                                    Data type            BLEU Score
   1    Reordering Transformer                                   En→En_reordered      35.21
   2    Transformer Base                                         En→De                27.76
   3    + fed with the output of the reordering Transformer      En_reordered→De      21.96
   4    + fed with the output of FastAlign                       En_reordered→De      31.82

Table 2: BLEU scores for the 2PT series of experiments.

   Model                                                                   # Params    En→De (WMT)
   Transformer Base (Vaswani et al., 2017)                                 65.0 M      27.30
   + Relative PE (Shaw et al., 2018b)                                      N/A         26.80
   + Explicit Global Reordering Embeddings (Chen et al., 2020)             66.5 M      28.44
   + Reorder fusion-based source representation (Chen et al., 2020)        66.5 M      28.55
   + Reordering Embeddings (Encoder Only) (Chen et al., 2019)              102.1 M     28.03
   + Reordering Embeddings (Encoder/Decoder) (Chen et al., 2019)           106.8 M     28.22
   Transformer Base (our re-implementation)                                66.5 M      27.78
   Dynamic Position Encoding (λ = 0.5)                                     72.8 M      28.59

Table 3: BLEU score comparison of DPE versus its peers.

   Model              WMT'14 En→De
   Baseline           27.78
   DPE (λ = 0.1)      28.17
   DPE (λ = 0.3)      28.16
   DPE (λ = 0.5)      28.59
   DPE (λ = 0.7)      27.98

Table 4: BLEU scores of DPE with different values of λ.

   Model              WMT'14 En→De
   Baseline           27.78
   8E                 28.07
   10E                28.54
   DPE (N = 2)        28.59 (+0.81)

Table 5: A BLEU score comparison of DPE against the baseline Transformer models with additional encoder layers (8E for 8 encoder layers and 10E for 10 encoder layers).

4.5   Experimental Results on Other Languages

In addition to the previously reported experiments, we evaluated the DPE model on the IWSLT-14 datasets for English–German (En–De), English–French (En–Fr), and English–Italian (En–It). After tuning with different position loss weights on the development set, we determined λ = 0.3 to be ideal for the IWSLT-14 setting. The results in Table 6 show that with DPE, translation accuracy improves across these settings, so the improvement is not unique to the En–De WMT language pair.

   Model                    En→De             En→Fr             En→It
   Transformer              26.42             38.86             27.94
   DPE-based Extension      27.47 (↑ 1.05)    39.42 (↑ 0.56)    28.35 (↑ 0.41)

Table 6: BLEU results for the different IWSLT-14 language pairs.

Our DPE architecture works with a variety of language pairs and dataset sizes, and this increases our confidence in the beneficial effect of order information. It is usually hard to show the impact of auxiliary signals in NMT models, and this can be even harder with smaller datasets, but our IWSLT results are promising. Accordingly, it is fair to claim that DPE is useful regardless of the language and dataset size.

5   Conclusion and Future Work

In this paper, we first explored whether Transformers would benefit from order signals. We then proposed a new architecture, DPE, that generates embeddings containing target word position information to boost translation quality.

The results obtained in our experiments demonstrate that DPE improves the translation process by helping the source side learn target position information. The DPE model consistently outperformed the baselines from the related literature. It also showed improvements across different language pairs and dataset sizes.

Our experiments can provide the groundwork for further exploration of dynamic position encoding in Transformers. In future work, we plan to study DPE from a linguistic point of view to further analyze its behaviour. We can also explore injecting order information into other language processing models through DPE or a similar mechanism. Such information seems to be useful for tasks such as dependency parsing or sequence tagging.

References

Arianna Bisazza and Marcello Federico. 2016. Surveys: A survey of word reordering in statistical machine translation: Computational models and language phenomena. Computational Linguistics, 42(2):163–205.

Mauro Cettolo, Christian Girardi, and Marcello Federico. 2012. WIT3: Web inventory of transcribed and translated talks. In Proceedings of the 16th Annual Conference of the European Association for Machine Translation, pages 261–268, Trento, Italy. European Association for Machine Translation.

Kehai Chen, Rui Wang, Masao Utiyama, and Eiichiro Sumita. 2019. Neural machine translation with reordering embeddings. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1787–1799, Florence, Italy. Association for Computational Linguistics.

Kehai Chen, Rui Wang, Masao Utiyama, and Eiichiro Sumita. 2020. Explicit reordering for neural machine translation.

Yiming Cui, Shijin Wang, and Jianfeng Li. 2016. LSTM neural reordering feature for statistical machine translation. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 977–982, San Diego, California. Association for Computational Linguistics.

Jinhua Du and Andy Way. 2017. Pre-reordering for neural machine translation: helpful or harmful? Prague Bulletin of Mathematical Linguistics, (108):171–181.

Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A simple, fast, and effective reparameterization of IBM model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 644–648, Atlanta, Georgia. Association for Computational Linguistics.

Minwei Feng, J. Peter, and H. Ney. 2013. Advancements in reordering models for statistical machine translation. In ACL.

Sarthak Garg, Stephan Peitz, Udhyakumar Nallasamy, and Matthias Paulik. 2019. Jointly learning to align and translate with transformer models.

Yuki Kawara, Chenhui Chu, and Yuki Arase. 2020. Preordering encoding on transformer for translation. IEEE/ACM Transactions on Audio, Speech, and Language Processing.

Guolin Ke, Di He, and Tie-Yan Liu. 2020. Rethinking positional encoding in language pre-training.

Philipp Koehn. 2009. Statistical machine translation. Cambridge University Press.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic. Association for Computational Linguistics.

Xuanqing Liu, Hsiang-Fu Yu, Inderjit Dhillon, and Cho-Jui Hsieh. 2020. Learning to encode position for transformer with continuous dynamical model.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units.

Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018a. Self-attention with relative position representations. CoRR, abs/1803.02155.

Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018b. Self-attention with relative position representations.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Chao Wang, Michael Collins, and Philipp Koehn. 2007. Chinese syntactic reordering for statistical machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 737–745, Prague, Czech Republic. Association for Computational Linguistics.

Jinhua Zhu, Yingce Xia, Lijun Wu, Di He, Tao Qin, Wengang Zhou, Houqiang Li, and Tieyan Liu. 2020. Incorporating bert into neural machine translation. In International Conference on Learning Representations.