On the Difficulty of Translating Free-Order Case-Marking Languages


Arianna Bisazza        Ahmet Üstün        Stephan Sportel
Center for Language and Cognition
University of Groningen
{a.bisazza, a.ustun}@rug.nl, research@spor.tel

arXiv:2107.06055v1 [cs.CL] 13 Jul 2021

Abstract

Identifying factors that make certain languages harder to model than others is essential to reach language equality in future Natural Language Processing technologies. Free-order case-marking languages, such as Russian, Latin or Tamil, have proved more challenging than fixed-order languages for the tasks of syntactic parsing and subject-verb agreement prediction. In this work, we investigate whether this class of languages is also more difficult to translate by state-of-the-art Neural Machine Translation models (NMT). Using a variety of synthetic languages and a newly introduced translation challenge set, we find that word order flexibility in the source language only leads to a very small loss of NMT quality, even though the core verb arguments become impossible to disambiguate in sentences without semantic cues. The latter issue is indeed solved by the addition of case marking. However, in medium- and low-resource settings, the overall NMT quality of fixed-order languages remains unmatched.

1 Introduction

Despite the tremendous advances achieved in less than a decade, Natural Language Processing remains a field where language equality is far from being reached (Joshi et al., 2020). In the field of Machine Translation, modern neural models have attained remarkable quality for high-resource language pairs like German-English, Chinese-English or English-Czech, with a number of studies claiming even human parity (Hassan et al., 2018; Bojar et al., 2018; Barrault et al., 2019; Popel et al., 2020). These results may lead to the unfounded belief that NMT methods will perform equally well in any language pair, provided similar amounts of training data. In fact, several studies suggest the opposite (Platanios et al., 2018; Ataman and Federico, 2018; Bugliarello et al., 2020).

Why, then, do some language pairs have lower translation accuracy? And, more specifically: are certain typological profiles more challenging for current state-of-the-art NMT models? Every language has its own combination of typological properties, including word order, morphosyntactic features and more (Dryer and Haspelmath, 2013). Identifying language properties (or combinations thereof) that pose major problems to the current modeling paradigms is essential to reach language equality in future MT (and other NLP) technologies (Joshi et al., 2020), in a way that is orthogonal to data collection efforts. Among others, natural languages adopt different mechanisms to disambiguate the role of their constituents: flexible order typically correlates with the presence of case marking and, vice versa, fixed order is observed in languages with little or no case marking (Comrie, 1981; Sinnemäki, 2008; Futrell et al., 2015b). Morphologically rich languages in general are known to be challenging for MT at least since the times of phrase-based statistical MT (Birch et al., 2008) due to their larger and sparser vocabularies, and remain challenging even for modern neural architectures (Ataman and Federico, 2018; Belinkov et al., 2017). By contrast, the relation between word order flexibility and MT quality has not been directly studied to our knowledge.

In this paper, we study this relationship using strictly controlled experimental setups. Specifically, we ask:

• Are current state-of-the-art NMT systems biased towards fixed-order languages?

• To what extent does case marking compensate for the lack of a fixed order in the source language?

Unfortunately, parallel data is scarce in most of the world's languages (Guzmán et al., 2019), and corpora in different languages are drawn from different domains.
Exceptions exist, like the widely used Europarl (Koehn, 2005), but represent a small fraction of the large variety of typological feature combinations attested in the world. This makes it very difficult to run a large-scale comparative study and isolate the factors of interest from, e.g., domain mismatch effects. As a solution, we propose to evaluate NMT on synthetic languages (Gulordava and Merlo, 2016; Wang and Eisner, 2016; Ravfogel et al., 2019) that differ from each other only by specific properties, namely the order of main constituents or the presence and nature of case markers (see the example in Table 1).

We use this approach to isolate the impact of various source-language typological features on MT quality and to remove the typical confounders of corpus size and domain. Using a variety of synthetic languages and a newly introduced challenge set, we find that state-of-the-art NMT has little to no bias towards fixed-order languages, but only when a sizeable training set is available.

Fixed        VSO    follows the little cat the friendly dog
             VOS    follows the friendly dog the little cat
Free+Case           follows the little cat#S the friendly dog#O
                    OR
                    follows the friendly dog#O the little cat#S
Translation         de kleine kat volgt de vriendelijke hond

Table 1: Example sentence in different fixed/flexible-order English-based synthetic languages and their SVO Dutch translation. The subject in each sentence is underlined. Artificial case markers start with #.

2 Free-order Case-marking Languages

The word order profile of a language is usually represented by the canonical order of its main constituents, (S)ubject, (O)bject, (V)erb. For instance, English and French are SVO languages, while Turkish and Hindi are SOV. Other, less commonly attested, word orders are VSO and VOS, while OSV and OVS are extremely rare (Dryer, 2013). While many other word order features exist (e.g., noun/adjective), they often correlate with the order of main constituents (Greenberg, 1963).

A different, but likewise important dimension is that of word order freedom (or flexibility). Languages that primarily rely on the position of a word to encode grammatical roles typically display rigid orders (like English or Mandarin Chinese), while languages that rely on case marking can be more flexible, allowing word order to express discourse-related factors like topicalization. Examples of highly flexible-order languages include languages as diverse as Russian, Hungarian, Latin, Tamil and Turkish.1

1 See Futrell et al. (2015b) for detailed figures of word order freedom (measured by the entropy of subject and object dependency relation order) in a diverse sample of 34 languages.

In the field of psycholinguistics, due to the historical influence of English-centered studies, word order has long been considered the primary and most natural device through which children learn to infer syntactic relationships in their language (Slobin, 1966). However, cross-linguistic studies have later revealed that children are equally prepared to acquire both fixed-order and inflectional languages (Slobin and Bever, 1982).

Coming to computational linguistics, data-driven MT and other NLP approaches were also historically developed around languages with remarkably fixed order and very simple to moderately simple morphological systems, like English or French. Luckily, our community has been giving increasing attention to more and more languages with diverse typologies, especially in the last decade. So far, previous work has found that free-order languages are more challenging for parsing (Gulordava and Merlo, 2015, 2016) and subject-verb agreement prediction (Ravfogel et al., 2019) than their fixed-order counterparts. This raises the question of whether word order flexibility also negatively affects MT quality.

Before the advent of modern NMT, Birch et al. (2008) used the Europarl corpus to study how various language properties affected the quality of phrase-based Statistical MT. Amount of reordering, target morphological complexity, and historical relatedness of source and target languages were identified as strong predictors of MT quality. Recent work by Bugliarello et al. (2020), however, has failed to show a correlation between NMT difficulty (measured by a novel information-theoretic metric) and several linguistic properties of source and target language, including Morphological Counting Complexity (Sagot, 2013) and Average Dependency Length (Futrell et al., 2015a). While that work specifically aimed at ensuring cross-linguistic comparability, the sample on which the linguistic properties could be computed (Europarl) was rather small and not very typologically diverse, leaving our research questions open to further investigation. In this paper, we therefore opt for a different methodology: namely, synthetic languages.
3 Methodology

Synthetic languages This paper presents two sets of experiments. In the first (§4), we create parallel corpora using very simple and predictable artificial grammars and small vocabularies (Lupyan and Christiansen, 2002); see the example in Table 1. By varying the position of subject/verb/object and introducing case markers in the source language, we study the biases of two NMT architectures in optimal training data conditions and a fully controlled setup, i.e. without any other linguistic cues that may disambiguate constituent roles. In the second set of experiments (§5), we move to a more realistic setup using synthetic versions of the English language that differ from it in only one or a few selected typological features (Ravfogel et al., 2019). For instance, the original sentence's order (SVO) is transformed to different orders, like SOV or VSO, based on its syntactic parse tree.

In both cases, typological variations are introduced in the source side of the parallel corpora, while the target language remains fixed. In this way, we avoid the issue of non-comparable BLEU scores across different target languages. Lastly, we make the simplifying assumption that, when verb-argument order varies from the canonical order in a flexible-order language, it does so in a totally arbitrary way. While this is rarely true in practice, as word order may be predictable given pragmatics or other factors, we focus here on "the extent to which word order is conditioned on the syntactic and compositional semantic properties of an utterance" (Futrell et al., 2015b).

Translation models We consider two widely used NMT architectures that crucially differ in their encoding of positional information: (i) the recurrent sequence-to-sequence BiLSTM with attention (Bahdanau et al., 2015; Luong et al., 2015), which processes the input symbols sequentially and has each hidden state directly conditioned on that of the previous (or following, for the backward LSTM) timestep (Elman, 1990; Hochreiter and Schmidhuber, 1997); and (ii) the non-recurrent, fully attention-based Transformer (Vaswani et al., 2017), which processes all input symbols in parallel, relying on dedicated embeddings to encode each input's position.2 Transformer has nowadays surpassed recurrent encoder-decoder models in terms of generic MT quality. Moreover, Choshen and Abend (2019) have recently shown that Transformer-based NMT models are indifferent to the absolute order of source words, at least when equipped with learned positional embeddings. On the other hand, the lack of recurrence in Transformers has been linked to a limited ability to capture hierarchical structure (Tran et al., 2018; Hahn, 2020). To our knowledge, no previous work has studied the biases of either architecture towards fixed-order languages in a systematic manner.

2 We use sinusoidal embeddings (Vaswani et al., 2017). All our models are built using OpenNMT: https://github.com/OpenNMT/OpenNMT-py
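For readers unfamiliar with the positional encoding mentioned in footnote 2, the following short sketch reproduces the sinusoidal position embeddings of Vaswani et al. (2017) in NumPy. It is an illustrative re-implementation, not code from the paper's OpenNMT setup.

```python
import numpy as np

def sinusoidal_positions(max_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal position embeddings (Vaswani et al., 2017).

    Each position is mapped to a fixed vector; the Transformer relies on
    these (or on learned) embeddings for word order, whereas a BiLSTM gets
    order implicitly from its sequential recurrence.
    """
    positions = np.arange(max_len)[:, None]                 # (max_len, 1)
    dims = np.arange(d_model)[None, :]                      # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                        # (max_len, d_model)
    enc = np.zeros((max_len, d_model))
    enc[:, 0::2] = np.sin(angles[:, 0::2])                  # even dimensions: sine
    enc[:, 1::2] = np.cos(angles[:, 1::2])                  # odd dimensions: cosine
    return enc

if __name__ == "__main__":
    print(sinusoidal_positions(max_len=50, d_model=8).shape)  # (50, 8)
```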
4 Toy Parallel Grammar

We start by evaluating our models on a pair of toy languages inspired by the English-Dutch pair and created using a Synchronous Context-Free Grammar (Chiang and Knight, 2006). Each sentence consists of a simple clause with a transitive verb, subject and object. Both arguments are singular and optionally modified by an adjective. The source vocabulary contains 6 nouns, 6 verbs and 6 adjectives, and the complete corpus contains 10k generated sentence pairs. Working with such a small, finite grammar allows us to simulate an otherwise impossible situation where the NMT model can be trained on (almost) the totality of a language's utterances, canceling out data sparsity effects.3

3 Data and code to replicate the toy grammar experiments in this section are available at https://github.com/573phn/cm-vs-wo

Source Language Variants We consider three source language variants, illustrated in Table 1:

• fixed-order VSO;

• fixed-order VOS;

• mixed-order (randomly chosen between VSO or VOS) with nominal case marking.

We choose these word orders so that, in the flexible-order corpus, the only way to disambiguate argument roles is case marking, realized by simple unambiguous suffixes (#S and #O). The target language is always fixed SVO. The same random split (80/10/10% training/validation/test) is applied to the three corpora.
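To make the toy setup concrete, here is a hypothetical sketch of how such a parallel corpus could be generated. The word lists and Dutch glosses are illustrative placeholders rather than the actual grammar released with the paper; only the overall scheme (optional adjectives, a fixed SVO Dutch target, and #S/#O suffixes in the mixed-order variant) follows the description above.

```python
import random

# Placeholder vocabulary: the real toy grammar uses 6 nouns/verbs/adjectives,
# but the exact word lists below are illustrative.
NOUNS = {"cat": "kat", "dog": "hond", "bird": "vogel"}
ADJS  = {"little": "kleine", "friendly": "vriendelijke", "old": "oude"}
VERBS = {"follows": "volgt", "sees": "ziet", "likes": "mag"}

def noun_phrase(rng):
    """English/Dutch NP with an optional adjective, e.g. 'the little cat'."""
    noun = rng.choice(list(NOUNS))
    words_en, words_nl = ["the"], ["de"]
    if rng.random() < 0.5:
        adj = rng.choice(list(ADJS))
        words_en.append(adj)
        words_nl.append(ADJS[adj])
    words_en.append(noun)
    words_nl.append(NOUNS[noun])
    return " ".join(words_en), " ".join(words_nl)

def make_pair(variant, rng):
    """Return (source, target) for one of: 'vso', 'vos', 'mix+case'."""
    subj_en, subj_nl = noun_phrase(rng)
    obj_en, obj_nl = noun_phrase(rng)
    verb = rng.choice(list(VERBS))
    target = f"{subj_nl} {VERBS[verb]} {obj_nl}"            # fixed SVO Dutch
    if variant == "vso":
        source = f"{verb} {subj_en} {obj_en}"
    elif variant == "vos":
        source = f"{verb} {obj_en} {subj_en}"
    else:  # mixed order; roles are disambiguated only by case suffixes
        s, o = subj_en + "#S", obj_en + "#O"
        source = f"{verb} {s} {o}" if rng.random() < 0.5 else f"{verb} {o} {s}"
    return source, target

if __name__ == "__main__":
    rng = random.Random(0)
    for variant in ("vso", "vos", "mix+case"):
        print(variant, "->", make_pair(variant, rng))
```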
[Figure 1: three plots of sentence-level accuracy (0-100, y-axis) versus training epochs (200-1000, x-axis) for the source variants vos, vso and mix. Panels: (a) BiLSTM with attention, (b) Large Transformer, (c) Small Transformer.]

Figure 1: Toy language NMT sentence-level accuracy on validation set by number of training epochs. Source languages: fixed-order VSO, fixed-order VOS, and mixed-order (VSO/VOS) with case marking. Target language: always fixed SVO. Each experiment is repeated five times, and averaged results are shown.

NMT Setup As recurrent model, we trained a 2-layer BiLSTM with attention (Luong et al., 2015) with a hidden layer size of 500. As Transformer models, we trained one using the standard 6-layer configuration (Vaswani et al., 2017) and a smaller one with only 2 layers, given the simplicity of the languages. All models are trained at the word level using the complete vocabulary. More hyper-parameters are provided in Appendix A.1. Note that our goal is not to compare LSTM and Transformer accuracy to each other, but rather to observe the different trends across fixed- and flexible-order language variants. Given the small vocabulary, we use sentence-level accuracy instead of BLEU for evaluation.

Results As shown in Figure 1, all models achieve perfect accuracy on all language pairs after 1000 training steps, except for the Large Transformer on the free-order language, likely due to overparametrization (Sankararaman et al., 2020). These results demonstrate that our NMT architectures are equally capable of modeling translation of both types of language, when all other factors of variation are controlled for. Nonetheless, a pattern emerges when looking at the learning curves within each plot: while the two fixed-order languages have very similar learning curves, the free-order language with case markers always requires slightly more training steps to converge. This is also the case, albeit to a lesser extent, when the mixed-order corpus is pre-processed by splitting all case suffixes from the nouns (extra experiment not shown in the plot). This trend is noteworthy, given the simplicity of our grammars and the transparency of the case system. As our training sets cover a large majority of the languages' possible utterances, this result might suggest that free-order natural languages need larger training datasets to reach a translation quality similar to that of their fixed-order counterparts. In §5 we validate this hypothesis on more naturalistic language data.

5 Synthetic English Variants

Experimenting with toy languages has its shortcomings, like the small vocabulary size and the non-realistic distribution of words and structures. In this section, we follow the approach of Ravfogel et al. (2019) to validate our findings in a less controlled but more realistic setup. Specifically, we create several variants of the Europarl English-French parallel corpus where the source sentences are modified by changing word order and adding artificial case markers. We choose French as target language because of its fixed order, SVO, and its relatively simple morphology.4 As Indo-European languages, English and French are moderately related in terms of syntax and vocabulary while being sufficiently distant to avoid a word-by-word translation strategy in many cases.

4 According to the Morphological Counting Complexity (Sagot, 2013) values reported by Cotterell et al. (2018), English scores 6 (least complex), Dutch 26, French 30, Spanish 71, Czech 195, and Finnish 198 (most complex).

Source language variants are obtained by transforming the syntactic tree of the original sentences. While Ravfogel et al. (2019) could rely on the Penn Treebank (Marcus et al., 1993) for their monolingual task of agreement prediction, we instead need parallel data. For this reason, we parse the English side of the Europarl v.7 corpus (Koehn, 2005) using the Stanza dependency parser (Qi et al., 2020; Manning et al., 2014). After parsing, we adopt a modified version of the synthetic language generator by Ravfogel et al. (2019) to create the following English variants:5
• fixed-order: either SVO, SOV, VSO or VOS;6

• free-order: for each sentence in the corpus, one of the six possible orders of (Subject, Object, Verb) is chosen randomly;

• shuffled words: all source words are shuffled regardless of their syntactic role. This is our lower bound, measuring the reordering ability of a model in the total absence of source-side order cues (akin to bag-of-words input).

5 Our revised language generator is available at https://github.com/573phn/rnn_typology

6 To keep the number of experiments manageable, we omit object-initial languages, which are significantly less attested among world languages (Dryer, 2013).

To allow for a fair comparison with the artificial case-marking languages, we remove number agreement features from verbs in all the above variants (cf. says → say in Table 2).

To answer our second research question, we experiment with two artificial case systems proposed by Ravfogel et al. (2019) and illustrated in Table 2 (overt suffixes):

• unambiguous case system: suffixes indicating argument role (subject/object/indirect object) and number (singular/plural) are added to the heads of noun and verb phrases;

• syncretic case system: suffixes indicating number but not grammatical function are added to the heads of main arguments, providing only partial disambiguation of argument roles. This system is inspired by subject/object syncretism in Russian.

Syncretic case systems were found to be roughly as common as non-syncretic ones in a large sample of almost 200 world languages (Baerman and Brown, 2013). Case marking is always combined with the fully flexible order of main constituents. As in Ravfogel et al. (2019), English number marking is removed from verbs and their arguments before adding the artificial suffixes.

Original (no case):
The woman says her sisters often invited her for dinner.
SOV (no case):
The woman her sisters her often invited for dinner say.
SOV, syncretic case marking (overt):
The woman.arg.sg her sisters.arg.pl she.arg.sg often invited.arg.pl for dinner say.arg.sg.
SOV, unambiguous case marking (overt):
The woman.nsubj.sg her sisters.nsubj.pl she.dobj.sg often invited.dobj.sg.nsubj.pl for dinner say.nsubj.sg.
SOV, unambiguous case (implicit):
The womankar her sisterskon shekin often invitedkinkon for dinner saykar.
SOV, unambiguous case (implicit with declensions):
The womankar her sisterspon shekit often invitedkitpon for dinner saykar.
French translation:
La femme dit que ses soeurs l'invitaient souvent à dîner.

Table 2: Examples of synthetic English variants and their (common) French translation. The full list of suffixes is provided in Appendix A.3.

5.1 NMT Setup

Models As recurrent model, we used a 3-layer BiLSTM with hidden size of 512 and MLP attention (Bahdanau et al., 2015). The Transformer model has the standard 6-layer configuration with hidden size of 512, 8 attention heads, and sinusoidal positional encoding (Vaswani et al., 2017). All models use subword representation based on 32k BPE merge operations (Sennrich et al., 2016), except in the low-resource setup where this is reduced to 10k operations. More hyper-parameters are provided in Appendix A.1.

Data and Evaluation We train our models on various subsets of the English-French Europarl corpus: 1.9M sentence pairs (high-resource), 100K (medium-resource), and 10K (low-resource). For evaluation, we use 5K sentences randomly held out from the same corpus. Given the importance of word order for assessing the correct translation of verb arguments into French, we compute the reordering-focused RIBES7 metric (Isozaki et al., 2010) in addition to the more commonly used BLEU (Papineni et al., 2002). In each experiment, the source side of training and test data is transformed using the same procedure whereas the target side remains unchanged. We repeat each experiment 3 times (or 4 for languages with random order choice) and report the averaged results.

7 BLEU captures local word-order errors only indirectly (lower precision of higher-order n-grams) and does not capture long-range word-order errors at all. By contrast, RIBES directly measures the correlation between the word ranks in the reference and those in the MT output.
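As a toy illustration of the intuition in footnote 7, the function below computes a normalized Kendall's tau over the positions of words shared by hypothesis and reference. This is not the official RIBES implementation (which, roughly speaking, also weights the rank correlation by unigram precision and a brevity penalty); it only conveys what a rank-correlation-based reordering score reacts to.

```python
from itertools import combinations

def normalized_kendall_tau(reference: str, hypothesis: str) -> float:
    """Toy word-order score: rank correlation of shared words.

    For every word occurring exactly once in both sentences, compare the
    relative order of each pair of such words in hypothesis vs. reference.
    Returns a value in [0, 1]; 1 means identical relative order.
    """
    ref, hyp = reference.split(), hypothesis.split()
    shared = [w for w in hyp if hyp.count(w) == 1 and ref.count(w) == 1]
    if len(shared) < 2:
        return 0.0
    ref_rank = {w: ref.index(w) for w in shared}
    hyp_order = [ref_rank[w] for w in shared]        # reference ranks, in hypothesis order
    concordant = sum(1 for a, b in combinations(hyp_order, 2) if a < b)
    total = len(hyp_order) * (len(hyp_order) - 1) / 2
    return concordant / total

if __name__ == "__main__":
    ref = "le président remercie le ministre"
    print(normalized_kendall_tau(ref, "le président remercie le ministre"))  # 1.0
    print(normalized_kendall_tau(ref, "le ministre remercie le président"))  # 0.0
```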
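To make the source-side transformations of this section concrete, the sketch below rearranges the core constituents of a hand-built dependency parse and attaches overt case suffixes to the argument heads. The Word fields loosely mimic Stanza/CoNLL-U output (token index, head index, dependency relation); the function is a heavily simplified stand-in for the modified Ravfogel et al. (2019) generator linked in footnote 5: it handles a single main clause only, hard-codes singular number, and ignores auxiliaries, subclauses, indirect objects and punctuation.

```python
from dataclasses import dataclass

@dataclass
class Word:
    idx: int        # 1-based token index, as in CoNLL-U / Stanza
    form: str
    head: int       # index of the syntactic head (0 = root)
    deprel: str

# Hand-built parse of a toy sentence (what a dependency parser would give us):
# "the woman sees the dog"
SENT = [
    Word(1, "the",   2, "det"),
    Word(2, "woman", 3, "nsubj"),
    Word(3, "sees",  0, "root"),
    Word(4, "the",   5, "det"),
    Word(5, "dog",   3, "obj"),
]

def subtree(words, root_idx):
    """All token indices dominated by root_idx (including root_idx itself)."""
    result, frontier = {root_idx}, [root_idx]
    while frontier:
        head = frontier.pop()
        for w in words:
            if w.head == head and w.idx not in result:
                result.add(w.idx)
                frontier.append(w.idx)
    return result

def reorder(words, order="SOV", case=True):
    """Rearrange the core constituents of the root verb and add toy suffixes."""
    root = next(w for w in words if w.deprel == "root")
    subj = next((w for w in words if w.head == root.idx and w.deprel == "nsubj"), None)
    obj = next((w for w in words if w.head == root.idx and w.deprel == "obj"), None)
    blocks = {"S": subtree(words, subj.idx) if subj else set(),
              "O": subtree(words, obj.idx) if obj else set()}
    blocks["V"] = {w.idx for w in words} - blocks["S"] - blocks["O"]
    out = []
    for role in order:                          # e.g. "S", then "O", then "V"
        head_idx = subj.idx if (role == "S" and subj) else \
                   obj.idx if (role == "O" and obj) else None
        for idx in sorted(blocks[role]):        # keep original order inside a block
            form = next(x.form for x in words if x.idx == idx)
            if case and idx == head_idx:        # mark only the head of the argument
                form += ".nsubj.sg" if role == "S" else ".dobj.sg"
            out.append(form)
    return " ".join(out)

if __name__ == "__main__":
    print(reorder(SENT, order="SOV", case=True))
    # -> "the woman.nsubj.sg the dog.dobj.sg sees"
```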
5.2 Challenge Set

Besides syntactic structure, natural language often contains semantic and collocational cues that help disambiguate the role of an argument. Small BLEU/RIBES differences between our language variants may indicate actual robustness of a model to word order flexibility, but may also indicate that a model relies on those cues rather than on syntactic structure (Gulordava et al., 2018). To discern these two hypotheses, we create a challenge set of 7,200 simple affirmative and negative sentences where swapping subject and object leads to another plausible sentence.8 Each English sentence and its reverse are included in the test set together with the respective translations, as for example:

(1) a. The president thanks the minister. /
       Le président remercie le ministre.

    b. The minister thanks the president. /
       Le ministre remercie le président.

The source side is then processed as explained in §5 and translated by the NMT model trained on the corresponding language variant. Thus, translation quality on this set reflects the extent to which NMT models have robustly learnt to detect verb arguments and their roles independently from other cues, which we consider an important sign of linguistic generalization ability. Due to space constraints, we only present RIBES scores on the challenge set.9

8 More details can be found in Appendix A.2. We release the challenge set at https://github.com/arianna-bis/freeorder-mt

9 We also computed BLEU scores: they strongly correlate with RIBES but fluctuate more due to the larger effect of lexical choice.
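As a hypothetical sketch of how such reversible items can be assembled, the snippet below pairs each ordered combination of two animate subjects with a transitive verb, so that every English sentence and its subject-object swap both occur in the set. The word lists are placeholders, not the ones behind the released 7,200-sentence challenge set, which also includes negative sentences.

```python
from itertools import permutations

# Placeholder vocabulary; the released challenge set uses larger lists.
SUBJECTS = ["the president", "the minister", "the journalist"]
VERBS_EN_FR = {"thanks": "remercie", "supports": "soutient"}
NOUNS_FR = {"the president": "le président",
            "the minister": "le ministre",
            "the journalist": "le journaliste"}

def challenge_pairs():
    """Yield (English, French) pairs such that swapping subject and object
    of the English sentence yields another item that is also in the set."""
    for a, b in permutations(SUBJECTS, 2):      # ordered pairs: both directions occur
        for verb_en, verb_fr in VERBS_EN_FR.items():
            en = f"{a} {verb_en} {b} ."
            fr = f"{NOUNS_FR[a]} {verb_fr} {NOUNS_FR[b]} ."
            yield en, fr

if __name__ == "__main__":
    for en, fr in list(challenge_pairs())[:4]:
        print(en, "->", fr)
```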
5.3 High-Resource Results

Table 3 reports the high-resource setting results. The first row (original English to French) is given only for reference and shows the overall highest results. The BLEU drop observed when moving to any of the fixed-order variants (including SVO) is likely due to parsing flaws resulting in awkward reorderings. As this issue affects all our synthetic variants, it does not undermine the validity of our findings. For clarity, we center our main discussion on the Transformer results and comment on the BiLSTM results at the end of this section.

English*→French             BiLSTM                            TRANSFORMER
Large Training (1.9M)       Europarl-Test        Challenge    Europarl-Test        Challenge
                            BLEU      RIBES      RIBES        BLEU      RIBES      RIBES

Original English            39.4      85.0       98.0         38.3      84.9       97.7
Fixed Order:
S-V-O                       38.3      84.5       98.1         37.7      84.6       98.0
S-O-V                       37.6      84.2       97.7         37.9      84.5       97.2
V-S-O                       38.0      84.2       97.8         37.8      84.6       98.0
V-O-S                       37.8      84.0       98.0         37.6      84.3       97.2
Average (fixed orders)      37.9±0.4  84.2±0.3   97.9±0.2     37.8±0.1  84.5±0.1   97.6±0.4
Flexible Order:
Random, no case             37.1      83.7       75.1         37.5      84.2       74.1
Random + syncretic case     36.9      83.6       75.4         37.3      84.2       84.4
Random + unambig. case      37.3      83.9       97.7         37.3      84.4       98.1
Shuffle all words           18.5      65.2       79.4         25.8      71.2       83.2

Table 3: Translation quality from various English-based synthetic languages into standard French, using the largest training data (1.9M sentences). NMT architectures: 3-layer BiLSTM seq-to-seq with attention; 6-layer Transformer. Europarl-Test: 5K held-out Europarl sentences; Challenge set: see §5.2. All scores are averaged over three training runs.

Fixed-Order Variants All four tested fixed-order variants obtain very similar BLEU/RIBES scores on the Europarl-test. This is in line with previous work in NMT showing that linguistically motivated pre-ordering leads to small gains (Zhao et al., 2018) or none at all (Du and Way, 2017), and that Transformer-based models are not biased towards monotonic translation (Choshen and Abend, 2019). On the challenge set, scores are slightly more variable, but a manual inspection reveals that this is due to different lexical choices, while word order is always correct for this group of languages. To sum up, in the high-resource setup, our Transformer models are perfectly able to disambiguate the core argument roles when these are consistently encoded by word order.

Fixed-Order vs Random-Order Somewhat surprisingly, the Transformer results are only marginally affected by the random ordering of verb and core arguments. Recall that in the 'Random' language all six possible permutations of (S,V,O) are equally likely. Thus, Transformer shows an excellent ability to reconstruct the correct constituent order in the general-purpose test set. The picture is very different on the challenge set, where RIBES drops severely from 97.6 to 74.1. These low results were to be expected given the challenge set design (it is impossible even for a human to recognize subject from object in the 'Random, no case' challenge set). Nonetheless, they demonstrate that the general-purpose set cannot tell us whether an NMT model has learnt to reliably exploit the syntactic structure of the source language, because of the abundant non-syntactic cues. In fact, even when all source words are shuffled, Transformer still achieves a respectable 25.8/71.2 BLEU/RIBES on the Europarl-test.

Case Marking The key comparison in our study lies between fixed-order and free-order case-marking languages. Here, we find that case marking can indeed restore near-perfect accuracy on the challenge set (98.1 RIBES). However, this only happens when the marking system is completely unambiguous, which, as already mentioned, is true for only about half of the real case-marking languages (Baerman and Brown, 2013). Indeed, the syncretic system visibly improves quality on the challenge set (74.1 to 84.4 RIBES) but remains far behind the fixed-order score (97.6). In terms of overall NMT quality (Europarl-test), fixed-order languages score only marginally higher than the free-order case-marking ones, regardless of the unambiguous/syncretic distinction. Thus our finding that Transformer NMT systems are equally capable of modeling the two types of languages (§4) is also confirmed with more naturalistic language data. That said, we will show in Sect. 5.4 that this positive finding is conditional on the availability of large amounts of training samples.

BiLSTM vs Transformer The LSTM-based results generally correlate with the Transformer results discussed above; however, our recurrent models appear to be slightly more sensitive to changes in the source-side order, in line with previous findings (Choshen and Abend, 2019). Specifically, translation quality on Europarl-test fluctuates slightly more than Transformer among different fixed orders, with the most monotonic order (SVO) leading to the best results. When all words are randomly shuffled, BiLSTM scores drop much more than Transformer. However, when comparing the fixed-order variants to the ones with free order of main constituents, BiLSTM shows only a slightly stronger preference for fixed order, compared to Transformer. This suggests that, by experimenting with arbitrary permutations, Choshen and Abend (2019) might have overestimated the bias of recurrent NMT towards more monotonic translation, whereas the more realistic combination of constituent-level reordering with case marking used in our study is not so problematic for this type of model.

Interestingly, on the challenge set, BiLSTM and Transformer perform on par, with the notable exception that syncretic case is much more difficult for the BiLSTM model. Our results agree with the large drop of subject-verb agreement prediction accuracy observed by Ravfogel et al. (2019) when experimenting with the random order of main constituents. However, their scores were also low for SOV and VOS, which is not the case in our NMT experiments. Besides the fact that our challenge set only contains short sentences (hence no long dependencies and few agreement attractors), our task is considerably different in that agreement only needs to be predicted in the target language, which is fixed-order SVO.

Summary Our results so far suggest that state-of-the-art NMT models, especially if Transformer-based, have little or no bias towards fixed-order languages. In what follows, we study whether this finding is robust to differences in data size, type of morphology, and target language.
5.4 Effect of Data Size and Morphological Features

Data Size The results shown in Table 3 represent a high-resource setting (almost 2M training sentences). While recent successes in cross-lingual transfer learning alleviate the need for labeled data (Liu et al., 2020), their success still depends on the availability of large unlabeled data as well as other, yet to be explained, language properties (Joshi et al., 2020). We then ask: do free-order case-marking languages need more data than fixed-order non-case-marking ones to reach similar NMT quality? We simulate a medium- and a low-resource scenario by sampling 100K and 10K training sentences, respectively, from the full Europarl data. To reduce the number of experiments, we only consider Transformer with one fixed-order language variant (SOV)10 and exclude syncretic case marking. To disentangle the effect of word order from that of case marking on low-resource translation quality, we also experiment with a language variant combining fixed order (SOV) and case marking. Results are shown in Figure 2 and discussed below.

10 We choose SOV because it is a commonly attested word order and is different from that of the target language, thereby requiring some non-trivial reorderings during translation.

Morphological Features The artificial case systems used so far included easily separable suffixes with a 1:1 mapping between grammatical categories and morphemes (e.g. .nsubj.sg, .dobj.pl), reminiscent of agglutinative morphologies. Many world languages, however, do not comply with this 1:1 mapping principle but display flexivity (multiple categories conveyed by one morpheme) and/or exponence (the same category expressed by various, lexically determined, morphemes). Well-studied examples of languages with case+number exponence include Russian and Finnish, while flexive languages include, again, Russian and Latin. Motivated by previous findings on the impact of fine-grained morphological features on language modeling difficulty (Gerz et al., 2018), we experiment with three types of suffixes (see examples in Table 2):

• overt: number and case are denoted by easily separable suffixes (e.g. .nsubj.sg, .dobj.pl), similar to agglutinative languages (1:1);

• implicit: the combination of number and case is expressed by unique suffixes without internal structure (e.g. kar for .nsubj.sg, ker for .dobj.pl), similar to fusional languages. This system displays exponence (many:1);

• implicit with declensions: like the previous, but with three different paradigms, each arbitrarily assigned to a different subset of the lexicon. This system displays exponence and flexivity (many:many).

A complete overview of our morphological paradigms is provided in Appendix A.3. All our languages have moderate inflectional synthesis and, in terms of fusion, are exclusively concatenative. Despite this, the effect on vocabulary size is substantial: a 180% increase with overt and implicit case marking, and 250% with implicit marking with declensions (in the full data setting).

Results Results are shown in the plots of Figure 2 (detailed numerical scores are given in Appendix A.4). We find that reducing training size has, not surprisingly, a major effect on translation quality. Among source language variants, fixed-order obtains the highest quality across all setups. In terms of BLEU (2(a)), the spread among variants increases somewhat with less data, however differences are small. A clearer picture emerges from RIBES (2(b)), whereby less data clearly leads to more disparity. This is already visible in the 100k setup, with the fixed SOV language dominating the others. Case marking, despite being necessary to disambiguate argument roles in the absence of semantic cues, does not improve translation quality and even degrades it in the low-resource setup. Looking at the challenge set results (2(c)), we see that the free-order case-marking languages are clearly disadvantaged: in the mid-resource setup, case marking improves substantially over the underspecified random, no-case language but remains far behind fixed-order. In low-resource conditions, case marking notably hurts quality even in comparison with the underspecified language. These results thus demonstrate that free-order case-marking languages require more data than their fixed-order counterparts to be accurately translated by state-of-the-art NMT.11 Our experiments also show that this greater learning difficulty is not only due to case marking (and the resulting data sparsity), but also to word order flexibility (compare sov+overt to r+overt in Figure 2).

11 In the light of this finding, it would be interesting to revisit the evaluation of Bugliarello et al. (2020) in relation to varying data sizes.
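For reference, the three artificial suffix systems compared in this section can be written out as explicit (case, number)-to-suffix tables. In the sketch below, the overt suffixes and the attested implicit forms (kar, kon, kin, ker, pon, kit) follow Table 2 and the bullet list above; all remaining implicit and declension forms are invented placeholders, and the actual full paradigms are those given in Appendix A.3.

```python
# Overt system: 1:1, easily separable morphemes (agglutinative-style).
OVERT = {
    ("nsubj", "sg"): ".nsubj.sg", ("nsubj", "pl"): ".nsubj.pl",
    ("dobj",  "sg"): ".dobj.sg",  ("dobj",  "pl"): ".dobj.pl",
}

# Implicit system: one unanalyzable suffix per feature bundle (exponence).
# kar/kon/kin/ker are attested in the paper; the mapping is otherwise as described.
IMPLICIT = {
    ("nsubj", "sg"): "kar", ("nsubj", "pl"): "kon",
    ("dobj",  "sg"): "kin", ("dobj",  "pl"): "ker",
}

# Implicit system with three declensions (exponence + flexivity): each lexeme is
# arbitrarily assigned one paradigm. Only kar/pon/kit appear in Table 2; the
# remaining forms below are invented placeholders.
DECLENSIONS = [
    IMPLICIT,
    {("nsubj", "sg"): "kar", ("nsubj", "pl"): "pon",
     ("dobj",  "sg"): "kit", ("dobj",  "pl"): "ket"},
    {("nsubj", "sg"): "kor", ("nsubj", "pl"): "pun",
     ("dobj",  "sg"): "kut", ("dobj",  "pl"): "ket"},
]

def mark(word: str, case: str, number: str, system: str, declension: int = 0) -> str:
    """Attach the artificial suffix for a (case, number) bundle to a word."""
    if system == "overt":
        return word + OVERT[(case, number)]
    if system == "implicit":
        return word + IMPLICIT[(case, number)]
    return word + DECLENSIONS[declension][(case, number)]   # declension system

if __name__ == "__main__":
    print(mark("woman", "nsubj", "sg", "overt"))             # woman.nsubj.sg
    print(mark("woman", "nsubj", "sg", "implicit"))          # womankar
    print(mark("sisters", "nsubj", "pl", "declension", 1))   # sisterspon
```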
[Figure 2: three plots of NMT quality versus training data size (1.9M, 100K, 10K) for the source variants sov, sov+overt, random, r+overt, r+implicit and r+declens. Panels: (a) Europarl-Test BLEU, (b) Europarl-Test RIBES, (c) Challenge RIBES.]

Figure 2: EN*-FR Transformer NMT quality versus training data size (x-axis). Source language variants: fixed-order (SOV) and free-order (random) with different case systems (r+overt/implicit/declens). Scores averaged over three training runs. Detailed numerical results are provided in Appendix A.4.

Regarding different morphology types, we do not observe a consistent trend in terms of overall translation quality (Europarl-test): in some cases, the richest morphology (with declensions) slightly outperforms the one without declensions, a result that would deserve further exploration. On the other hand, results on the challenge set, where most words are case-marked, show that morphological richness inversely correlates with translation quality when data is scarce. We postulate that our artificial morphologies may be too limited in scope (only 3-way case and number marking) to impact overall translation quality, and leave the investigation of richer inflectional synthesis to future work.

5.5 Effect of Target Language

All results so far involved translation into a fixed-order (SVO) language without case marking. To verify the generality of our findings, we repeat a subset of experiments with the same synthetic English variants, but using Czech or Dutch as target languages. Czech has rich fusional morphology including case marking, and very flexible order. Dutch has simple morphology (no case marking) and moderately flexible, syntactically determined order.12

12 Dutch word order is very similar to German, with the position of S, V, and O depending on the type of clause.

Figure 3 shows the results with 100k training sentences. In terms of BLEU, differences are even smaller than in English-French. In terms of RIBES, trends are similar across target languages, with the fixed SOV source language obtaining the best results and the case-marked source language obtaining the worst results. This suggests that the major findings of our study are not due to the specific choice of French as the target language.

[Figure 3: two plots comparing target languages en*-FR, en*-CS and en*-NL for the source variants sov, random and r+overt. Panels: (a) Europarl BLEU, (b) Europarl RIBES.]

Figure 3: Transformer results for more target languages (100k training size). Scores averaged over 2 runs.
6   Related Work                                         art NMT models, especially Transformer-based
                                                         ones, have little or no bias towards fixed-order lan-
The effect of word order flexibility on NLP model
                                                         guages. Our simulated low-resource experiments,
performance has been mostly studied in the field
                                                         however, reveal a different picture, that is: free-
of syntactic parsing, for instance using Aver-
                                                         order case-marking languages require more data
age Dependency Length (Gildea and Temperley,
                                                         to be translated as accurately as their fixed-order
2010; Futrell et al., 2015a) or head-dependent or-
                                                         counterparts. Since parallel data (like labeled data
der entropy (Futrell et al., 2015b; Gulordava and
                                                         in general) is scarce for most of the world lan-
Merlo, 2016) as syntactic correlates of word or-
                                                         guages (Guzmán et al., 2019; Joshi et al., 2020),
der freedom. Related work in language model-
                                                         we believe this should be considered as a further
ing has shown that certain languages are intrinsi-
                                                         obstacle to language equality in future NLP tech-
cally more difficult to model than others (Cotterell
                                                         nologies.
et al., 2018; Mielke et al., 2019) and has further-
                                                            In future work, our analysis should be extended
more studied the impact of fine-grained morphol-
                                                         to target language variants using principled alter-
ogy features (Gerz et al., 2018) on LM perplexity.
                                                         natives to BLEU (Bugliarello et al., 2020), and
Regarding the word order biases of seq-to-seq models, Chaabouni et al. (2019) use miniature languages similar to those of Sect. 4 to study the evolution of LSTM-based agents in a simulated iterated learning setup. Their results in a standard “individual learning” setup show, like ours, that a free-order case-marking toy language can be learned just as well as a fixed-order one, confirming earlier results obtained by simple Elman networks trained for grammatical role classification (Lupyan and Christiansen, 2002). Transformer was not included in these studies. Choshen and Abend (2019) measure the ability of LSTM- and Transformer-based NMT to model a language pair where the same arbitrary (non syntactically motivated) permutation is applied to all source sentences. They find that Transformer is largely indifferent to the order of source words (provided this is fixed and consistent across training and test set) but nonetheless struggles to translate long dependencies actually occurring in natural data. They do not directly study the effect of order flexibility.
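For concreteness, here is one minimal way such a fixed, non-syntactic source-side reordering could be implemented: a pseudo-random permutation that depends only on sentence length, so that the same reordering is applied consistently at training and test time. The seeding scheme and function names are our own illustration, not the exact procedure of Choshen and Abend (2019).

    import random

    def fixed_permutation(length, seed=13):
        """Deterministic, arbitrary permutation of positions 0..length-1.

        Seeding with the sentence length means every sentence of a given
        length is reordered in exactly the same (non-syntactic) way,
        both in the training data and in the test data.
        """
        rng = random.Random(seed * 100003 + length)
        order = list(range(length))
        rng.shuffle(order)
        return order

    def permute_source(tokens, seed=13):
        return [tokens[i] for i in fixed_permutation(len(tokens), seed)]

    print(permute_source("the dog chased the cat".split()))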
The idea of permuting dependency trees to generate synthetic languages was introduced independently by Gulordava and Merlo (2016) (discussed above) and by Wang and Eisner (2016), the latter with the aim of diversifying the set of treebanks currently available for language adaptation.
7   Conclusions

We have presented an in-depth analysis of how Neural Machine Translation difficulty is affected by word order flexibility and case marking in the source language. Although these common language properties were previously shown to negatively affect parsing and agreement prediction accuracy, our main results show that state-of-the-art NMT models, especially Transformer-based ones, have little or no bias towards fixed-order languages. Our simulated low-resource experiments, however, reveal a different picture: free-order case-marking languages require more data to be translated as accurately as their fixed-order counterparts. Since parallel data (like labeled data in general) is scarce for most of the world's languages (Guzmán et al., 2019; Joshi et al., 2020), we believe this should be considered a further obstacle to language equality in future NLP technologies.

In future work, our analysis should be extended to target language variants using principled alternatives to BLEU (Bugliarello et al., 2020), and to other typological features that are likely to affect MT performance, such as inflectional synthesis and degree of fusion (Gerz et al., 2018). Finally, the synthetic languages and challenge set proposed in this paper could be used to evaluate syntax-aware NMT models (Eriguchi et al., 2016; Bisk and Tran, 2018; Currey and Heafield, 2019), which promise to better capture linguistic structure, especially in low-resource scenarios.

Acknowledgements

Arianna Bisazza was partly funded by the Netherlands Organization for Scientific Research (NWO) under project number 639.021.646. We would like to thank the Center for Information Technology of the University of Groningen for providing access to the Peregrine HPC cluster, and the anonymous reviewers for their helpful comments.

References

Duygu Ataman and Marcello Federico. 2018. An evaluation of two vocabulary reduction methods for neural machine translation. In Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Papers), pages 97–110.

Matthew Baerman and Dunstan Brown. 2013. Case syncretism. In Matthew S. Dryer and Martin Haspelmath, editors, The World Atlas of Language Structures Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015.

Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 1–61, Florence, Italy. Association for Computational Linguistics.

Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. 2017. What do neural machine translation models learn about morphology? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 861–872, Vancouver, Canada. Association for Computational Linguistics.

Alexandra Birch, Miles Osborne, and Philipp Koehn. 2008. Predicting success in machine translation. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 745–754, Honolulu, Hawaii. Association for Computational Linguistics.

Yonatan Bisk and Ke Tran. 2018. Inducing grammars with and for neural machine translation. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 25–35, Melbourne, Australia. Association for Computational Linguistics.

Ondřej Bojar, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Philipp Koehn, and Christof Monz. 2018. Findings of the 2018 conference on machine translation (WMT18). In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 272–303, Brussels, Belgium. Association for Computational Linguistics.

Emanuele Bugliarello, Sabrina J. Mielke, Antonios Anastasopoulos, Ryan Cotterell, and Naoaki Okazaki. 2020. It’s easier to translate out of English than into it: Measuring neural translation difficulty by cross-mutual information. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1640–1649, Online. Association for Computational Linguistics.

Rahma Chaabouni, Eugene Kharitonov, Alessandro Lazaric, Emmanuel Dupoux, and Marco Baroni. 2019. Word-order biases in deep-agent emergent communication. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5166–5175, Florence, Italy. Association for Computational Linguistics.

David Chiang and Kevin Knight. 2006. An introduction to synchronous grammars. Tutorial available at http://www.isi.edu/~chiang/papers/synchtut.pdf.

Leshem Choshen and Omri Abend. 2019. Automatically extracting challenge sets for non-local phenomena in neural machine translation. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 291–303, Hong Kong, China. Association for Computational Linguistics.

Bernard Comrie. 1981. Language Universals and Linguistic Typology. Blackwell.

Ryan Cotterell, Sabrina J. Mielke, Jason Eisner, and Brian Roark. 2018. Are all languages equally hard to language-model? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 536–541, New Orleans, Louisiana. Association for Computational Linguistics.

Anna Currey and Kenneth Heafield. 2019. Incorporating source syntax into transformer-based neural machine translation. In Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers), pages 24–33, Florence, Italy. Association for Computational Linguistics.

Matthew S. Dryer. 2013. Order of subject, object and verb. In Matthew S. Dryer and Martin Haspelmath, editors, The World Atlas of Language Structures Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.

Matthew S. Dryer and Martin Haspelmath, editors. 2013. WALS Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.

Jinhua Du and Andy Way. 2017. Pre-reordering for neural machine translation: Helpful or harmful? The Prague Bulletin of Mathematical Linguistics, 108(1):171–182.

Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211.

Akiko Eriguchi, Kazuma Hashimoto, and Yoshimasa Tsuruoka. 2016. Tree-to-sequence attentional neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 823–833, Berlin, Germany. Association for Computational Linguistics.

Richard Futrell, Kyle Mahowald, and Edward Gibson. 2015a. Large-scale evidence of dependency length minimization in 37 languages. Proceedings of the National Academy of Sciences, 112(33):10336–10341.

Richard Futrell, Kyle Mahowald, and Edward Gibson. 2015b. Quantifying word order freedom in dependency corpora. In Proceedings of the Third International Conference on Dependency Linguistics (Depling 2015), pages 91–100, Uppsala, Sweden. Uppsala University.

Daniela Gerz, Ivan Vulić, Edoardo Maria Ponti, Roi Reichart, and Anna Korhonen. 2018. On the relation between linguistic typology and (limitations of) multilingual language modeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 316–327, Brussels, Belgium. Association for Computational Linguistics.

Daniel Gildea and David Temperley. 2010. Do grammars minimize dependency length? Cognitive Science, 34(2):286–310.

Joseph H. Greenberg. 1963. Some universals of grammar with particular reference to the order of meaningful elements. In Joseph H. Greenberg, editor, Universals of Human Language, pages 73–113. MIT Press, Cambridge, Mass.

Kristina Gulordava, Piotr Bojanowski, Edouard Grave, Tal Linzen, and Marco Baroni. 2018. Colorless green recurrent networks dream hierarchically. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1195–1205, New Orleans, Louisiana. Association for Computational Linguistics.

Kristina Gulordava and Paola Merlo. 2015. Diachronic trends in word order freedom and dependency length in dependency-annotated corpora of Latin and ancient Greek. In Proceedings of the Third International Conference on Dependency Linguistics (Depling 2015), pages 121–130, Uppsala, Sweden. Uppsala University.

Kristina Gulordava and Paola Merlo. 2016. Multilingual dependency parsing evaluation: a large-scale analysis of word order properties using artificial data. Transactions of the Association for Computational Linguistics, 4:343–356.

Francisco Guzmán, Peng-Jen Chen, Myle Ott, Juan Pino, Guillaume Lample, Philipp Koehn, Vishrav Chaudhary, and Marc’Aurelio Ranzato. 2019. The FLORES evaluation datasets for low-resource machine translation: Nepali–English and Sinhala–English. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6100–6113.

Michael Hahn. 2020. Theoretical limitations of self-attention in neural sequence models. Transactions of the Association for Computational Linguistics, 8:156–171.

Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, et al. 2018. Achieving human parity on automatic Chinese to English news translation. arXiv preprint arXiv:1803.05567.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Hideki Isozaki, Tsutomu Hirao, Kevin Duh, Katsuhito Sudoh, and Hajime Tsukada. 2010. Automatic evaluation of translation quality for distant language pairs. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 944–952, Cambridge, MA. Association for Computational Linguistics.

Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2020. The state and fate of linguistic diversity and inclusion in the NLP world. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6282–6293, Online. Association for Computational Linguistics.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the Tenth Machine Translation Summit, pages 79–86. International Association for Machine Translation.

Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8:726–742.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1412–1421, Lisbon, Portugal. Association for Computational Linguistics.

Gary Lupyan and Morten H. Christiansen. 2002. Case, word order, and language learnability: Insights from connectionist modeling. In Proceedings of the Twenty-Fourth Annual Conference of the Cognitive Science Society.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.

Mitchell Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

Sabrina J. Mielke, Ryan Cotterell, Kyle Gorman, Brian Roark, and Jason Eisner. 2019. What kind of language is hard to language-model? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4975–4989, Florence, Italy. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL '02), pages 311–318, Stroudsburg, PA, USA. Association for Computational Linguistics.

Emmanouil Antonios Platanios, Mrinmaya Sachan, Graham Neubig, and Tom Mitchell. 2018. Contextual parameter generation for universal neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 425–435.

Martin Popel, Marketa Tomkova, Jakub Tomek, Łukasz Kaiser, Jakob Uszkoreit, Ondřej Bojar, and Zdeněk Žabokrtský. 2020. Transforming machine translation: a deep learning system reaches news translation quality comparable to human professionals. Nature Communications, 11(1):4381.

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations.

Shauli Ravfogel, Yoav Goldberg, and Tal Linzen. 2019. Studying the inductive biases of RNNs with synthetic variations of natural languages. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3532–3542, Minneapolis, Minnesota. Association for Computational Linguistics.

Benoît Sagot. 2013. Comparing complexity measures. In Computational Approaches to Morphological Complexity, Paris, France. Surrey Morphology Group.

Karthik Abinav Sankararaman, Soham De, Zheng Xu, W. Ronny Huang, and Tom Goldstein. 2020. Analyzing the effect of neural network architecture on training performance. In Proceedings of Machine Learning and Systems 2020, pages 9834–9845.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Kaius Sinnemäki. 2008. Complexity trade-offs in core argument marking. In Language Complexity, pages 67–88. John Benjamins.

Dan I. Slobin. 1966. The acquisition of Russian as a native language. In The Genesis of Language: A Psycholinguistic Approach, pages 129–148.

Dan I. Slobin and Thomas G. Bever. 1982. Children use canonical sentence schemas: A crosslinguistic study of word order and inflections. Cognition, 12(3):229–265.

Ke Tran, Arianna Bisazza, and Christof Monz. 2018. The importance of being recurrent for modeling hierarchical structure. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4731–4736, Brussels, Belgium. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4–9 December 2017, Long Beach, CA, USA, pages 5998–6008.

Dingquan Wang and Jason Eisner. 2016. The galactic dependencies treebanks: Getting more data by synthesizing new languages. Transactions of the Association for Computational Linguistics, 4:491–505.

Adina Williams, Tiago Pimentel, Hagen Blix, Arya D. McCarthy, Eleanor Chodroff, and Ryan Cotterell. 2020. Predicting declension class from form and meaning. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6682–6695, Online. Association for Computational Linguistics.

Yang Zhao, Jiajun Zhang, and Chengqing Zong. 2018. Exploiting pre-ordering for neural machine translation. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).

A    Appendices

A.1    NMT Hyperparameters

In the toy parallel grammar experiments (§4), a batch size of 64 sentences and a maximum of 1K update steps are used for all models. We train the BiLSTM with a learning rate of 1, and the Transformer with a learning rate of 2 and 40 warm-up steps, using Noam learning rate decay. Dropout ratios of 0.3 and 0.1 are used for the BiLSTM and Transformer models, respectively. In the synthetic English variants experiments (§5), we set a constant learning rate of 0.001 for the BiLSTM. We also increased the batch size to 128, the number of warm-up steps to 80K, and the number of update steps to 2M for all models. Finally, for the 100k and 10k data-size experiments, we decreased the warm-up steps to 4K. During evaluation we chose the best-performing model on the validation set.
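As an informal summary of these settings, the snippet below collects them into plain Python dictionaries; the key names are our own shorthand and are not tied to any particular NMT toolkit's configuration format.

    # Hyperparameters reported above, gathered for reference only.
    TOY_GRAMMAR_EXPERIMENTS = {  # Section 4
        "batch_size_sentences": 64,
        "max_update_steps": 1_000,
        "bilstm": {"learning_rate": 1.0, "dropout": 0.3},
        "transformer": {
            "learning_rate": 2.0,
            "lr_schedule": "noam",
            "warmup_steps": 40,
            "dropout": 0.1,
        },
    }

    SYNTHETIC_ENGLISH_EXPERIMENTS = {  # Section 5; other settings as above unless noted
        "batch_size_sentences": 128,
        "warmup_steps": 80_000,         # lowered to 4_000 for the 100k/10k data sizes
        "max_update_steps": 2_000_000,
        "bilstm_learning_rate": 0.001,  # constant (no decay)
    }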
A.2    Challenge Set

The English-French challenge set used in this paper, and available at https://github.com/arianna-bis/freeorder-mt, is generated by a small synchronous context-free grammar and contains 7,200 simple sentences consisting of a subject, a transitive verb, and an object (see Table 4). All sentences are in the present tense; half are affirmative, and half negative. All nouns in the grammar can plausibly act as both subject and object of the verbs, so that an MT system must rely on sentence structure to get perfect translation accuracy. The sentences are from a general domain, but we specifically choose nouns and verbs with little translation ambiguity that are well represented in the Europarl corpus: most have thousands of occurrences, while the rarest word has about 80. Sentence example (English side): ‘The
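For illustration, the sketch below generates sentence pairs of this shape from a tiny synchronous grammar: every subject-verb-object combination is produced once affirmative and once negative in both languages. The lexicon and the exact combinatorics here are hypothetical placeholders, not the vocabulary or grammar of the released challenge set.

    from itertools import product

    # Hypothetical bilingual lexicon; the real challenge set uses nouns and
    # verbs chosen for low translation ambiguity and high Europarl frequency.
    NOUNS = [("doctor", "médecin"), ("judge", "juge"), ("minister", "ministre")]
    VERBS = [("calls", "call", "appelle", "n'appelle pas"),
             ("helps", "help", "aide", "n'aide pas")]

    def generate_pairs():
        """Yield (English, French) pairs for every subject/verb/object
        combination, once affirmative and once negative."""
        for (s_en, s_fr), (o_en, o_fr) in product(NOUNS, NOUNS):
            if s_en == o_en:
                continue  # skip identical subject and object (a simplification)
            for v_en_3sg, v_en_base, v_fr_aff, v_fr_neg in VERBS:
                yield (f"The {s_en} {v_en_3sg} the {o_en}.",
                       f"Le {s_fr} {v_fr_aff} le {o_fr}.")
                yield (f"The {s_en} does not {v_en_base} the {o_en}.",
                       f"Le {s_fr} {v_fr_neg} le {o_fr}.")

    for en, fr in list(generate_pairs())[:4]:
        print(en, "|||", fr)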