Graph Convolutional Encoders for Syntax-aware Neural Machine Translation


Jasmijn Bastings¹  Ivan Titov¹,²  Wilker Aziz¹  Diego Marcheggiani¹  Khalil Sima’an¹
¹ILLC, University of Amsterdam   ²ILCC, University of Edinburgh
{bastings,titov,w.aziz,marcheggiani,k.simaan}@uva.nl

Abstract

We present a simple and effective approach to incorporating syntactic structure into neural attention-based encoder-decoder models for machine translation. We rely on graph-convolutional networks (GCNs), a recent class of neural networks developed for modeling graph-structured data. Our GCNs use predicted syntactic dependency trees of source sentences to produce representations of words (i.e. hidden states of the encoder) that are sensitive to their syntactic neighborhoods. GCNs take word representations as input and produce word representations as output, so they can easily be incorporated as layers into standard encoders (e.g., on top of bidirectional RNNs or convolutional neural networks). We evaluate their effectiveness with English-German and English-Czech translation experiments for different types of encoders and observe substantial improvements over their syntax-agnostic versions in all the considered setups.

1 Introduction

Neural machine translation (NMT) is one of the success stories of deep learning in natural language processing, with recent NMT systems outperforming traditional phrase-based approaches on many language pairs (Sennrich et al., 2016a). State-of-the-art NMT systems rely on sequential encoder-decoders (Sutskever et al., 2014; Bahdanau et al., 2015) and lack any explicit modeling of syntax or any hierarchical structure of language. One potential reason why we have not seen much benefit from using syntactic information in NMT is the lack of simple and effective methods for incorporating structured information into neural encoders, including RNNs. Despite some successes, the techniques explored so far either incorporate syntactic information in NMT models in a relatively indirect way (e.g., multi-task learning (Luong et al., 2015a; Nadejde et al., 2017; Eriguchi et al., 2017; Hashimoto and Tsuruoka, 2017)) or may be too restrictive in modeling the interface between syntax and the translation task (e.g., learning representations of linguistic phrases (Eriguchi et al., 2016)). Our goal is to provide the encoder with access to rich syntactic information but let it decide which aspects of syntax are beneficial for MT, without placing rigid constraints on the interaction between syntax and the translation task. This goal is in line with claims that rigid syntactic constraints typically hurt MT (Zollmann and Venugopal, 2006; Smith and Eisner, 2006; Chiang, 2010), and, though these claims have been made in the context of traditional MT systems, we believe they are no less valid for NMT.

Attention-based NMT systems (Bahdanau et al., 2015; Luong et al., 2015b) represent source-sentence words as latent-feature vectors in the encoder and use these vectors when generating a translation. Our goal is to automatically incorporate information about the syntactic neighborhoods of source words into these feature vectors, and thus potentially improve the quality of the translation output. Since the vectors correspond to words, it is natural for us to use dependency syntax. Dependency trees (see Figure 1) represent syntactic relations between words: for example, monkey is the subject of the predicate eats, and banana is its object.

In order to produce syntax-aware feature representations of words, we exploit graph convolutional networks (GCNs) (Duvenaud et al., 2015; Defferrard et al., 2016; Kearnes et al., 2016; Kipf and Welling, 2016).
Figure 1: A dependency tree for the example sentence: “The monkey eats a banana.”

GCNs can be regarded as computing a latent-feature representation of a node (i.e. a real-valued vector) based on its k-th order neighborhood (i.e. nodes at most k hops away from the node) (Gilmer et al., 2017). They are generally simple and computationally inexpensive. We use syntactic GCNs, a version of GCNs operating on top of syntactic dependency trees, recently shown effective in the context of semantic role labeling (Marcheggiani and Titov, 2017).

Since syntactic GCNs produce representations at the word level, it is straightforward to use them as encoders within the attention-based encoder-decoder framework. As NMT systems are trained end-to-end, the GCNs end up capturing syntactic properties specifically relevant to the translation task. Though GCNs can take word embeddings as input, we will see that they are more effective when used as layers on top of recurrent neural network (RNN) or convolutional neural network (CNN) encoders (Gehring et al., 2016), enriching their states with syntactic information. A comparison to RNNs is the most challenging test for GCNs, as it has been shown that RNNs (e.g., LSTMs) are able to capture certain syntactic phenomena (e.g., subject-verb agreement) reasonably well on their own, without explicit treebank supervision (Linzen et al., 2016; Shi et al., 2016). Nevertheless, GCNs appear beneficial even in this challenging set-up: we obtain +1.2 and +0.7 BLEU point improvements from using syntactic GCNs on top of bidirectional RNNs for English-German and English-Czech, respectively.

In principle, GCNs are flexible enough to incorporate any linguistic structure that can be represented as a graph (e.g., dependency-based semantic-role labeling representations (Surdeanu et al., 2008), AMR semantic graphs (Banarescu et al., 2012) and co-reference chains). For example, unlike recursive neural networks (Socher et al., 2013), GCNs do not require the graphs to be trees. However, in this work we focus solely on dependency syntax and leave a more general investigation for future work.

Our main contributions can be summarized as follows:

• we introduce a method for incorporating structure into NMT using syntactic GCNs;
• we show that GCNs can be used along with RNN and CNN encoders;
• we show that incorporating structure is beneficial for machine translation on English-Czech and English-German.

2 Background

Notation. We use x for vectors, x_{1:t} for a sequence of t vectors, and X for matrices. The i-th value of vector x is denoted by x_i. We use ◦ for vector concatenation.

2.1 Neural Machine Translation

In NMT (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Cho et al., 2014b), given example translation pairs from a parallel corpus, a neural network is trained to directly estimate the conditional distribution p(y_{1:T_y} | x_{1:T_x}) of translating a source sentence x_{1:T_x} (a sequence of T_x words) into a target sentence y_{1:T_y}. NMT models typically consist of an encoder, a decoder, and some method for conditioning the decoder on the encoder, for example, an attention mechanism. We will now briefly describe the components that we use in this paper.

2.1.1 Encoders

An encoder is a function that takes the source sentence as input and produces a representation encoding its semantic content. We describe recurrent, convolutional and bag-of-words encoders.

Recurrent. Recurrent neural networks (RNNs) (Elman, 1990) model sequential data. They receive one input vector at each time step and update their hidden state to summarize all inputs up to that point. Given an input sequence x_{1:T_x} = x_1, x_2, ..., x_{T_x} of word embeddings, an RNN is defined recursively as follows:

    RNN(x_{1:t}) = f(x_t, RNN(x_{1:t-1}))

where f is a nonlinear function such as an LSTM (Hochreiter and Schmidhuber, 1997) or a GRU (Cho et al., 2014b). We will use the function RNN as an abstract mapping from an input sequence x_{1:T_x} to a final hidden state RNN(x_{1:T_x}), regardless of the nonlinearity used. To summarize not only the past of a word but also its future, a bidirectional RNN (Schuster and Paliwal, 1997; Irsoy and Cardie, 2014) is often used.
A bidirectional RNN reads the input sentence in two directions and then concatenates the states for each time step:

    BiRNN(x_{1:T_x}, t) = RNN_F(x_{1:t}) ◦ RNN_B(x_{T_x:t})

where RNN_F and RNN_B are the forward and backward RNNs, respectively. For further details we refer to the encoder of Bahdanau et al. (2015).

Convolutional. Convolutional neural networks (CNNs) apply a fixed-size window over the input sequence to capture the local context of each word (Gehring et al., 2016). One advantage of this approach over RNNs is that it allows for fast parallel computation, at the cost of sacrificing non-local context. To remedy the loss of context, multiple CNN layers can be stacked. Formally, given an input sequence x_{1:T_x}, we define a CNN as follows:

    CNN(x_{1:T_x}, t) = f(x_{t−⌊w/2⌋}, ..., x_t, ..., x_{t+⌊w/2⌋})

where f is a nonlinear function, typically a linear transformation followed by a ReLU, and w is the size of the window.

Bag-of-Words. In a bag-of-words (BoW) encoder every word is simply represented by its word embedding. To give the decoder some sense of word position, position embeddings (PE) may be added. There are different strategies for defining position embeddings; in this paper we choose to learn a vector for each absolute word position up to a certain maximum length. We then represent the t-th word in a sequence as follows:

    BoW(x_{1:T_x}, t) = x_t + p_t

where x_t is the word embedding and p_t is the t-th position embedding.
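To make the three encoders above concrete, here is a minimal NumPy sketch of their forward passes. It is an illustration only, not the Neural Monkey code used in the paper: dimensions are toy-sized, parameters are random, nothing is trained, and the helper names (bow, cnn, gru_step, birnn) are ours. The GRU follows the standard update/reset-gate formulation of Cho et al. (2014b).

```python
import numpy as np

rng = np.random.default_rng(0)
d, Tx, w = 4, 5, 3                      # embedding size, sentence length, CNN window
X = rng.normal(size=(Tx, d))            # word embeddings x_1..x_Tx (one row per word)
P = rng.normal(size=(Tx, d))            # learned absolute position embeddings

def bow(X, P):
    """BoW encoder: word embedding plus position embedding, BoW(x, t) = x_t + p_t."""
    return X + P

def cnn(X, w):
    """1-layer CNN encoder: linear map over a window of w words, then ReLU."""
    W = rng.normal(size=(w * d, d))
    pad = np.zeros((w // 2, d))
    Xp = np.vstack([pad, X, pad])                        # *PAD* tokens at both ends
    H = np.stack([Xp[t:t + w].reshape(-1) @ W for t in range(X.shape[0])])
    return np.maximum(H, 0.0)

def gru_step(x, h, params):
    """One GRU step f(x_t, h_{t-1}) with update gate z and reset gate r."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = 1 / (1 + np.exp(-(x @ Wz + h @ Uz)))
    r = 1 / (1 + np.exp(-(x @ Wr + h @ Ur)))
    h_tilde = np.tanh(x @ Wh + (r * h) @ Uh)
    return (1 - z) * h + z * h_tilde

def birnn(X):
    """Bidirectional GRU: concatenate forward and backward states per time step."""
    params_f = [rng.normal(size=(d, d)) for _ in range(6)]
    params_b = [rng.normal(size=(d, d)) for _ in range(6)]
    def run(xs, params):
        h, states = np.zeros(d), []
        for x in xs:
            h = gru_step(x, h, params)
            states.append(h)
        return states
    fwd = run(X, params_f)
    bwd = run(X[::-1], params_b)[::-1]
    return np.stack([np.concatenate([f, b]) for f, b in zip(fwd, bwd)])

print(bow(X, P).shape, cnn(X, w).shape, birnn(X).shape)   # (5, 4) (5, 4) (5, 8)
```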
2.1.2 Decoder

A decoder produces the target sentence conditioned on the representation of the source sentence induced by the encoder. In Bahdanau et al. (2015) the decoder is implemented as an RNN conditioned on an additional input c_i, the context vector, which is dynamically computed at each time step using an attention mechanism. The probability of a target word y_i is then a function of the decoder RNN state, the previous target word embedding, and the context vector. The model is trained end-to-end for maximum log-likelihood of the next target word given its context.
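As an illustration of how the context vector can be computed, the sketch below implements additive (Bahdanau-style) attention for a single decoder step. The parametrization and all names are ours and are only meant to convey the idea, not to reproduce the exact model.

```python
import numpy as np

rng = np.random.default_rng(4)
Tx, d_enc, d_dec, d_att = 5, 8, 6, 7

H = rng.normal(size=(Tx, d_enc))        # encoder states h_1..h_Tx (one per source word)
s = rng.normal(size=d_dec)              # current decoder state
Wa = rng.normal(size=(d_dec, d_att))    # attention parameters (ours, for illustration)
Ua = rng.normal(size=(d_enc, d_att))
va = rng.normal(size=d_att)

# Additive attention: score each source position, normalize, take a weighted sum.
scores = np.tanh(s @ Wa + H @ Ua) @ va  # one score per source word
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                    # attention weights over source words
context = alpha @ H                     # context vector c_i, fed to the decoder RNN
print(context.shape)                    # (8,)
```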

   The probability of a target word yi is now a
                                                                                              (0)
function of the decoder RNN state, the previous          where j indexes the layer, and hv = xv .
target word embedding, and the context vector.              1
                                                             We dropped the normalization factor used by Kipf
The model is trained end-to-end for maximum log          and Welling (2016), as it is not used in syntactic GCNs
likelihood of the next target word given its context.    of Marcheggiani and Titov (2017).
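The two equations above translate almost directly into code. The following NumPy sketch runs a stacked GCN forward pass over a small undirected graph with self-loops (random parameters, no training, and no normalization factor, matching the footnote); the graph itself is a toy example of ours.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, layers = 5, 4, 2

# Undirected edges plus self-loops: N(v) always contains v itself.
edges = {(0, 1), (1, 2), (2, 3), (3, 4)}
neighbors = {v: {v} for v in range(n)}
for u, v in edges:
    neighbors[u].add(v)
    neighbors[v].add(u)

X = rng.normal(size=(n, d))                         # node features (word embeddings)

def gcn_layer(H, W, b):
    """h_v^(j+1) = rho( sum_{u in N(v)} W h_u + b ), with rho = ReLU."""
    out = np.zeros_like(H)
    for v in range(n):
        msg = sum(H[u] @ W for u in neighbors[v]) + b
        out[v] = np.maximum(msg, 0.0)
    return out

H = X
for j in range(layers):                             # k layers -> k-hop neighborhoods
    W, b = rng.normal(size=(d, d)), rng.normal(size=d)
    H = gcn_layer(H, W, b)

print(H.shape)                                      # (5, 4)
```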
Figure 2: A 2-layer syntactic GCN on top of a convolutional encoder. Loop connections are depicted
with dashed edges, syntactic ones with solid (dependents to heads) and dotted (heads to dependents)
edges. Gates and some labels are omitted for clarity.

2.3 Syntactic GCNs

Marcheggiani and Titov (2017) generalize GCNs to operate on directed and labeled graphs.² This makes it possible to use linguistic structures such as dependency trees, where directionality and edge labels play an important role. They also integrate edge-wise gates, which let the model regulate the contributions of individual dependency edges. We will briefly describe these modifications.

² For an alternative approach to integrating labels and directions, see applications of GCNs to statistical relation learning (Schlichtkrull et al., 2017).

Directionality. In order to deal with the directionality of edges, separate weight matrices are used for incoming and outgoing edges. We follow the convention that in dependency trees heads point to their dependents; thus outgoing edges are used for head-to-dependent connections, and incoming edges are used for dependent-to-head connections. Modifying the recursive computation for directionality, we arrive at:

    h_v^{(j+1)} = ρ( ∑_{u ∈ N(v)} W_{dir(u,v)}^{(j)} h_u^{(j)} + b_{dir(u,v)}^{(j)} )

where dir(u, v) selects the weight matrix associated with the directionality of the edge connecting u and v (i.e. W_IN for u-to-v, W_OUT for v-to-u, and W_LOOP for v-to-v). Note that self-loops are modeled separately, so there are now three times as many parameters as in a non-directional GCN.

Labels. Making the GCN sensitive to labels is straightforward given the above modifications for directionality. Instead of using separate matrices for each direction, separate matrices are now defined for each direction and label combination:

    h_v^{(j+1)} = ρ( ∑_{u ∈ N(v)} W_{lab(u,v)}^{(j)} h_u^{(j)} + b_{lab(u,v)}^{(j)} )

where we incorporate the directionality of an edge directly in its label. Importantly, to prevent over-parametrization, only the bias terms are made label-specific; in other words: W_{lab(u,v)} = W_{dir(u,v)}. The resulting syntactic GCN is illustrated in Figure 2 (shown on top of a CNN, as we will explain in the subsequent section).

Edge-wise gating. Syntactic GCNs also include gates, which can down-weight the contribution of individual edges. They also allow the model to deal with noisy predicted structure, i.e. to ignore potentially erroneous syntactic edges. For each edge, a scalar gate is calculated as follows:

    g_{u,v}^{(j)} = σ( h_u^{(j)} · ŵ_{dir(u,v)}^{(j)} + b̂_{lab(u,v)}^{(j)} )

where σ is the logistic sigmoid function, and ŵ_{dir(u,v)}^{(j)} ∈ R^d and b̂_{lab(u,v)}^{(j)} ∈ R are learned parameters for the gate. The computation becomes:

    h_v^{(j+1)} = ρ( ∑_{u ∈ N(v)} g_{u,v}^{(j)} ( W_{dir(u,v)}^{(j)} h_u^{(j)} + b_{lab(u,v)}^{(j)} ) )
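Putting the three modifications together, a single syntactic GCN layer can be sketched as follows. This is our own toy illustration over the dependency tree of Figure 1, with random parameters: direction-specific weight matrices (in, out, loop), biases specific to each direction-and-label combination, and a scalar gate per edge.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
words = ["The", "monkey", "eats", "a", "banana"]
# Dependency arcs as (head, dependent, label), as in Figure 1.
arcs = [(1, 0, "det"), (2, 1, "nsubj"), (4, 3, "det"), (2, 4, "dobj")]

labels = sorted({lab for _, _, lab in arcs})
# One weight matrix per direction; biases (and gate biases) per direction + label.
W = {dirn: rng.normal(size=(d, d)) for dirn in ("in", "out", "loop")}
b = {key: rng.normal(size=d) for key in
     [("loop", "self")] + [(dirn, lab) for dirn in ("in", "out") for lab in labels]}
w_gate = {dirn: rng.normal(size=d) for dirn in ("in", "out", "loop")}
b_gate = {key: rng.normal() for key in b}

# Incoming messages for every node v: (u, direction, label), including the self-loop.
incoming = {v: [(v, "loop", "self")] for v in range(len(words))}
for head, dep, lab in arcs:
    incoming[dep].append((head, "in", lab))    # dependent receives from its head
    incoming[head].append((dep, "out", lab))   # head receives along its outgoing edge

def syntactic_gcn_layer(H):
    out = np.zeros_like(H)
    for v, msgs in incoming.items():
        total = np.zeros(d)
        for u, dirn, lab in msgs:
            gate = 1 / (1 + np.exp(-(H[u] @ w_gate[dirn] + b_gate[(dirn, lab)])))
            total += gate * (H[u] @ W[dirn] + b[(dirn, lab)])
        out[v] = np.maximum(total, 0.0)        # rho = ReLU
    return out

H = rng.normal(size=(len(words), d))           # e.g. BiRNN or CNN states
print(syntactic_gcn_layer(H).shape)            # (5, 4)
```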
3 Graph Convolutional Encoders

In this work we focus on exploiting structural information on the source side, i.e. in the encoder. We hypothesize that using an encoder that incorporates syntax will lead to more informative representations of words, and that these representations, when used as context vectors by the decoder, will lead to an improvement in translation quality. Consequently, in all our models we use the decoder of Bahdanau et al. (2015) and keep this part of the model constant. As is now common practice, we do not use a maxout layer in the decoder, but apart from this we do not deviate from the original definition. In all models we use GRUs (Cho et al., 2014b) as our RNN units.

Our models vary in the encoder part, where we exploit the power of GCNs to induce syntactically-aware representations. We now define a series of encoders of increasing complexity.

BoW + GCN. In our first and simplest model, we propose a bag-of-words encoder (with position embeddings, see §2.1.1) with a GCN on top. In other words, the inputs h^{(0)} are a sum of the embedding of a word and the embedding of its position in the sentence. Since the original BoW encoder captures linear-ordering information only in a very crude way (through the position embeddings), the structural information provided by the GCN should be highly beneficial.

Convolutional + GCN. In our second model, we use convolutional neural networks to learn word representations. CNNs are fast, but by definition only use a limited window of context. Instead of the approach used by Gehring et al. (2016) (i.e. stacking multiple CNN layers on top of each other), we use a GCN to enrich the one-layer CNN representations. Figure 2 shows this model. Note that, while the figure shows a CNN with a window size of 3, we will use a larger window size of 5 in our experiments. We expect this model to perform better than BoW + GCN because of the additional local context captured by the CNN.

BiRNN + GCN. In our third and most powerful model, we employ bidirectional recurrent neural networks. In this model, we start by encoding the source sentence using a BiRNN (i.e. a BiGRU) and use the resulting hidden states as input to a GCN. Instead of relying on linear order only, the GCN allows the encoder to ‘teleport’ over parts of the input sentence, along dependency edges, connecting words that otherwise might be far apart. The model might not only benefit from this teleporting capability, however; the nature of the relations between words (i.e. dependency relation types) may also be useful, and the GCN exploits this information (see §2.3 for details).

This is the most challenging setup for GCNs, as RNNs have been shown capable of capturing at least some degree of syntactic information without explicit supervision (Linzen et al., 2016), and hence they should be hard to improve upon by incorporating treebank syntax.

Marcheggiani and Titov (2017) did not observe improvements from using multiple GCN layers in semantic role labeling. However, we do expect that propagating information from further in the tree should be beneficial in principle. We hypothesize that the first layer is the most influential one, capturing most of the syntactic context, and that additional layers only modestly modify the representations. To ease optimization, we add a residual connection (He et al., 2016) between the GCN layers when using more than one layer.
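A minimal sketch of how the pieces compose in the BiRNN + GCN encoder (our illustration, not the actual Neural Monkey implementation): a syntax-agnostic encoder produces per-word states, a stack of GCN layers transforms them, and a residual connection is added between GCN layers when more than one layer is used. The stand-in functions below abstract over the encoder and GCN layer sketched earlier.

```python
import numpy as np

rng = np.random.default_rng(3)
Tx, d = 5, 8

def dummy_birnn(X):
    """Stand-in for the BiRNN from the earlier sketch: (Tx, d) -> (Tx, d)."""
    return np.tanh(X @ rng.normal(size=(d, d)))

def dummy_gcn_layer(H):
    """Stand-in for one syntactic GCN layer: (Tx, d) -> (Tx, d)."""
    return np.maximum(H @ rng.normal(size=(d, d)), 0.0)

def encode(X, rnn, gcn_layers):
    """Encoder used in the BiRNN + GCN model: RNN states feed the GCN stack;
    a residual connection is added between GCN layers when more than one is used."""
    h = rnn(X)
    for j, layer in enumerate(gcn_layers):
        out = layer(h)
        h = out if j == 0 else h + out       # residual connection between GCN layers
    return h                                 # per-word states used by the attention

X = rng.normal(size=(Tx, d))                 # source word embeddings
H = encode(X, dummy_birnn, [dummy_gcn_layer, dummy_gcn_layer])
print(H.shape)                               # (5, 8)
```

The BoW + GCN and Convolutional + GCN variants are obtained by swapping the first stage for the BoW or CNN sketch from §2.1.1.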
4 Experiments

Experiments are performed using the Neural Monkey toolkit³ (Helcl and Libovický, 2017), which implements the model of Bahdanau et al. (2015) in TensorFlow. We use the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.001 (0.0002 for CNN models).⁴ The batch size is set to 80. Between layers we apply dropout with a probability of 0.2, and in experiments with GCNs⁵ we use the same value for edge dropout. We train for 45 epochs, evaluating the BLEU performance of the model every epoch on the validation set. For testing, we select the model with the highest validation BLEU. L2 regularization is used with a value of 10⁻⁸. All model selection (including hyperparameter selection) was performed on the validation set. In all experiments we obtain translations using a greedy decoder, i.e. we select the output token with the highest probability at each time step.

We will describe an artificial experiment in §4.1 and MT experiments in §4.2.

³ https://github.com/ufal/neuralmonkey
⁴ Like Gehring et al. (2016), we note that Adam is too aggressive for CNN models, hence we use a lower learning rate.
⁵ GCN code at https://github.com/bastings/neuralmonkey
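For reference, the training settings from the paragraph above collected in one place, as a plain Python dictionary. The key names are ours and are not Neural Monkey configuration options.

```python
TRAINING_CONFIG = {
    "optimizer": "adam",
    "learning_rate": 1e-3,          # 2e-4 for CNN-based models
    "batch_size": 80,
    "dropout": 0.2,                 # same value used for edge dropout in GCN models
    "epochs": 45,
    "l2": 1e-8,
    "decoding": "greedy",
    "model_selection": "highest validation BLEU",
}
```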
4.1 Reordering artificial sequences

Our goal here is to provide an intuition for the capabilities of GCNs. We define a reordering task where randomly permuted sequences need to be put back into the original order. We encode the original order using edges, and test whether GCNs can successfully exploit them. Note that this task is not meant to provide a fair comparison to RNNs: the input (besides the edges) simply does not carry any information about the original ordering, so RNNs cannot possibly solve this task.

Data. From a vocabulary of 26 types, we generate random sequences of 3-10 tokens. We then randomly permute them, pointing every token to its original predecessor with a label sampled from a set of 5 labels. Additionally, we point every token to an arbitrary position in the sequence with a label from a distinct set of 5 ‘fake’ labels. We sample 25000 training and 1000 validation sequences.
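One possible way to generate such data is sketched below. This follows our reading of the setup; the exact sampling details of the original experiment may differ.

```python
import random

random.seed(0)
VOCAB = [chr(c) for c in range(ord("a"), ord("z") + 1)]    # 26 types
REAL_LABELS = [f"real{i}" for i in range(5)]
FAKE_LABELS = [f"fake{i}" for i in range(5)]

def make_example():
    """Return a permuted token sequence plus labeled edges encoding the original order."""
    n = random.randint(3, 10)
    original = [random.choice(VOCAB) for _ in range(n)]
    order = list(range(n))
    random.shuffle(order)
    permuted = [original[i] for i in order]
    edges = []
    for pos, i in enumerate(order):
        if i > 0:                                  # real edge: points to the token's
            pred = order.index(i - 1)              # original predecessor
            edges.append((pos, pred, random.choice(REAL_LABELS)))
        edges.append((pos, random.randrange(n),    # fake edge: arbitrary position
                      random.choice(FAKE_LABELS)))
    return permuted, edges, original               # target output = original order

train = [make_example() for _ in range(25000)]
dev = [make_example() for _ in range(1000)]
print(train[0])
```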
Model. We use the BiRNN + GCN model, i.e. a bidirectional GRU with a 1-layer GCN on top. We use 32, 64 and 128 units for the embeddings, GRU units and GCN layers, respectively.

Results. After 6 epochs of training, the model learns to put permuted sequences back into order, reaching a validation BLEU of 99.2. Figure 3 shows that the mean values of the gate bias terms (i.e. b̂) for real and fake edges are far apart, suggesting that the GCN learns to distinguish them. Interestingly, this illustrates why edge-wise gating is beneficial. A gate-less model would not understand which of the two outgoing arcs is fake and which is genuine, because only the biases b would then be label-dependent. Consequently, it would only do a mediocre job at reordering. Although using label-specific matrices W would also help, this would not scale to the real scenario (see §2.3).

Figure 3: Mean values of gate bias terms for real (useful) labels and for fake (non-useful) labels suggest that the GCN learns to distinguish them. (Plot: mean gate bias over training steps ×1000, with separate curves for real and fake edges.)

4.2 Machine Translation

Data. For our experiments we use the En-De and En-Cs News Commentary v11 data from the WMT16 translation task.⁶ For En-De we also train on the full WMT16 data set. As our validation set and test set we use newstest2015 and newstest2016, respectively.

Pre-processing. The English sides of the corpora are tokenized and parsed into dependency trees by SyntaxNet,⁷ using the pre-trained Parsey McParseface model.⁸ The Czech and German sides are tokenized using the Moses tokenizer.⁹ Sentence pairs where either side is longer than 50 words are filtered out after tokenization.

⁶ http://www.statmt.org/wmt16/translation-task.html
⁷ https://github.com/tensorflow/models/tree/master/syntaxnet
⁸ The dependency parses used can be reproduced with the syntaxnet/demo.sh shell script.
⁹ https://github.com/moses-smt/mosesdecoder

Vocabularies. For the English sides, we construct vocabularies from all words except those with a training-set frequency smaller than three. For Czech and German, to deal with rare words and phenomena such as inflection and compounding, we learn byte-pair encodings (BPE) as described by Sennrich et al. (2016b). Given the size of our data set, and following Wu et al. (2016), we use 8000 BPE merges to obtain robust frequencies for our subword units (16000 merges for the full-data experiment). Data set statistics are summarized in Table 1 and vocabulary sizes in Table 2.

                        Train     Val.   Test
English-German         226822    2169   2999
English-German (full)  4500966   2169   2999
English-Czech          181112    2656   2999

Table 1: The number of sentences in our data sets.
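A sketch of the length filtering and the frequency-thresholded English vocabulary described above (our own helper code; the actual pipeline relies on SyntaxNet, the Moses tokenizer and BPE, which are not reproduced here).

```python
from collections import Counter

def filter_pairs(pairs, max_len=50):
    """Drop sentence pairs where either side is longer than max_len tokens."""
    return [(src, tgt) for src, tgt in pairs
            if len(src) <= max_len and len(tgt) <= max_len]

def build_vocab(sentences, min_freq=3):
    """English-side vocabulary: all words with training frequency >= min_freq."""
    counts = Counter(w for sent in sentences for w in sent)
    return {w for w, c in counts.items() if c >= min_freq}

# Toy usage with already-tokenized sentence pairs.
pairs = [(["the", "monkey", "eats", "a", "banana"],
          ["der", "affe", "isst", "eine", "banane"])] * 3
pairs = filter_pairs(pairs)
print(build_vocab([src for src, _ in pairs]))
```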
                        Source   Target
English-German          37824    8099 (BPE)
English-German (full)   50000    16000 (BPE)
English-Czech           33786    8116 (BPE)

Table 2: Vocabulary sizes.

Hyperparameters. We use 256 units for word embeddings, 512 units for GRUs (800 for the En-De full data set experiment), and 512 units for convolutional layers (or, equivalently, 512 ‘channels’). The dimensionality of the GCN layers is equivalent to the dimensionality of their input. We report results for 2-layer GCNs, as we find them most effective (see the ablation studies below).

Baselines. We provide three baselines, each with a different encoder: a bag-of-words encoder, a convolutional encoder with window size w = 5, and a BiRNN. See §2.1.1 for details.

Evaluation. We report (cased) BLEU results (Papineni et al., 2002) using multi-bleu, as well as Kendall τ reordering scores.¹⁰

¹⁰ See Stanojević and Simaan (2015). TER (Snover et al., 2006) and BEER (Stanojević and Sima’an, 2014) scores, though omitted due to space considerations, are consistent with the reported results.

4.2.1 Results

English-German. Table 3 shows test results for English-German. Unsurprisingly, the bag-of-words baseline performs worst. We expected the BoW+GCN model to make easy gains over this baseline, which is indeed what happens. The CNN baseline reaches a higher BLEU4 score than the BoW models, but interestingly its BLEU1 score is lower than that of the BoW+GCN model. The CNN+GCN model improves over the CNN baseline by +1.9 and +1.1 for BLEU1 and BLEU4, respectively. The BiRNN, the strongest baseline, reaches a BLEU4 of 14.9. Interestingly, GCNs still manage to improve the result by +2.3 BLEU1 and +1.2 BLEU4 points. Finally, we observe a big jump in BLEU4 from using the full data set and beam search (beam 12): the BiRNN now reaches 23.3, while adding a GCN achieves a score of 23.9.

                Kendall   BLEU1   BLEU4
BoW             0.3352    40.6     9.5
  + GCN         0.3520    44.9    12.2
CNN             0.3601    42.8    12.6
  + GCN         0.3777    44.7    13.7
BiRNN           0.3984    45.2    14.9
  + GCN         0.4089    47.5    16.1
BiRNN (full)    0.5440    53.0    23.3
  + GCN         0.5555    54.6    23.9

Table 3: Test results for English-German.

English-Czech. Table 4 shows test results for English-Czech. While it is difficult to obtain high absolute BLEU scores on this dataset, we can still see similar relative improvements. Again the BoW baseline scores worst, with BoW+GCN easily beating that result. The CNN baseline scores a BLEU4 of 8.1, but CNN+GCN improves on that, this time by +1.0 and +0.6 for BLEU1 and BLEU4, respectively. Interestingly, the BLEU1 scores of the BoW+GCN and CNN+GCN models are higher than both baselines so far. Finally, the BiRNN baseline scores a BLEU4 of 8.9, but it is again beaten by the BiRNN+GCN model, with +1.9 BLEU1 and +0.7 BLEU4.

             Kendall   BLEU1   BLEU4
BoW          0.2498    32.9     6.0
  + GCN      0.2561    35.4     7.5
CNN          0.2756    35.1     8.1
  + GCN      0.2850    36.1     8.7
BiRNN        0.2961    36.9     8.9
  + GCN      0.3046    38.8     9.6

Table 4: Test results for English-Czech.

Effect of GCN layers. How many GCN layers do we need? Every layer gives us an extra hop in the graph and expands the syntactic neighborhood of a word. Table 5 shows validation BLEU performance as a function of the number of GCN layers. For English-German, using a 1-layer GCN improves BLEU1, but surprisingly has little effect on BLEU4. Adding an additional layer gives improvements on both BLEU1 and BLEU4 of +1.3 and +0.73, respectively. For English-Czech, performance increases with each added GCN layer.

                    En-De            En-Cs
              BLEU1   BLEU4    BLEU1   BLEU4
BiRNN          44.2    14.1     37.8    8.9
  + GCN (1L)   45.0    14.1     38.3    9.6
  + GCN (2L)   46.3    14.8     39.6    9.9

Table 5: Validation BLEU for English-German and English-Czech for 1- and 2-layer GCNs.
Effect of sentence length. We hypothesize that GCNs should be more beneficial for longer sentences: these are likely to contain long-distance syntactic dependencies which may not be adequately captured by RNNs but are directly encoded in GCNs. To test this, we partition the validation data into five buckets and calculate BLEU for each of them. Figure 4 shows that GCN-based models outperform their respective baselines rather uniformly across all buckets. This is a surprising result. One explanation may be that syntactic parses are noisier for longer sentences, and this prevents us from obtaining extra improvements with GCNs.
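A possible implementation of such a length-bucketed evaluation is sketched below. It uses sacrebleu for convenience, whereas the paper reports multi-bleu scores, and the bucketing scheme (roughly equally populated buckets by source length) is our assumption.

```python
import sacrebleu

def bleu_by_length_bucket(sources, hypotheses, references, n_buckets=5):
    """Partition sentences into n_buckets by source length and score each bucket."""
    lengths = sorted(len(src.split()) for src in sources)
    # Upper length bound per bucket, chosen so buckets are roughly equally populated.
    bounds = [lengths[(i + 1) * len(lengths) // n_buckets - 1] for i in range(n_buckets)]
    scores, lo = [], 0
    for hi in bounds:
        idx = [j for j, src in enumerate(sources) if lo < len(src.split()) <= hi]
        if idx:
            bleu = sacrebleu.corpus_bleu([hypotheses[j] for j in idx],
                                         [[references[j] for j in idx]])
            scores.append((hi, round(bleu.score, 1)))
        lo = hi
    return scores

# e.g. bleu_by_length_bucket(val_sources, model_outputs, val_references)
```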
[Figure 4: Validation BLEU per sentence-length bucket for the CNN, CNN + GCN, BiRNN and BiRNN + GCN models.]

Hiero trees are not syntactically-aware, but instead constrained by symmetrized word alignments. Aharoni and Goldberg (2017) propose neural string-to-tree by predicting linearized parse trees.

Multi-task Learning. Sharing NMT parameters with a syntactic parser is a popular approach to obtaining syntactically-aware representations. Luong et al. (2015a) predict linearized constituency parses as an additional task. Eriguchi et al. (2017) multi-task with a target-side RNNG parser (Dyer et al., 2016) and improve on various language pairs with English on the target side. Nadejde et al. (2017) multi-task with CCG tagging, and also integrate syntax on the target side by predicting a sequence of words interleaved with CCG supertags.

Latent structure. Hashimoto and Tsuruoka (2017) add a syntax-inspired encoder on top of a BiLSTM layer. They encode source words as a learned average of potential parents, emulating a relaxed dependency tree. While their model is trained purely on translation data, they also experiment with pre-training the encoder using tree-
yond syntax, by using semantic annotations such as SRL and AMR, and co-reference chains.

Acknowledgments

We would like to thank Michael Schlichtkrull and Thomas Kipf for their suggestions and comments. This work was supported by the European Research Council (ERC StG BroadSem 678254) and the Dutch National Science Foundation (NWO VIDI 639.022.518, NWO VICI 277-89-002).

References

Roee Aharoni and Yoav Goldberg. 2017. Towards string-to-tree neural machine translation. ArXiv e-prints.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations (ICLR).

Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2012. Abstract meaning representation (AMR) 1.0 specification. In Conference on Empirical Methods in Natural Language Processing, pages 1533–1544.

David Chiang. 2010. Learning to translate with source and target syntax. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1443–1452, Uppsala, Sweden. Association for Computational Linguistics.

KyungHyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014a. On the properties of neural machine translation: Encoder-decoder approaches. In SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, volume abs/1409.1259, pages 103–111.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014b. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar. Association for Computational Linguistics.

Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 3837–3845.

David K. Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P. Adams. 2015. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pages 2224–2232.

Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2016. Recurrent neural network grammars. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 199–209, San Diego, California. Association for Computational Linguistics.

Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211.

Akiko Eriguchi, Kazuma Hashimoto, and Yoshimasa Tsuruoka. 2016. Tree-to-sequence attentional neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 823–833, Berlin, Germany. Association for Computational Linguistics.

Akiko Eriguchi, Yoshimasa Tsuruoka, and Kyunghyun Cho. 2017. Learning to parse and translate improves neural machine translation. ArXiv e-prints.

Jonas Gehring, Michael Auli, David Grangier, and Yann N. Dauphin. 2016. A convolutional encoder model for neural machine translation. CoRR, abs/1611.02344.

Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. 2017. Neural message passing for quantum chemistry. ArXiv e-prints.

Kazuma Hashimoto and Yoshimasa Tsuruoka. 2017. Neural machine translation with source-side latent graph parsing. CoRR, abs/1702.02265.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.

Jindřich Helcl and Jindřich Libovický. 2017. Neural Monkey: An open-source tool for sequence learning. The Prague Bulletin of Mathematical Linguistics, (107):5–17.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Ozan Irsoy and Claire Cardie. 2014. Opinion mining with deep recurrent neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 720–728, Doha, Qatar. Association for Computational Linguistics.
Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1700–1709, Seattle, Washington, USA.

Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aäron van den Oord, Alex Graves, and Koray Kavukcuoglu. 2016. Neural machine translation in linear time. CoRR, abs/1610.10099.

Steven Kearnes, Kevin McCloskey, Marc Berndl, Vijay Pande, and Patrick Riley. 2016. Molecular graph convolutions: moving beyond fingerprints. Journal of Computer-Aided Molecular Design, 30(8):595–608.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR.

Thomas N. Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. CoRR, abs/1609.02907.

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535.

Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2015a. Multi-task sequence to sequence learning. CoRR, abs/1511.06114.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015b. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal. Association for Computational Linguistics.

Diego Marcheggiani and Ivan Titov. 2017. Encoding sentences with graph convolutional networks for semantic role labeling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark. Association for Computational Linguistics.

Maria Nadejde, Siva Reddy, Rico Sennrich, Tomasz Dwojak, Marcin Junczys-Dowmunt, Philipp Koehn, and Alexandra Birch. 2017. Syntax-aware neural machine translation using CCG. ArXiv e-prints.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. Pages 311–318.

Michael Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. 2017. Modeling relational data with graph convolutional networks. ArXiv e-prints.

Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681.

Rico Sennrich and Barry Haddow. 2016. Linguistic input features improve neural machine translation. In Proceedings of the First Conference on Machine Translation (WMT16), volume abs/1606.02892.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Edinburgh neural machine translation systems for WMT 16. In Proceedings of the First Conference on Machine Translation, pages 371–376, Berlin, Germany. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Xing Shi, Inkit Padhi, and Kevin Knight. 2016. Does string-based neural MT learn source syntax? In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1526–1534, Austin, Texas. Association for Computational Linguistics.

David Smith and Jason Eisner. 2006. Quasi-synchronous grammars: Alignment by soft projection of syntactic dependencies. In Proceedings of the Workshop on Statistical Machine Translation, pages 23–30, New York City. Association for Computational Linguistics.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, pages 223–231.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of EMNLP.

Felix Stahlberg, Eva Hasler, Aurelien Waite, and Bill Byrne. 2016. Syntactically guided neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 299–305, Berlin, Germany. Association for Computational Linguistics.

Miloš Stanojević and Khalil Simaan. 2015. Evaluating MT systems with BEER. The Prague Bulletin of Mathematical Linguistics, 104(1):17–26.

Miloš Stanojević and Khalil Sima’an. 2014. Fitting sentence level translation evaluation with many dense features. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 202–206, Doha, Qatar. Association for Computational Linguistics.
Mihai Surdeanu, Richard Johansson, Adam Meyers, Lluís Màrquez, and Joakim Nivre. 2008. The CoNLL 2008 shared task on joint parsing of syntactic and semantic dependencies. In Proceedings of CoNLL.
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014.
   Sequence to Sequence Learning with Neural Net-
   works. In Neural Information Processing Systems
   (NIPS), pages 3104–3112.
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V.
  Le, Mohammad Norouzi, Wolfgang Macherey,
  Maxim Krikun, Yuan Cao, Qin Gao, Klaus
  Macherey, Jeff Klingner, Apurva Shah, Melvin
  Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan
  Gouws, Yoshikiyo Kato, Taku Kudo, Hideto
  Kazawa, Keith Stevens, George Kurian, Nishant
  Patil, Wei Wang, Cliff Young, Jason Smith, Jason
  Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado,
  Macduff Hughes, and Jeffrey Dean. 2016. Google’s
  neural machine translation system: Bridging the gap
  between human and machine translation. CoRR,
  abs/1609.08144.
Dani Yogatama, Phil Blunsom, Chris Dyer, Edward
  Grefenstette, and Wang Ling. 2016. Learning to
  compose words into sentences with reinforcement
  learning. CoRR, abs/1611.09100.
Andreas Zollmann and Ashish Venugopal. 2006. Syn-
  tax augmented machine translation via chart pars-
  ing. In Proceedings of the Workshop on Statistical
  Machine Translation, StatMT ’06, pages 138–141,
  Stroudsburg, PA, USA. Association for Computa-
  tional Linguistics.