Towards Linear Time Neural Machine Translation with Capsule Networks

Mingxuan Wang¹  Jun Xie²  Zhixing Tan²  Jinsong Su³  Deyi Xiong⁴  Lei Li¹
¹ ByteDance AI Lab, Beijing, China   {wangmingxuan.89,lileilab}@bytedance.com
² Mobile Internet Group, Tencent Technology Co., Ltd
³ Xiamen University, Xiamen, China
⁴ Tianjin University, Tianjin, China

arXiv:1811.00287v4 [cs.CL] 12 Oct 2020
Abstract

In this study, we investigate a novel capsule network with dynamic routing for linear-time Neural Machine Translation (NMT), referred to as CapsNMT. CapsNMT uses an aggregation mechanism to map the source sentence into a matrix of pre-determined size, and then applies a deep LSTM network to decode the target sequence from the source representation. Unlike previous work (Sutskever et al., 2014), which stores the source sentence in a passive, bottom-up way, the dynamic routing policy encodes the source sentence with an iterative process that decides the credit attribution between nodes from lower and higher layers. CapsNMT has two core properties: it runs in time that is linear in the length of the sequences, and it provides a more flexible way to aggregate the part-whole information of the source sentence. On the WMT14 English-German task and the larger WMT14 English-French task, CapsNMT achieves results comparable to the Transformer system. We also devise new hybrid architectures intended to combine the strengths of CapsNMT and the RNMT model. Our hybrid models obtain state-of-the-art results on both benchmark datasets. To the best of our knowledge, this is the first work in which capsule networks have been empirically investigated for sequence-to-sequence problems.¹

¹ The work was partially done when the first author worked at Tencent.

1 Introduction

Neural Machine Translation (NMT) is an end-to-end learning approach to machine translation which has recently shown promising results on multiple language pairs (Luong et al., 2015; Shen et al., 2015; Wu et al., 2016; Gehring et al., 2017a; Kalchbrenner et al., 2016; Sennrich et al., 2015; Vaswani et al., 2017). Unlike conventional Statistical Machine Translation (SMT) systems (Koehn et al., 2003; Chiang, 2005), which consist of multiple separately tuned components, NMT builds a single, large neural network that directly maps input text to the associated output text (Sutskever et al., 2014).

In general, there are several research lines of NMT architectures, among which the Enc-Dec NMT (Sutskever et al., 2014) and the Enc-Dec Att NMT (Bahdanau et al., 2014; Wu et al., 2016; Vaswani et al., 2017) are the typical representatives. The Enc-Dec represents the source input with a fixed-dimensional vector, from which the target sequence is generated word by word. The Enc-Dec, however, does not preserve the source sequence resolution, a limitation that aggravates learning for long sequences; the computational complexity of decoding is O(|S| + |T|), with |S| denoting the source sentence length and |T| the target sentence length. The Enc-Dec Att preserves the resolution of the source sentence, which frees the neural model from having to squash all the information into a fixed representation, but at the cost of quadratic running time: due to the attention mechanism, the computational complexity of decoding is O(|S| × |T|). This drawback grows more severe as the length of the sequences increases.

Currently, most work focuses on the Enc-Dec Att, while the Enc-Dec paradigm receives less emphasis despite its advantage of linear-time decoding (Kalchbrenner et al., 2016). The linear-time approach is appealing; however, its performance lags behind the Enc-Dec Att. One potential issue is that the Enc-Dec needs to compress all the necessary information of the input source sentence into context vectors which are fixed during decoding. Therefore, a natural question arises: will carefully designed aggregation operations help the Enc-Dec paradigm
to achieve the best performance?

In recent promising work on capsule networks, a dynamic routing policy has been proposed and proven effective (Sabour et al., 2017; Zhao et al., 2018; Gong et al., 2018). Following a similar spirit in using this technique, we present CapsNMT, which is characterized by a capsule encoder that addresses the drawbacks of the conventional linear-time approaches. The capsule encoder has the attractive potential to address the aggregation issue, and introduces an iterative routing policy to decide the credit attribution between nodes from lower (child) and higher (parent) layers. Three strategies are also proposed to stabilize the dynamic routing process. We empirically verify CapsNMT on the WMT14 English-German task and the larger WMT14 English-French task. CapsNMT achieves results comparable to the state-of-the-art Transformer systems. Our contributions can be summarized as follows:

• We propose a carefully designed linear-time CapsNMT which achieves comparable results to the Transformer framework. To the best of our knowledge, CapsNMT is the first work in which capsule networks have been empirically investigated for sequence-to-sequence problems.

• We propose several techniques to stabilize the dynamic routing process. We believe that these techniques should always be employed by capsule networks for the best performance.

Figure 1: CapsNMT: linear-time neural machine translation with a capsule encoder.

2 Linear Time Neural Machine Translation

From the perspective of machine learning, the task of linear-time translation can be formalized as learning the conditional distribution p(y|x) of a target sentence (translation) y given a source sentence x. Figure 1 gives the framework of our proposed linear-time NMT with a capsule encoder, which mainly consists of two components: a constant encoder that represents the input source sentence with a fixed number of vectors, and a decoder which leverages these vectors to generate the target-side translation. Please note that, due to the fixed-dimension representation produced by the encoder, the time complexity of our model is linear in the number of source words.

Constant Encoder with Aggregation Layers  Given the input source sentence x = x_1, x_2, ..., x_L, we stack the word embeddings into a matrix

    X = [x_1, x_2, ..., x_L] ∈ R^{L×d_x}.

The goal of the constant encoder is to transfer the inputs X into a representation of pre-determined size,

    C = [c_1, c_2, ..., c_M] ∈ R^{M×d_c},

where M < L is the pre-determined size of the encoder output and d_c is the dimension of the hidden states.

We first introduce a bi-directional LSTM (BiLSTM) as the primary-capsule layer to incorporate forward and backward context information of the input sentence:

    \overrightarrow{h}_t = \overrightarrow{\mathrm{LSTM}}(\overrightarrow{h}_{t-1}, x_t)
    \overleftarrow{h}_t = \overleftarrow{\mathrm{LSTM}}(\overleftarrow{h}_{t+1}, x_t)        (1)
    h_t = [\overrightarrow{h}_t, \overleftarrow{h}_t]

Here we produce the sentence-level encoding h_t of word x_t by concatenating the forward output vector \overrightarrow{h}_t and the backward output vector \overleftarrow{h}_t. Thus, the output of the BiLSTM encoder is a sequence of vectors H = [h_1, h_2, ..., h_L] corresponding to the input sequence.
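To make the shape bookkeeping of Eq. (1) concrete, the following NumPy sketch was written for this article (it is not the authors' code): the LSTM cell is reduced to its standard gated update, the weights are random, and the forward and backward states are simply concatenated per position.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x, h, c, W, U, b):
        # Standard LSTM cell: gates computed from the current input x and the previous state h.
        z = x @ W + h @ U + b
        d = h.shape[0]
        i, f, o, g = z[:d], z[d:2*d], z[2*d:3*d], z[3*d:]
        c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h_new = sigmoid(o) * np.tanh(c_new)
        return h_new, c_new

    def bilstm(X, params_fwd, params_bwd, d_h):
        # X: (L, d_x) embedded source sentence.  Returns H: (L, 2*d_h) as in Eq. (1).
        L = X.shape[0]
        h_f, c_f = np.zeros(d_h), np.zeros(d_h)
        h_b, c_b = np.zeros(d_h), np.zeros(d_h)
        fwd, bwd = [None] * L, [None] * L
        for t in range(L):                  # forward pass over x_1 .. x_L
            h_f, c_f = lstm_step(X[t], h_f, c_f, *params_fwd)
            fwd[t] = h_f
        for t in reversed(range(L)):        # backward pass over x_L .. x_1
            h_b, c_b = lstm_step(X[t], h_b, c_b, *params_bwd)
            bwd[t] = h_b
        return np.stack([np.concatenate([fwd[t], bwd[t]]) for t in range(L)])

    # Toy usage: L = 5 source positions, d_x = 8 embeddings, d_h = 16 per direction.
    rng = np.random.default_rng(0)
    d_x, d_h, L = 8, 16, 5
    make = lambda: (rng.normal(size=(d_x, 4 * d_h)) * 0.1,
                    rng.normal(size=(d_h, 4 * d_h)) * 0.1,
                    np.zeros(4 * d_h))
    H = bilstm(rng.normal(size=(L, d_x)), make(), make(), d_h)
    print(H.shape)   # (5, 32): one h_t of dimension 2*d_h per source word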
On top of the BiLSTM, we introduce several aggregation layers to map the variable-length
inputs into the compressed representation. To demonstrate the effectiveness of our encoder, we compare it with a widely used aggregation method based on max and average pooling.

Both max and average pooling are among the simplest ways to aggregate information; they require no additional parameters while being computationally efficient. Using these two operations, we perform pooling along the time dimension as follows:

    h_{max} = max([h_1, h_2, ..., h_L])                          (2)

    h_{avg} = \frac{1}{L} \sum_{i=1}^{L} h_i                     (3)

Moreover, because the last time step state h_L and the first time step state h_1 provide complementary information, we also exploit them to enrich the final output representation of the encoder, which thus consists of four vectors:

    C = [h_{max}, h_{avg}, h_1, h_L].                            (4)

The compressed representation C is fixed during the subsequent translation generation, so its quality directly affects the success of the Enc-Dec paradigm. In this work, we mainly focus on how to introduce aggregation layers built on capsule networks to accurately produce the compressed representation of input sentences; the details are provided in Section 3.
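For reference, the pooling baseline of Eqs. (2)-(4) fits in a few lines. This is an illustrative sketch with hypothetical variable names, not the released implementation; it turns the BiLSTM outputs H into the four-vector summary C.

    import numpy as np

    def simple_aggregation(H):
        # H: (L, d) BiLSTM outputs.  Returns C = [h_max, h_avg, h_1, h_L], Eqs. (2)-(4).
        h_max = H.max(axis=0)       # element-wise max over time, Eq. (2)
        h_avg = H.mean(axis=0)      # average over time, Eq. (3)
        return np.stack([h_max, h_avg, H[0], H[-1]])   # (4, d), Eq. (4)

    H = np.random.default_rng(1).normal(size=(7, 32))  # 7 source words, d = 32
    C = simple_aggregation(H)
    print(C.shape)   # (4, 32): a fixed-size encoding regardless of sentence length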
LSTM Decoder  The goal of the LSTM decoder is to estimate the conditional probability p(y_t | y_{<t}, x). The input to the decoder at each step is the concatenation of the previous target embedding and the source sentence representation W_c[c_1, ..., c_M], where W_c is a projection matrix. Since W_c[c_1, ..., c_M] is calculated in advance, the decoding time is linear in the target sentence length. At the inference stage, we only utilize the top-most hidden state s_t to make the final prediction with a softmax layer:

    p(y_t | y_{t-1}, x) = softmax(W_o s_t).                      (6)

Similar to (Vaswani et al., 2017), we also employ a residual connection (He et al., 2016) around each of the sub-layers, followed by layer normalization (Ba et al., 2016).
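The reason decoding stays linear is that the projected source summary W_c[c_1, ..., c_M] does not depend on the decoding step, so it can be computed once and reused. The sketch below is only illustrative: it stands in a single tanh recurrence for the paper's deep LSTM, uses random weights, and assumes a greedy loop and a BOS token id of 0; the point is that each target step costs one recurrence update and one softmax as in Eq. (6).

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    rng = np.random.default_rng(2)
    M, d_c, d_s, V = 6, 32, 64, 100          # capsules, capsule dim, decoder state, toy vocab
    C = rng.normal(size=(M, d_c))            # capsule encoding of the source, fixed during decoding
    W_c = rng.normal(size=(M * d_c, d_s)) * 0.05
    W_o = rng.normal(size=(d_s, V)) * 0.05
    E = rng.normal(size=(V, d_s)) * 0.05     # target embeddings (same dim as the state, for brevity)
    W_s = rng.normal(size=(2 * d_s, d_s)) * 0.05

    src = C.reshape(-1) @ W_c                # W_c[c_1,...,c_M]: computed once, reused at every step

    def decode_step(y_prev, s):
        # A single-layer tanh recurrence standing in for the deep LSTM decoder (illustrative only).
        s = np.tanh(np.concatenate([E[y_prev] + src, s]) @ W_s)
        return softmax(s @ W_o), s           # Eq. (6): p(y_t | y_<t, x) = softmax(W_o s_t)

    s, y = np.zeros(d_s), 0                  # start state and assumed BOS token id
    for t in range(5):                       # greedy decoding: cost grows linearly with |T|
        p, s = decode_step(y, s)
        y = int(p.argmax())
        print(t, y)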
3 Aggregation layers with Capsule Networks

Figure 2: Capsule encoder with dynamic routing by agreement.

The aggregation layers built on capsule networks play a crucial role in our model. As shown in Figure 2, unlike the traditional linear-time approaches, which collect information in a bottom-up way without considering the state of the final encoding, the capsule network can
automatically learn child-parent (or part-whole) relationships. Formally, u_{i→j} denotes the information to be transferred from the child capsule h_i to the parent capsule c_j:

    u_{i→j} = α_{ij} f(h_i, θ_j)                                 (7)

where α_{ij} ∈ R can be viewed as the voting weight on the information flow from the child capsule to the parent capsule, and f(h_i, θ_j) is the transformation function, for which we use a single-layer feed-forward neural network in this paper:

    f(h_i, θ_j) = ReLU(h_i W_j)                                  (8)

where W_j ∈ R^{d_c×d_c} is the transformation matrix corresponding to position j of the parent capsule.

Finally, the parent capsule aggregates all the incoming messages from all the child capsules:

    v_j = \sum_{i=1}^{L} u_{i→j}                                 (9)

and then squashes v_j so that its norm is confined to (0, 1). ReLU and similar non-linearities work well for single neurons; however, we find that the following squashing function works best for capsules. It squashes the length of the output vector of a capsule: short vectors are pushed towards 0, while long vectors have their norm limited to 1,

    c_j = squash(v_j) = \frac{||v_j||^2}{1 + ||v_j||^2} \cdot \frac{v_j}{||v_j||}        (10)
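Eq. (10) is straightforward to implement. The snippet below is a minimal NumPy sketch showing how the squashing non-linearity keeps the direction of v_j while mapping its norm into (0, 1), shrinking short vectors and saturating long ones.

    import numpy as np

    def squash(v, eps=1e-9):
        # Eq. (10): scale v_j by ||v_j||^2 / (1 + ||v_j||^2) along its own direction.
        norm = np.linalg.norm(v)
        return (norm ** 2 / (1.0 + norm ** 2)) * v / (norm + eps)

    short = squash(np.array([0.1, 0.0]))
    long_ = squash(np.array([100.0, 0.0]))
    print(np.linalg.norm(short))   # ~0.0099: short vectors are pushed towards 0
    print(np.linalg.norm(long_))   # ~0.9999: long vectors are capped just below 1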
3.2 Dynamic Routing by Agreement

The dynamic routing process is implemented via an EM-like iterative procedure that refines the coupling coefficient α_{ij}, which measures proportionally how much information is to be transferred from h_i to c_j.

At iteration t, the coupling coefficient α_{ij} is computed by

    α^t_{ij} = \frac{\exp(b^{t-1}_{ij})}{\sum_k \exp(b^{t-1}_{ik})}                      (11)

where \sum_j α^t_{ij} = 1. This ensures that all the information from the child capsule h_i is transferred to the parent capsules. The routing logits are updated by

    b^t_{ij} = b^{t-1}_{ij} + c^t_j · f(h^t_i, θ^t_j)                                    (12)

The coefficient b^t_{ij} is simply a temporary value that is iteratively updated from its value b^{t-1}_{ij} at the previous iteration plus the scalar product of the parent capsule output c_j and the transformed child input f(h_i, θ_j), which is essentially the similarity between the input to the capsule and the output from the capsule: a lower-level capsule sends more of its output to the higher-level capsules whose outputs agree with it. In particular, b^0_{ij} is initialized to 0. The coefficient depends on the location and type of both the child and the parent capsules and is refined iteratively. The capsule network can therefore increase or decrease the connection strength by dynamic routing, which is more effective than previous aggregation strategies such as max-pooling, which essentially detects whether a feature is present at any position of the text but loses spatial information about that feature.

Algorithm 1 Dynamic Routing Algorithm
 1: procedure ROUTING([h_1, h_2, ..., h_L], T)
 2:     Initialize b^0_{ij} ← 0
 3:     for each t ∈ range(0 : T) do
 4:         Compute the routing coefficients α^t_{ij} for all i ∈ [1, L], j ∈ [1, M]     ▷ From Eq. (11)
 5:         Update all the output capsules c^t_j for all j ∈ [1, M]                      ▷ From Eqs. (7)-(10)
 6:         Update all the coefficients b^t_{ij} for all i ∈ [1, L], j ∈ [1, M]          ▷ From Eq. (12)
 7:     end for
 8:     return [c_1, c_2, ..., c_M]
 9: end procedure

When an output capsule c^t_j receives the incoming messages u^t_{i→j}, its state is updated, and the coefficient α^t_{ij} is re-computed for the input capsule h^{t-1}_i. Thus, we iteratively refine the route of information flow towards an instance-dependent and context-aware encoding of a sequence. After the source input is encoded into M capsules, we map these capsules into a matrix representation by simply concatenating all of them:

    C = [c_1, c_2, ..., c_M]                                     (13)

Finally, the matrix C is fed to the end-to-end NMT model as the source sentence encoding.
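Putting Eqs. (7)-(12) and Algorithm 1 together, a full routing pass can be sketched as follows. This is an illustrative NumPy re-implementation written for this article rather than the authors' released code; it uses random weights, the shared-parameter variant of θ, and the squash of Eq. (10).

    import numpy as np

    def squash(v, eps=1e-9):
        norm = np.linalg.norm(v, axis=-1, keepdims=True)
        return (norm ** 2 / (1.0 + norm ** 2)) * v / (norm + eps)

    def softmax(z, axis=-1):
        e = np.exp(z - z.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def dynamic_routing(H, W, T=3):
        # H: (L, d) child capsules (BiLSTM states); W: (M, d, d) one transform per parent slot.
        # Returns C: (M, d) parent capsules, following Algorithm 1.
        L, d = H.shape
        M = W.shape[0]
        u = np.maximum(0.0, np.einsum('ld,mde->lme', H, W))   # Eq. (8): f(h_i, theta_j) = ReLU(h_i W_j)
        b = np.zeros((L, M))                                   # routing logits, b^0_ij = 0
        for _ in range(T):
            alpha = softmax(b, axis=1)                         # Eq. (11): normalise over parents j
            v = (alpha[:, :, None] * u).sum(axis=0)            # Eqs. (7) + (9): weighted message sum
            c = squash(v)                                      # Eq. (10): parent capsule outputs
            b = b + np.einsum('lmd,md->lm', u, c)              # Eq. (12): agreement update
        return c

    rng = np.random.default_rng(3)
    H = rng.normal(size=(9, 32))                   # 9 source words, d_c = 32
    W = rng.normal(size=(6, 32, 32)) * 0.05        # M = 6 parent capsules
    C = dynamic_routing(H, W, T=3)
    print(C.shape, np.linalg.norm(C, axis=1))      # (6, 32); every capsule norm lies in (0, 1)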
In this work, we also explore three strategies to improve the accuracy of the routing process.

Position-aware Routing Strategy  The routing process iteratively decides what and how much
information is to be sent to the final encoding, considering the state of both the final output capsules and the input capsules. In order to fully exploit the order of the child and parent capsules and capture the child-parent relationship more efficiently, we add a "positional encoding" to the representations of the child and parent capsules. In this way, some information can be injected according to the relative or absolute position of the capsules in the sequence. Many choices of positional encoding have been proposed for NLP tasks (Gehring et al., 2017b; Vaswani et al., 2017), and their experimental results demonstrate that adding positional information is more effective for text than for images, since sentences carry sequential information. In this work, we follow (Vaswani et al., 2017) and apply sine and cosine functions of different frequencies.
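A minimal sketch of the sinusoidal encoding of (Vaswani et al., 2017) reused by the position-aware routing strategy is given below. In this illustration (our own simplification, not the paper's exact code) the encoding is simply added to the child capsules, and an analogous table of M positions would be added on the parent side.

    import numpy as np

    def sinusoidal_positions(n, d):
        # Standard sine/cosine encoding: even dimensions get sin, odd dimensions get cos.
        pos = np.arange(n)[:, None]
        rates = 1.0 / np.power(10000.0, (2 * (np.arange(d)[None, :] // 2)) / d)
        angles = pos * rates
        pe = np.zeros((n, d))
        pe[:, 0::2] = np.sin(angles[:, 0::2])
        pe[:, 1::2] = np.cos(angles[:, 1::2])
        return pe

    L, M, d = 9, 6, 32
    H = np.random.default_rng(4).normal(size=(L, d))
    H_pos = H + sinusoidal_positions(L, d)   # position-aware child capsules
    parent_pe = sinusoidal_positions(M, d)   # added to the parent capsules in the same way
    print(H_pos.shape, parent_pe.shape)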
Non-sharing Weight Strategy  Besides, we explore two types of transformation matrices for generating the message vector u_{i→j} introduced in Eqs. (7)-(8). The first design shares the parameters θ across different routing iterations. In the second design, we replace the shared parameters with non-shared parameters θ^t, where t is the iteration step of the dynamic process. We compare the effects of these two strategies in our experiments. In preliminary experiments, we found that the non-shared weight strategy works slightly better than the shared one, which is consistent with (Liao and Poggio, 2016).
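The non-sharing weight strategy only changes how the transformation parameters are indexed. The toy sketch below (illustrative shapes only, random values) contrasts a single θ reused at every routing iteration with a per-iteration θ^t.

    import numpy as np

    rng = np.random.default_rng(5)
    M, d, T = 6, 32, 3

    # Shared strategy: one transformation matrix per parent capsule, reused at every iteration.
    theta_shared = rng.normal(size=(M, d, d)) * 0.05

    # Non-shared strategy: a separate set theta^t for each of the T routing iterations.
    theta_per_iter = rng.normal(size=(T, M, d, d)) * 0.05

    def transform(H, theta):
        # Eq. (8) applied to all children at once.
        return np.maximum(0.0, np.einsum('ld,mde->lme', H, theta))

    H = rng.normal(size=(9, d))
    u_shared = [transform(H, theta_shared) for t in range(T)]          # identical every iteration
    u_nonshared = [transform(H, theta_per_iter[t]) for t in range(T)]  # refreshed per iteration
    print(u_shared[0].shape, u_nonshared[0].shape)   # both (9, 6, 32)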
Separable Composition and Scoring Strategy  The most important idea behind capsule networks is to measure the similarity between the input and the output. This is often modeled as a dot product between the input and the output capsule, and the routing coefficient is updated correspondingly. Traditional capsule networks often resort to a straightforward strategy in which the "fusion" decisions (e.g., deciding the voting weight) are made based on the values of the feature maps. This is essentially a soft template matching (Lawrence et al., 1997), which works for tasks like classification but is undesirable for maintaining the composition functionality of capsules. Here, we propose to employ separate functional networks to take over the scoring duty, and let θ defined in Eq. (7) be responsible for composition. More specifically, we redefine the iterative scoring function of Eq. (12) as follows:

    b^t_{ij} = b^{t-1}_{ij} + g(c^t_j) · g(h^t_i).               (14)

Here g(·) is a fully connected feed-forward network, which consists of two linear transformations with a ReLU activation in between and is applied to each position separately.
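Under this strategy, the agreement score of Eq. (12) is replaced by Eq. (14): both the parent output and the child state are first passed through the small position-wise network g(·) before the dot product, while the θ transform of Eq. (7) keeps the composition role. A hedged sketch (randomly initialised g, same toy shapes as above; in the actual model g would of course be learned):

    import numpy as np

    rng = np.random.default_rng(6)
    d, d_ff = 32, 64
    W1, b1 = rng.normal(size=(d, d_ff)) * 0.05, np.zeros(d_ff)
    W2, b2 = rng.normal(size=(d_ff, d)) * 0.05, np.zeros(d)

    def g(x):
        # Two linear transformations with a ReLU in between, applied position-wise.
        return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

    def scoring_update(b, c, h):
        # Eq. (14): b^t_ij = b^(t-1)_ij + g(c_j) . g(h_i), replacing the raw dot product of Eq. (12).
        return b + g(h) @ g(c).T        # (L, d) x (d, M) -> (L, M)

    L, M = 9, 6
    b = np.zeros((L, M))
    c = rng.normal(size=(M, d))          # current parent capsules
    h = rng.normal(size=(L, d))          # child capsules
    print(scoring_update(b, c, h).shape)  # (9, 6)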
4 Experiments

4.1 Datasets

We mainly evaluated CapsNMT on the widely used WMT English-German and English-French translation tasks. The evaluation metric is BLEU. We tokenized the references and evaluated the performance with multi-bleu.pl; the metrics are exactly the same as in previous work (Papineni et al., 2002).

For English-German, to compare with the results reported in previous work, we used the same subset of the WMT 2014 training corpus that contains 4.5M sentence pairs with 91M English words and 87M German words. The concatenation of news-test 2012 and news-test 2013 is used as the validation set and news-test 2014 as the test set.

To evaluate at scale, we also report results on English-French. To compare with the results reported by previous work on end-to-end NMT, we used the same subset of the WMT 2014 training corpus that contains 36M sentence pairs. The concatenation of news-test 2012 and news-test 2013 serves as the validation set and news-test 2014 as the test set.

4.2 Training details

Our training procedure and hyper-parameter choices are similar to those used by (Vaswani et al., 2017). In more detail, for both English-German and English-French translation, we use a vocabulary of 50K sub-word tokens based on Byte Pair Encoding (Sennrich et al., 2015). We batched sentence pairs by approximate length and limited input and output tokens per batch to 4096 per GPU.

During training, we employed label smoothing of value ε = 0.1 (Pereyra et al., 2017). We used a beam width of 8 and length penalty α = 0.8 in all the experiments. The dropout rate was set to 0.3 for the English-German task and 0.1 for the English-French task. Except when otherwise mentioned, the NMT systems had 4 encoder layers followed by a capsule layer, and 3 decoder layers. We trained for 300,000 steps on 8 M40 GPUs and averaged the last 50 checkpoints, saved at 30-minute intervals. For our base model, the dimensions of all the hidden states were set to 512, and for the big model the dimensions were set to 1024. The capsule number is set to 6.
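Label smoothing with ε = 0.1 redistributes a small amount of probability mass from the gold token to the rest of the vocabulary. The sketch below is a stand-alone NumPy version of one common variant of the smoothing scheme (not tied to the training code) and computes the smoothed cross-entropy for a single target position.

    import numpy as np

    def smoothed_cross_entropy(logits, gold, eps=0.1):
        # Smoothed target distribution: 1 - eps on the gold token,
        # eps spread uniformly over the remaining vocabulary entries.
        V = logits.shape[0]
        target = np.full(V, eps / (V - 1))
        target[gold] = 1.0 - eps
        log_probs = logits - np.log(np.exp(logits - logits.max()).sum()) - logits.max()
        return -(target * log_probs).sum()

    logits = np.random.default_rng(7).normal(size=100)   # toy vocabulary of 100 types
    print(smoothed_cross_entropy(logits, gold=42, eps=0.1))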
Sequence-level knowledge distillation is applied to alleviate multimodality in the training dataset, using the state-of-the-art Transformer base models as the teachers (Kim and Rush, 2016). We decode the entire training set once using the teacher to create a new training dataset for its respective student.
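Sequence-level knowledge distillation, as described above, amounts to a data preprocessing step: the teacher re-translates every source sentence once, and the student is trained on these teacher outputs instead of the original references. The sketch below is purely illustrative; teacher_translate and train_student are hypothetical placeholders, not functions from any released toolkit.

    def build_distillation_corpus(source_sentences, teacher_translate):
        # Decode the entire training set once with the teacher (beam search in practice)
        # and pair each source sentence with the teacher's output as the new target.
        return [(src, teacher_translate(src)) for src in source_sentences]

    # Hypothetical usage:
    #   distilled = build_distillation_corpus(train_sources, transformer_teacher.translate)
    #   train_student(caps_nmt_model, distilled)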
4.3 Results on English-German and English-French Translation

 SYSTEM                        Architecture                 Time             EN-Fr BLEU   EN-DE BLEU
 Buck et al. (2014)            Winning WMT14                -                35.7         20.7
 Existing Enc-Dec Att NMT systems
 Wu et al. (2016)              GNMT + Ensemble              |S||T|           40.4         26.3
 Gehring et al. (2017a)        ConvS2S                      |S||T|           40.5         25.2
 Vaswani et al. (2017)         Transformer (base)           |S||T|+|T||T|    38.1         27.3
 Vaswani et al. (2017)         Transformer (large)          |S||T|+|T||T|    41.0         27.9
 Existing Enc-Dec NMT systems
 Luong et al. (2015)           Reverse Enc-Dec              |S|+|T|          -            14.0
 Sutskever et al. (2014)       Reverse stack Enc-Dec        |S|+|T|          30.6         -
 Zhou et al. (2016)            Deep Enc-Dec                 |S|+|T|          36.3         20.6
 Kalchbrenner et al. (2016)    ByteNet                      c|S|+c|T|        -            23.7
 CapsNMT systems
 Base Model                    Simple Aggregation           c|S|+c|T|        37.1         21.3
 Base Model                    CapsNMT                      c|S|+c|T|        39.6         25.7
 Big Model                     CapsNMT                      c|S|+c|T|        40.0         26.4
 Base Model                    Simple Aggregation + KD      c|S|+c|T|        38.6         23.4
 Base Model                    CapsNMT + KD                 c|S|+c|T|        40.4         26.9
 Big Model                     CapsNMT + KD                 c|S|+c|T|        40.6         27.6

Table 1: Case-sensitive BLEU scores on English-German and English-French translation. KD indicates knowledge distillation (Kim and Rush, 2016).

The results on English-German and English-French translation are presented in Table 1. We compare CapsNMT with various other systems, including the winning system of WMT'14 (Buck et al., 2014), a phrase-based system whose language models were trained on a huge monolingual text, the Common Crawl corpus. Among Enc-Dec Att systems, to the best of our knowledge, GNMT is the best RNN-based NMT system, and Transformer (Vaswani et al., 2017) is currently the state-of-the-art system, which is about 2.0 BLEU points better than GNMT on the English-German task and 0.6 BLEU points better than GNMT on the English-French task. For Enc-Dec NMT, ByteNet is the previous state-of-the-art system, with 150 convolutional encoder layers and 150 convolutional decoder layers.

On the English-to-German task, our big CapsNMT achieves the highest BLEU score among all the Enc-Dec approaches and even outperforms ByteNet, a relatively strong competitor, by +3.9 BLEU. On the larger English-French task, we achieve a BLEU score comparable to the big Transformer, with a gap of only 0.4 BLEU. To show the power of the capsule encoder, we also compare against the simple aggregation version of the Enc-Dec model; the capsule encoder again yields a gain of +1.8 BLEU on the English-German task and +3.5 BLEU on the English-French task for the base model. The improvement is consistent with our intuition that the dynamic routing policy is more effective than the simple aggregation method. It is also worth noting that, for the base model, the capsule encoder approach obtains an improvement of +2.3 BLEU over the base Transformer on the English-French task. Knowledge distillation also helps to bridge the performance gap between CapsNMT and the state-of-the-art Transformer model.
The Time column of Table 1 indicates the time complexity of decoding for each network as a function of the length of the sequences. ByteNet and the RNN Encoder-Decoder are the only networks with linear running time (up to the constant c). The RNN Enc-Dec, however, does not preserve the source sequence resolution, a limitation that aggravates learning for long sequences. The Enc-Dec Att does preserve the resolution, but at the cost of a quadratic running time. ByteNet overcomes these problems with convolutional neural networks; however, the architecture must be deep enough to capture the global information of a sentence. The capsule encoder instead makes use of the dynamic routing policy to automatically learn the part-whole relationship and encode the source sentence into a fixed-size representation. With the capsule encoder, CapsNMT keeps a linear running time, and the constant c is the capsule number, which is set to 6 in our main experiments.
4.4 Ablation Experiments

 Model                                    BLEU
 Base CapsNMT                             24.7
 + Non-weight sharing                     25.1
 + Position-aware Routing Policy          25.3
 + Separable Composition and Scoring      25.7
 + Knowledge Distillation                 26.9

Table 2: English-German task: ablation experiments for the different techniques.

In this section, we evaluate the importance of our main techniques for training CapsNMT. We believe that these techniques are universally applicable across different NLP tasks and should always be employed by capsule networks for the best performance. From Table 2 we draw the following conclusions:
• Non-weight sharing strategy: We observed that the non-weight sharing strategy improves the baseline model, leading to an increase of 0.4 BLEU.

• Position-aware routing strategy: Adding the position embedding to the child and parent capsules obtains an improvement of 0.2 BLEU.

• Separable composition and scoring strategy: Redefining the dot-product scoring function contributes significantly to the quality of the model, resulting in an increase of 0.4 BLEU.

• Knowledge distillation: Sequence-level knowledge distillation still achieves an increase of +1.2 BLEU.

4.5 Model Analysis

In this section, we study the properties of CapsNMT in more detail.

Effects of Iterative Routing  We study how the iteration number affects the performance of aggregation on the English-German task. Figure 3 shows the comparison of 2-4 iterations in the dynamic routing process, with the capsule number set to 2, 4, 6 and 8 for each comparison respectively. We found that the performance for the different capsule number settings reaches its best when the iteration number is set to 3. The results indicate that dynamic routing contributes to improving the performance, and a larger capsule number often leads to better results.

Figure 3: Effects of iterative routing with different capsule numbers (BLEU against the number of routing iterations, with curves for 2, 4, 6 and 8 capsules).

 Model          Num    Latency (ms)
 Transformer    -      225
 CapsNMT        4      146
 CapsNMT        6      153
 CapsNMT        8      168

Table 3: Time required for decoding with the base model. Num indicates the capsule number. Latency indicates the amount of time in milliseconds required for translating one sentence, averaged over the whole English-German newstest2014 dataset.

Analysis on Decoding Speed  We show the decoding speed of both the Transformer and
CapsNMT in Table 3. The results empirically demonstrate that CapsNMT can improve the decoding speed of the Transformer approach by +50%.

Performance on Long Sentences  A more detailed comparison between CapsNMT and the Transformer can be seen in Figure 4. In particular, we test the BLEU scores on sentences longer than {0, 10, 20, 30, 40, 50} words. We were surprised to discover that the capsule encoder does well on medium-length sentences: there is no degradation on sentences with fewer than 40 words; however, there is still a gap on the longest sentences. A deeper capsule encoder could potentially help to address this degradation, and we leave this to future work.

Figure 4: The plot shows the performance of our system and the Transformer as a function of sentence length, where the x-axis corresponds to the test sentences sorted by their length.

[The sentence "Orlando Bloom and Miranda Kerr still love each other", shown four times with different per-word color densities.]

Table 4: A visualization to show the perspective of a sentence from 4 different capsules at the third iteration.

Visualization  We visualize how much information each child capsule sends to the parent capsules. As shown in Table 4, the color density of each word denotes the coefficient α_{ij} at iteration 3 in Eq. (11). At the first iteration, α_{ij} follows a uniform distribution since b_{ij} is initialized to 0; α_{ij} is then iteratively fitted by the dynamic routing policy. It is appealing to find that after 3 iterations, the distribution of the voting weights α_{ij} converges to a sharp distribution, with values very close to 0 or 1. It is also worth mentioning that the capsules seem able to capture some structural information: for example, the information of the phrase "still love each other" is sent to the same capsule. We will make further exploration of this in future work.

5 Related Work

Linear Time Neural Machine Translation  Several papers have proposed using neural networks to directly learn the conditional distribution from a parallel corpus (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Cho et al., 2014; Kalchbrenner et al., 2016). In (Sutskever et al., 2014), an RNN is used to encode a source sentence and, starting from the last hidden state, to decode a target sentence. Unlike the RNN-based approach, Kalchbrenner et al. (2016) propose ByteNet, which makes use of convolutional networks to build a linear-time NMT system. Unlike the previous work, CapsNMT encodes the source sentence with an iterative process that decides the credit attribution between nodes from lower and higher layers.

Capsule Networks for NLP  Currently, much attention has been paid to developing sophisticated encoding models that capture long- and short-term dependency information in a sequence. Gong et al. (2018) propose an aggregation mechanism to obtain a fixed-size encoding with a dynamic routing policy. Zhao et al. (2018) explore capsule networks with dynamic routing for multi-task learning and achieve the best performance on six text classification benchmarks. Wang et al. (2018) propose RNN-Capsule, a capsule model based on Recurrent Neural Networks (RNNs) for sentiment analysis. To the best of our knowledge, CapsNMT is the first work in which capsule networks have been empirically investigated for sequence-to-sequence problems.

6 Conclusion

We have introduced CapsNMT for linear-time NMT. Three strategies were proposed to boost the performance of the dynamic routing process. The empirical results show that CapsNMT can achieve state-of-the-art results with better decoding latency on several benchmark datasets.
References

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Christian Buck, Kenneth Heafield, and Bas Van Ooyen. 2014. N-gram counts and language models from the common crawl. In LREC, volume 2, page 4. Citeseer.

David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 263–270. Association for Computational Linguistics.

Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017a. Convolutional sequence to sequence learning. CoRR, abs/1705.03122.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017b. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122.

Jingjing Gong, Xipeng Qiu, Shaojing Wang, and Xuanjing Huang. 2018. Information aggregation via dynamic routing for sequence encoding. In International Conference on Computational Linguistics.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1700–1709.

Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. 2016. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099.

Yoon Kim and Alexander M Rush. 2016. Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 48–54. Association for Computational Linguistics.

Steve Lawrence, C Lee Giles, Ah Chung Tsoi, and Andrew D Back. 1997. Face recognition: A convolutional neural-network approach. IEEE Transactions on Neural Networks, 8(1):98–113.

Qianli Liao and Tomaso Poggio. 2016. Bridging the gaps between residual learning, recurrent neural networks and visual cortex. arXiv preprint arXiv:1604.03640.

Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.

Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, and Geoffrey Hinton. 2017. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548.

Sara Sabour, Nicholas Frosst, and Geoffrey E Hinton. 2017. Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pages 3856–3866.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. CoRR, abs/1508.07909.

Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2015. Minimum risk training for neural machine translation. arXiv preprint arXiv:1512.02433.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010.

Yequan Wang, Aixin Sun, Jialong Han, Ying Liu, and Xiaoyan Zhu. 2018. Sentiment analysis by capsules. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, pages 1165–1174. International World Wide Web Conferences Steering Committee.
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V
  Le, Mohammad Norouzi, Wolfgang Macherey,
  Maxim Krikun, Yuan Cao, Qin Gao, Klaus
  Macherey, et al. 2016.       Google’s neural ma-
  chine translation system: Bridging the gap between
  human and machine translation. arXiv preprint
  arXiv:1609.08144.
Wei Zhao, Jianbo Ye, Min Yang, Zeyang Lei, Suofei
  Zhang, and Zhou Zhao. 2018. Investigating capsule
  networks with dynamic routing for text classifica-
  tion. arXiv preprint.
Jie Zhou, Ying Cao, Xuguang Wang, Peng Li, and Wei
   Xu. 2016. Deep recurrent models with fast-forward
   connections for neural machine translation. arXiv
   preprint arXiv:1606.04199.