Nested-Wasserstein Self-Imitation Learning for Sequence Generation


Ruiyi Zhang1    Changyou Chen2    Zhe Gan3    Zheng Wen4    Wenlin Wang1    Lawrence Carin1
1 Duke University    2 University at Buffalo    3 Microsoft Dynamics 365 AI Research    4 DeepMind
ryzhang@cs.duke.edu

arXiv:2001.06944v1 [cs.CL] 20 Jan 2020

Abstract

Reinforcement learning (RL) has been widely studied for improving sequence-generation models. However, the conventional rewards used for RL training typically cannot capture sufficient semantic information and therefore render model bias. Further, the sparse and delayed rewards make RL exploration inefficient. To alleviate these issues, we propose the concept of nested-Wasserstein distance for distributional semantic matching. To further exploit it, a novel nested-Wasserstein self-imitation learning framework is developed, encouraging the model to exploit historical high-reward sequences for enhanced exploration and better semantic matching. Our solution can be understood as approximately executing proximal policy optimization with Wasserstein trust-regions. Experiments on a variety of unconditional and conditional sequence-generation tasks demonstrate that the proposed approach consistently leads to improved performance.

Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS) 2020, Palermo, Italy. PMLR: Volume 108. Copyright 2020 by the author(s).

1    Introduction

Sequence generation is an important research topic in machine learning, covering a wide range of applications, including machine translation [Bahdanau et al., 2015, Cho et al., 2014, Sutskever et al., 2014], image captioning [Anderson et al., 2017, Vinyals et al., 2015, Xu et al., 2015], and text summarization [Paulus et al., 2017, Rush et al., 2015]. Standard sequence generation follows an auto-regressive model design under maximum likelihood estimation (MLE) learning [Huszár, 2015, Sutskever et al., 2014, Wiseman and Rush, 2016]. That is, models are trained to maximize the expected log-likelihood of the next word conditioned on its preceding ground-truth partial sentence. However, when testing, the generated partial sequence is fed to the generator to draw the next token. Such a discrepancy between training and testing, commonly known as exposure bias, leads to accumulated approximation errors along the sequence-generation trajectory [Bengio et al., 2015, Ranzato et al., 2016].

To address exposure bias, reinforcement learning (RL) techniques have been introduced [Ranzato et al., 2016]. Unlike MLE, which only leverages training examples, RL can also exploit samples drawn from the current policy. Improvements are gained by reinforcing the training towards more-plausible generations, typically based on a user-specified reward function [Ranzato et al., 2016, Yu et al., 2017]. However, the manually designed rewards often target specific desirable properties in sequence generation (e.g., matching n-gram overlap between generated sequences and ground-truth references), which unintentionally induces extra bias and is often criticized as a bad proxy for human evaluation [Wang et al., 2018a, Hu et al., 2019]. Concerns have also been raised w.r.t. efficient exploration in sequence generation. In existing RL-based methods for sequence generation [Bahdanau et al., 2017, Ranzato et al., 2016, Rennie et al., 2016], all experiences are treated as equivalent. However, merely relying on policy samples to explore often leads to forgetting a high-reward trajectory, unless it can be re-sampled frequently [Liang et al., 2018]. This problem becomes severe in the sparse-reward setting of sequence generation, i.e., the reward is only available after the whole sentence is generated.

Motivated by the above observations, we present a novel nested-Wasserstein Self-Imitation Learning (WSIL) framework for sequence generation. Specifically, we propose the nested-Wasserstein distance, a generalization of the Wasserstein distance, and exploit it to measure the distance between the behavior policy and the artificial policy defined by the replay buffer, to encourage self-imitation.
The nested-Wasserstein distance is well suited for distributional semantic matching between two (sequence) distributions whose samples are themselves discrete distributions, as in the case of sequence generation. The proposed WSIL is inspired by and derived from policy optimization with Wasserstein trust-regions [Zhang et al., 2018b]. It provides a novel reward function to match the generated sequences with the high-reward sequences in the replay buffer, encouraging distributional semantic matching rather than simple n-gram overlapping.

The main contributions of this paper are summarized as follows. (i) A novel nested-Wasserstein self-imitation learning framework is developed for sequence generation, exploiting historical good explorations for better future exploration. (ii) A novel nested-Wasserstein distance is introduced for sequence generation via distributional semantic matching, effectively alleviating the model training bias imposed by conventional rewards. (iii) Extensive empirical evaluation is performed on both unconditional and conditional text-generation tasks, demonstrating consistent performance improvement over existing state-of-the-art approaches.

2    Background

Sequence-generation model  We consider the problem of discrete sequence generation, which learns to generate a sequence Y = (y1, . . . , yT) ∈ Y of length T, possibly conditioned on context X. Here each yt is a token from vocabulary A. Pairs (X, Y) are used for training a sequence-generation model. We are particularly interested in applications to text generation, where Y is a sentence and each yt is a word. Starting from the initial state s0, a recurrent neural network (RNN) produces a sequence of states (s1, . . . , sT) given an input sequence-feature representation (e(y1), . . . , e(yT)), where e(·) denotes a word embedding mapping a token to its d-dimensional feature representation. The states are recursively updated with a function known as the cell: st = hθ(st−1, e(yt)), where θ denotes the model parameters. Popular implementations include Long Short-Term Memory (LSTM) [Hochreiter and Schmidhuber, 1997] and the Gated Recurrent Unit (GRU) [Cho et al., 2014]. To generate a sequence Y^s from a (trained) model, one iteratively applies the following operations:

    y^s_{t+1} ∼ Multi(softmax(g(st))) ,        (1)
    st = h(st−1, e(y^s_t)) ,                   (2)

where Multi(·) denotes a multinomial distribution. In conditional generation, s0 is initialized with Enc(X), where Enc(·) encodes the relevant information from the context [Bahdanau et al., 2017, Cho et al., 2014]. For unconditional generation, one typically draws s0 from a standard Gaussian distribution.
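For concreteness, a minimal sketch of this sampling loop is given below. It is illustrative only (PyTorch, with hypothetical module names), not the implementation used in the paper.

```python
# Minimal sketch of the sampling loop in Eqs. (1)-(2), assuming a GRU cell,
# a word-embedding table e(.), and an output projection g(.).
import torch
import torch.nn as nn

vocab_size, emb_dim, hid_dim, T = 5000, 128, 256, 20
embed = nn.Embedding(vocab_size, emb_dim)      # e(.)
cell = nn.GRUCell(emb_dim, hid_dim)            # h_theta(., .)
proj = nn.Linear(hid_dim, vocab_size)          # g(.)

def sample_sequence(s0, bos_id=0):
    """Draw Y^s = (y_1, ..., y_T) autoregressively from the current policy."""
    s = s0
    y = torch.full((s0.size(0),), bos_id, dtype=torch.long)
    tokens = []
    for _ in range(T):
        s = cell(embed(y), s)                           # Eq. (2): state update
        probs = torch.softmax(proj(s), dim=-1)          # Eq. (1): next-token distribution
        y = torch.multinomial(probs, num_samples=1).squeeze(-1)
        tokens.append(y)
    return torch.stack(tokens, dim=1)                   # (batch, T)

# Unconditional generation: s0 drawn from a standard Gaussian, as in the text.
Y_s = sample_sequence(torch.randn(4, hid_dim))
```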
Sequence generation as an RL problem  Sequence generation can be considered as an RL problem with deterministic state transitions and sparse rewards. It can be formulated as a Markov decision process (MDP) M = ⟨S, A, P, r⟩, where S is the state space, A is the action space, P is the deterministic environment dynamics, and r(s, y) is a reward function. The policy πθ, parameterized by θ, maps each state s ∈ S to a probability distribution over A. The objective is to maximize the expected reward, defined as:

    J(πθ) = E_{Y∼πθ}[r(Y)] = Σ_{t=1}^{T} E_{(st,yt)∼πθ}[r(st, yt)] ,        (3)

where Y ≜ (s1, y1, · · · , sT, yT) is a trajectory from policy πθ with yt ∈ A, r(Y) represents the reward for a sentence Y, and r(st, yt) is the step-wise reward. RL seeks to learn an optimal policy that maximizes the expected total reward J(πθ).
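A minimal score-function (REINFORCE) estimator for maximizing (3) can be sketched as follows; the names are illustrative, and the WSIL-I variant introduced in Section 4 extends this update.

```python
# Minimal REINFORCE sketch for maximizing J(pi_theta) in Eq. (3); the names
# (policy_log_prob, sentence_reward) are placeholders, not the authors' API.
import torch

def reinforce_loss(policy_log_prob, sentence_reward, baseline=0.0):
    """policy_log_prob: log pi_theta(Y) for each sampled sentence, shape (batch,)
       sentence_reward: sparse terminal reward r(Y), shape (batch,)"""
    advantage = sentence_reward - baseline
    # Negative sign: minimizing this loss ascends the policy-gradient estimate
    # grad J ~ E[(r(Y) - b) * grad log pi_theta(Y)].
    return -(advantage.detach() * policy_log_prob).mean()

loss = reinforce_loss(torch.randn(4, requires_grad=True), torch.rand(4))
loss.backward()
```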
Optimal transport on discrete domains  The optimal transport (OT) distance Wc(µ, ν) is a discrepancy score that measures the distance between two probability distributions µ(·) and ν(·) w.r.t. a cost function c(·, ·). Specifically, we consider two discrete distributions µ ≜ Σ_{i=1}^{n} ui δ_{zi} and ν ≜ Σ_{j=1}^{m} vj δ_{z′j}, with δz the Dirac delta function centered on z. The weight vectors u = {ui}_{i=1}^{n} ∈ ∆n and v = {vj}_{j=1}^{m} ∈ ∆m respectively belong to the n- and m-dimensional simplex, i.e., Σ_{i=1}^{n} ui = Σ_{j=1}^{m} vj = 1. Accordingly, the Wasserstein distance is equivalent to solving the following minimization problem:

    Wc(µ, ν) = min_{T∈Γ(µ,ν)} Σ_{i=1}^{n} Σ_{j=1}^{m} Tij · c(zi, z′j) = min_{T∈Γ(µ,ν)} ⟨T, C⟩ ,        (4)

where the constraint set Γ(µ, ν) requires Σ_{j=1}^{m} Tij = ui and Σ_{i=1}^{n} Tij = vj, ⟨·, ·⟩ represents the Frobenius dot-product, and C is the cost matrix defined by Cij = c(zi, z′j). Intuitively, the OT distance is the minimal cost of transporting mass from µ to ν.
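For intuition, the entropy-regularized approximation of (4) via Sinkhorn iterations [Cuturi, 2013] can be sketched as below. Note that the paper itself uses the IPOT solver described in Section 3, so this block is only an illustrative stand-in with hypothetical names.

```python
# Illustrative entropy-regularized approximation of Eq. (4) via Sinkhorn iterations.
import numpy as np

def sinkhorn_ot(C, u, v, eps=0.05, n_iter=200):
    """C: (n, m) cost matrix; u, v: marginal weights on the simplex."""
    K = np.exp(-C / eps)                 # Gibbs kernel
    a, b = np.ones_like(u), np.ones_like(v)
    for _ in range(n_iter):
        a = u / (K @ b)                  # enforce row marginals    sum_j Tij = u_i
        b = v / (K.T @ a)                # enforce column marginals sum_i Tij = v_j
    T = a[:, None] * K * b[None, :]      # transport plan
    return T, float(np.sum(T * C))       # plan and approximate <T, C>

n, m = 4, 6
C = np.abs(np.random.randn(n, m))
T, cost = sinkhorn_ot(C, np.full(n, 1 / n), np.full(m, 1 / m))
```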
3    Distributional Semantic Matching

We first consider evaluating a sentence from syntactic and semantic perspectives. Conventional metric rewards (e.g., BLEU) can capture the syntactic structure better, where the exact matching of words (or short phrases) to the reference sequences is encouraged, which induces strong bias in many cases. As such, we focus on semantic matching and propose the nested-Wasserstein distance, which defines the distance between two sequence distributions. The nested-Wasserstein distance provides a natural way to manifest semantic matching, compared with the conventional rewards used in existing RL-based sequence models. Alternatively, we can train a discriminator to learn the reward model, but empirically it only rewards high-quality generations, even though they may be characterized by mode collapse [He et al., 2019]. This undermines diversity, an important aspect in evaluation.
[Figure 1: Illustration of the nested-Wasserstein distance (Wnc) over distributions of sequences (PY): (A, ccos) word-level, (Y, Wc) sequence-level, (PY, Wnc) distribution-level. The distance is defined in a nested manner to measure the distance between sequence distributions; ccos is the word ground metric and Wc is the sequence ground metric.]

    Candidate 1 (C1):  There are six freshmen reading papers .
    Reference:         There are six freshmen playing football .
    Candidate 2 (C2):  Six freshmen are playing soccer .

          BLEU    ROUGE-L    CIDEr    Naive    Wasserstein
    C1    36.8     50.0      163.7     84.1       76.3
    C2     0.0     35.8       55.9     42.5       80.1

Table 1: Comparison of different rewards at the sequence level (higher is better). The example above illustrates the Wasserstein reward when comparing two candidate sentences with a reference sentence, which automatically matches semantically similar words. Dominant edges (shown in dark blue in the original figure) are determined by the optimal transport matrix T.
To better understand the issue, consider the example on sequence matching in Table 1. One can also use a naive way of semantic matching, i.e., measuring a distance between averaged word embeddings. It is clear that while the first candidate sentence has a similar syntactic structure to the reference, the second candidate sentence is more semantically consistent with the reference. However, popular hard-matching metrics [Papineni et al., 2002, Vedantam et al., 2015] and the naive method consistently indicate the first candidate is a better match to the reference. This contradiction can be alleviated if the reward metric is more semantic-aware. So motivated, the remainder of this section is devoted to a discussion of the design and implementation of Wasserstein rewards. The general idea is to match semantic features by minimizing the Wasserstein distance between hypothesis sentences and their references in the semantic space. A nested version of the Wasserstein distance arises when integrating the distributional semantic matching into the objective of sequence-distribution matching.

Definition 1 (Wasserstein Distance between Sequence Pairs)  Consider sequence Y = (y1, . . . , yT) as a discrete distribution pY = (1/T) Σ_t δ_{e(yt)} in the semantic space, with a length-normalized point mass placed at the word embedding zt = e(yt) of each token yt from the sequence Y. Given a hypothesis sequence Y w.r.t. a reference sequence Y′, we define the Wasserstein distance as Wc(pY, pY′) ≜ min_T ⟨T, C⟩ between pY and pY′ with cost c(z, z′). When the cosine distance ccos(z, z′) = 1 − z⊤z′ / (‖z‖2 ‖z′‖2) is used as the cost, we define the Wasserstein reward as rs(Y, Y′) ≜ ⟨T∗, 1 − C⟩, where T∗ is the optimal transport matrix.
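A minimal sketch of the Wasserstein reward in Definition 1 is given below, using the cosine cost over word embeddings and a few Sinkhorn steps as a stand-in for the IPOT solver used in the paper; embed_hyp and embed_ref are hypothetical inputs.

```python
# Sketch of the Wasserstein reward in Definition 1: cosine cost between word
# embeddings, uniform (length-normalized) weights, and reward r_s = <T*, 1 - C>.
import numpy as np

def cosine_cost(Z1, Z2):
    Z1 = Z1 / np.linalg.norm(Z1, axis=1, keepdims=True)
    Z2 = Z2 / np.linalg.norm(Z2, axis=1, keepdims=True)
    return 1.0 - Z1 @ Z2.T                                   # c_cos(z, z')

def wasserstein_reward(embed_hyp, embed_ref, eps=0.05, n_iter=100):
    """embed_hyp: (T, d) hypothesis word embeddings; embed_ref: (T', d) reference."""
    C = cosine_cost(embed_hyp, embed_ref)
    u = np.full(C.shape[0], 1.0 / C.shape[0])                # p_Y : mass 1/T per token
    v = np.full(C.shape[1], 1.0 / C.shape[1])                # p_Y': mass 1/T' per token
    K = np.exp(-C / eps)
    a, b = np.ones_like(u), np.ones_like(v)
    for _ in range(n_iter):                                  # approximate T*
        a = u / (K @ b)
        b = v / (K.T @ a)
    T_star = a[:, None] * K * b[None, :]
    return float(np.sum(T_star * (1.0 - C)))                 # r_s(Y, Y')

r = wasserstein_reward(np.random.randn(7, 64), np.random.randn(9, 64))
```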
Nested-Wasserstein distance  Our ultimate goal is to measure the distance between two policy distributions instead of sequence pairs. Given two sets of sequences from two policies, one aims to incorporate the semantic information between sequences into the distance measure. To this end, we propose the nested-Wasserstein distance in Definition 2. Figure 1 illustrates the nested-Wasserstein distance, considering both word- and sequence-level matching with the Wasserstein distance.

Definition 2 (Nested-Wasserstein Distance)  Consider two sets of sequences Y = {Yi}_{i=1}^{K} and Y′ = {Y′j}_{j=1}^{K′} drawn from two sequence distributions PY and PY′, where K and K′ are the numbers of sequences in Y and Y′. The nested-Wasserstein distance, denoted Wnc(PY, PY′), is a metric measuring the distance between PY and PY′, defined in a nested manner:

    Wnc(PY, PY′) ≜ min_{Ts} Σ_{i=1}^{K} Σ_{j=1}^{K′} Ts_{ij} Wc(p_{Yi}, p_{Y′j}) ,        (5)

where Ts_{ij} ≥ 0 satisfies Σ_j Ts_{ij} = 1/K and Σ_i Ts_{ij} = 1/K′, and Wc(·, ·) denotes the c-Wasserstein distance defined in (4).
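The two-level structure of (5) can be sketched as follows, assuming uniform weights at both levels and again using a Sinkhorn stand-in for the OT solver; function names are illustrative.

```python
# Minimal two-level sketch of Eq. (5): the word-level Wasserstein distance between
# every pair of sequences becomes the ground cost of a second, sequence-level OT problem.
import numpy as np

def uniform_ot(C, eps=0.05, n_iter=100):
    """Approximate plan and cost for uniform marginals on an (n, m) cost matrix."""
    n, m = C.shape
    u, v = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-C / eps)
    a, b = np.ones(n), np.ones(m)
    for _ in range(n_iter):
        a = u / (K @ b)
        b = v / (K.T @ a)
    T = a[:, None] * K * b[None, :]
    return T, float(np.sum(T * C))

def cosine_cost(Z1, Z2):
    Z1 = Z1 / np.linalg.norm(Z1, axis=1, keepdims=True)
    Z2 = Z2 / np.linalg.norm(Z2, axis=1, keepdims=True)
    return 1.0 - Z1 @ Z2.T

def nested_wasserstein(seqs_a, seqs_b):
    """seqs_a, seqs_b: lists of (length, dim) word-embedding arrays for two sequence sets."""
    # First level: sequence ground metric Wc(p_Yi, p_Y'j) for every pair.
    D = np.array([[uniform_ot(cosine_cost(Yi, Yj))[1] for Yj in seqs_b] for Yi in seqs_a])
    # Second level: OT over sequences, with D as the cost matrix.
    T_s, w_nc = uniform_ot(D)
    return T_s, w_nc

rng = np.random.default_rng(0)
A = [rng.normal(size=(6, 32)) for _ in range(3)]
B = [rng.normal(size=(8, 32)) for _ in range(4)]
T_s, W_nc = nested_wasserstein(A, B)
```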
Remark 1  The word "nested" comes from the definition in (5), which essentially consists of two nested levels of Wasserstein distances. The proposed nested-Wasserstein distance brings in the semantic information via the distance measure Wc at the first level. Note that we have omitted the expectation over samples in (5) for simplicity, as we essentially use a single set of samples to approximate Wnc(·, ·) in our algorithms.

Sample-based estimation of the nested-Wasserstein distance  Computing the exact Wasserstein distance is computationally intractable [Arjovsky et al., 2017, Genevay et al., 2018, Salimans et al., 2018], let alone the proposed nested-Wasserstein distance. Fortunately, we can employ the recently proposed IPOT algorithm [Xie et al., 2018] to obtain an efficient approximation. Specifically, IPOT performs proximal gradient descent to solve for the optimal transport matrix T via the iterative optimization T^(t+1) = argmin_{T∈Π(µ,ν)} { ⟨T, C⟩ + γ · DKL(T, T^(t)) } , where
1/γ > 0 is the generalized step size and the generalized KL-divergence DKL(T, T^(t)) = Σ_{i,j} Tij log(Tij / T^(t)_{ij}) − Σ_{i,j} Tij + Σ_{i,j} T^(t)_{ij} is used as the proximity metric. Standard Sinkhorn iterations [Cuturi, 2013] are used to solve the above sub-problem. IPOT was designed to approximately calculate the standard Wasserstein distance. Here we extend it to calculate the nested-Wasserstein distance by applying IPOT twice in a nested manner, i.e., at the sequence and distribution levels, respectively. The full IPOT procedure is summarized as Algorithm 2 in Appendix B.
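A compact sketch of this IPOT proximal-point iteration [Xie et al., 2018] is given below, with illustrative variable names rather than the paper's Algorithm 2. Applying the same solver first to the word-level cost matrices and then to the resulting K × K′ matrix of sequence distances yields the nested estimate used for Wnc.

```python
# Sketch of the IPOT proximal-point iteration described above: each outer step solves
# argmin_T <T, C> + gamma * D_KL(T, T^(t)) with a few Sinkhorn-style inner updates.
import numpy as np

def ipot(C, u, v, beta=1.0, n_outer=100, n_inner=1):
    """C: (n, m) cost; u, v: marginals; beta plays the role of 1/gamma."""
    n, m = C.shape
    G = np.exp(-C / beta)                      # proximal kernel
    T = np.ones((n, m)) / (n * m)              # initial plan
    b = np.ones(m) / m
    for _ in range(n_outer):
        Q = G * T                              # kernel of the current proximal sub-problem
        for _ in range(n_inner):               # Sinkhorn iterations on the sub-problem
            a = u / (Q @ b)
            b = v / (Q.T @ a)
        T = a[:, None] * Q * b[None, :]        # T^(t+1)
    return T, float(np.sum(T * C))             # plan and approximate W_c

n, m = 5, 7
C = np.abs(np.random.randn(n, m))
T, W = ipot(C, np.full(n, 1 / n), np.full(m, 1 / m))
```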
4    Nested-Wasserstein Self-Imitation Learning

[Figure 2: Illustration of the proposed nested-Wasserstein Self-Imitation Learning (WSIL) framework, where Wasserstein self-imitation rewards are defined to encourage the generator to imitate samples from the replay buffer. The standard RL framework (generator, generated sequences, standard rewards) is shown in the gray dotted box; high-reward sequences are stored in the replay buffer, and sequences drawn from the buffer provide Wasserstein rewards.]

Purely adopting the nested-Wasserstein distance as the reward in a standard policy-gradient method is not effective, because the syntactic information is missing. Specifically, we consider sequences generated from a conditional behavior policy πθ,X, parameterized by θ with conditioning variable X. For example, in image captioning, each sequence is generated conditioned on a given image. For unconditional generation, the conditioning variable is empty. Instead of combining rewards with different weights [Liu et al., 2017, Pasunuru et al., 2017], we present the nested-Wasserstein Self-Imitation Learning (WSIL) framework, which provides a novel way of leveraging both syntactic (metric) and semantic (Wasserstein) information.

The overall idea of the proposed nested-Wasserstein self-imitation learning is to define a Wasserstein trust-region between the current policy (a.k.a. behavior policy) and the artificial policy defined by the replay buffer. Intuitively, the Wasserstein trust-region encourages the self-imitation of historical high-reward sequences, which provides semantic signals to guide training, in addition to the stabilizing effect of trust-region optimization. Furthermore, a replay buffer is used to store high-reward historical sequences, whose induced conditional policy is denoted πB,X. Our new objective function with a Wasserstein trust-region is defined as:

    J(πθ) = E_{X∼pd} { E_{Y^s∼πθ,X}[r(Y^s)] − λ · Wnc(πθ,X, πB,X) } ,        (6)

where Wnc is the nested-Wasserstein distance defined in Definition 2, and r(·) can be a metric reward between Y^s and the ground-truth references Y. With a slight abuse of notation, but for conciseness, we use πθ to denote both the policy and the distribution over sequences. Distinct from classic trust-region policy optimization, which defines the trust region based on the KL-divergence [Schulman et al., 2015], WSIL defines the trust region based on the nested-Wasserstein distance between the behavior policy πθ,X and the artificial policy πB,X. Note that when K = K′ = 1, the nested-Wasserstein distance degenerates to the Wasserstein distance between two sequences.

Remark 2  Unconditional Generation: By considering samples (features) themselves as discrete distributions, we replace the mean-square difference over features of sequence pairs, i.e., the Euclidean norm, with the Wasserstein distance. Then, for the distributions of sequences, we again adopt the Wasserstein distance as in WGAN [Arjovsky et al., 2017], but in the discrete domain. Thus, the Wasserstein distance is defined in a nested manner.

Remark 3  Conditional Generation: We replace the exact matching of sequence pairs used by metric rewards in RL training with the Wasserstein distance. In this case, we are matching two conditional distributions with the Wasserstein distance, instead of matching the generated sentence with all reference sentences on average. This is more suitable, as a generated sentence does not necessarily need to match all the references.

For simplicity, we sometimes omit the first expectation E_{X∼pd}. With the proposed nested-Wasserstein distance, we propose the Wasserstein self-imitation scheme in (6), as illustrated in Figure 2. We seek to use historical high-reward sequences to define a "self-imitation" reward function, which is then combined with the original reward function to update the generator with policy-gradient methods. Intuitively, higher self-imitation rewards are achieved when the generated sequences are close to historical high-reward sequences. Thus the generator is guided to perform self-imitation, and we call this method indirect nested-Wasserstein self-imitation learning (WSIL-I). The word "indirect" comes from the mechanism by which historical sequences interact with the policy indirectly, via the self-imitation reward.

WSIL-I incorporates a self-imitation reward, denoted rs(Y^s, Y^b), into the objective function. Here Y^b
denotes a sample from the replay buffer and Y^s denotes a sample from the current policy. To this end, we replace the Wasserstein distance Wc in the nested-Wasserstein distance with rs(Y^s, Y^b) in the general objective (6). Specifically, we define the two sets of sample sequences from πθ,X and πB,X to be Y^s ≜ {Y^s_i}_{i=1}^{K} and Y^b ≜ {Y^b_j}_{j=1}^{K′}, with sizes K and K′, respectively. Here Y^s_i ∼ πθ,X and Y^b_j ∼ πB,X, ∀j. {Y^s_i}_{i=1}^{K} and Y^b are used in calculating the nested-Wasserstein distance. Let rns(Y^s_i, Y^b) ≜ Σ_j Ts_{ij} rs(Y^s_i, Y^b_j) be the nested-Wasserstein reward, with Ts = {Ts_{ij}} the optimal weights at the distribution level. Based on (6), the objective of WSIL-I is adapted to be:

    JI(πθ) ≜ E_{X∼pd} E_{Y^s∼πθ,X} [ r(Y^s) + λ rns(Y^s, Y^b) ] ,        (7)

where r is the original RL reward and rns is the nested-Wasserstein reward. Since not all historically explored samples are helpful for updating the current policy, we only consider a subset of the high-reward sequences when performing self-imitation. Using K trajectories sampled i.i.d. from πθ and introducing a baseline b, the gradient estimate of WSIL-I is expressed as:

    ∇θ JI(πθ) ≈ − Σ_{k=1}^{K} [ (r(Y^s_k) − b) ∇θ log πθ(Y^s_k) + λ rns(Y^s_k, Y^b) ∇θ log πθ(Y^s_k) ] .        (8)

In practice, the indicator I(r(Y^b) > r(Y^s)) is combined with the nested-Wasserstein rewards, where I(·) = 1 if the condition is satisfied and 0 otherwise; b is the baseline used to stabilize training. If the reward of a historical high-reward sequence is greater than that of the current one (i.e., r(Y^b) > r(Y^s)), the generator learns to imitate this high-reward sequence. Otherwise, the update based on the historical sequence is not performed, due to the I(·) operator. This encourages the agent to imitate only its good historical explorations. We have also developed another way to implement (direct) WSIL (WSIL-D), as discussed in Appendix A. Algorithm 1 describes the general implementation procedure of WSIL.

Algorithm 1  Nested-Wasserstein Self-Imitation.
Require: Generator policy πθ; a sequence dataset D = {Y_{1:T}}_{1}^{N}; a possibly empty condition X = {X}_{1}^{N}.
  Initialize πθ and replay buffer B.
  Pretrain generator πθ with MLE.
  repeat
      Generate K sequences Y^s = {Y^s_k}_{k=1}^{K}, where Y^s_k ∼ πθ.
      Update replay buffer B using Y^s.
      if Self-Imitation then
          Sample K′ sequences Y^b = {Y^b_j}_{j=1}^{K′}, where Y^b_j ∼ πB.
          Estimate the OT matrices T and Ts via IPOT.
          Compute rns(Y^s_k, Y^b) and update πθ with (8).
      else
          Update the generator πθ with (3) using Y^s.
      end if
  until convergence
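A sketch of the resulting update step, with hypothetical tensor inputs and the indicator gating described above, is given below; it is not the authors' code.

```python
# Sketch of one WSIL-I policy update following Eq. (8): the standard policy-gradient
# term plus the self-imitation term, gated by the indicator I(r(Y^b) > r(Y^s)).
import torch

def wsil_i_loss(log_probs, rewards, r_ns, buffer_reward, baseline, lam=0.5):
    """log_probs: log pi_theta(Y^s_k), shape (K,); rewards: r(Y^s_k), shape (K,);
       r_ns: nested-Wasserstein self-imitation rewards r_ns(Y^s_k, Y^b), shape (K,);
       buffer_reward: r(Y^b) used in the indicator, shape (K,) or scalar."""
    advantage = rewards - baseline
    gate = (buffer_reward > rewards).float()          # I(r(Y^b) > r(Y^s))
    self_imitation = lam * gate * r_ns
    # Minimizing this loss performs the combined update of Eq. (8).
    return -((advantage + self_imitation).detach() * log_probs).mean()

K = 8
loss = wsil_i_loss(log_probs=torch.randn(K, requires_grad=True),
                   rewards=torch.rand(K), r_ns=torch.rand(K),
                   buffer_reward=torch.rand(K), baseline=0.5)
loss.backward()
```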
Exploration Efficiency  The exploration space of MLE is the set of examples in the training set [Tan et al., 2018], i.e., no exploration is performed in supervised training. In contrast, standard policy optimization [Ranzato et al., 2016] basically allows exploration of the whole space. However, this exploration may become inefficient because it is too flexible, and good sequences observed in the past tend to be under-explored and under-imitated due to the sparse rewards. Our proposed WSIL aims to provide more efficient and systematic exploration: it still allows whole-space exploration, but re-weights the exploration space via the Wasserstein trust-region to focus more on the explorations that are likely to provide better performance.

[Figure 3: Exploration space of different methods (MLE, RL, WSIL). Circle: ground truth; Star: high-reward sequences.]

Increasing Self-Imitation  According to the theory of Wasserstein gradient flows [Villani, 2008], 1/λ can be interpreted as a generalized decaying learning rate. With more exploration, λ becomes larger and the algorithm should focus more on self-imitation learning, providing a guideline for balancing standard RL training and self-imitation learning. More details are provided in Appendix B. In practice, the nested-Wasserstein term provides weak supervision focused on semantic matching, which is reasonable since the historical high-reward sequences contain some noise.

5    Related Work

Optimal transport  Kusner et al. [2015] proposed the word mover's distance (WMD) and first applied optimal transport (OT) to NLP; OT has also been employed to improve topic modeling [Huang et al., 2016]. The transportation cost is usually defined as the Euclidean distance, and the OT distance is approximated by solving a Kantorovich-Rubinstein dual [Gulrajani et al., 2017] or a less-accurate lower bound [Kusner et al., 2015]. Yurochkin et al. [2019] proposed a hierarchical OT representation for documents, but the hierarchy was at the word and topic level, based on the WMD. Our work considers the nested-Wasserstein distance, presenting an efficient IPOT-based implementation for OT distance approximation [Xie et al., 2018] and successfully using it to guide sequence generation.
Self-Imitation Learning  Experience replay has been widely considered in RL. Deterministic policy gradient [Silver et al., 2014, Lillicrap et al., 2016] performs experience replay, but is limited to continuous control. Actor-critic approaches [Konda and Tsitsiklis, 2000] can also utilize a replay buffer to improve performance. Prioritized experience replay [Schaul et al., 2015] samples trajectories based on the temporal-difference error, and we adopt it in our implementation. These approaches indiscriminately buffer all experiences, while the approach proposed here only buffers high-reward experiences. Further, episodic control [Lengyel and Dayan, 2008] can be regarded as an extreme way of exploiting past experience, trying to reproduce its best past decisions, but retrieving states leads to poor efficiency and generalization at test time. Self-imitation learning was first applied to Atari games and MuJoCo [Oh et al., 2018, Gangwani et al., 2018], reporting performance improvements under sparse rewards. Compared with that work, our solution considers a novel self-imitation learning scheme in the context of sequence generation.

RL for Sequence Generation  RL techniques have been explored in detail for sequence generation. For example, a Seq2Seq model can be trained by directly optimizing BLEU/ROUGE scores via policy gradient [Ranzato et al., 2016, Bahdanau et al., 2017]. Furthermore, Rennie et al. [2016] baseline the actor with the reward of a greedy-decoding sequence for the REINFORCE method. Model-based RL and hierarchical RL have also been studied for sequence generation [Zhang et al., 2018a, Huang et al., 2019]. Further, a learned discriminator (or critic) can also be used to provide sequence-level guidance. By constructing different objectives, previous work [Yu et al., 2017, Lin et al., 2017, Guo et al., 2017, Fedus et al., 2018] combines the policy-gradient algorithm with the original GAN training procedure. However, mode-collapse problems make the training of these methods challenging. By contrast, we propose the use of self-imitation learning and maintain a replay buffer to exploit past good explorations.

6    Experiments

We evaluate the proposed method on both unconditional and conditional text-generation tasks, considering standard benchmark datasets. Our approach achieves state-of-the-art results on unconditional text generation and video captioning. We also observe improved performance on image captioning, though relying on much simpler features than prior state-of-the-art methods. We further perform ablation studies to understand the improvements brought by self-imitation and Wasserstein rewards individually. Details of the datasets, experimental setup and model architectures are provided in Appendix C.

Implementation Details  A few key techniques are required for successful model training. (i) The reward from a greedy-decoding sentence is used as the baseline [Rennie et al., 2016] in conditional text generation; in unconditional text generation, a constant baseline is used. (ii) A single large replay buffer is maintained for unconditional generation, and multiple replay buffers are maintained for the different conditions in conditional generation. (iii) For each pair of sentences, the shorter one should be padded to the same length as the longer one for a balanced optimal transport; this is a key implementation technique.
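A minimal sketch of points (ii) and (iii) is given below, with illustrative names and an arbitrary pad token; it is only one reasonable way to realize these details, not the authors' implementation.

```python
# Sketch of (ii) per-condition replay buffers keeping only the highest-reward
# sequences, and (iii) equal-length padding of a sentence pair before building
# the OT cost matrix.
from collections import defaultdict
import heapq

class HighRewardBuffer:
    """Keeps the top-`capacity` (reward, sequence) pairs per condition key."""
    def __init__(self, capacity=64):
        self.capacity = capacity
        self.buffers = defaultdict(list)        # condition -> min-heap of (reward, seq)

    def add(self, condition, sequence, reward):
        heap = self.buffers[condition]
        item = (reward, tuple(sequence))
        if len(heap) < self.capacity:
            heapq.heappush(heap, item)
        else:
            heapq.heappushpop(heap, item)       # drop the lowest-reward entry

    def sample_best(self, condition, k=4):
        return heapq.nlargest(k, self.buffers[condition])

def pad_pair(seq_a, seq_b, pad_id=0):
    """Pad the shorter sequence so both have equal length (balanced OT marginals)."""
    length = max(len(seq_a), len(seq_b))
    return (seq_a + [pad_id] * (length - len(seq_a)),
            seq_b + [pad_id] * (length - len(seq_b)))

buf = HighRewardBuffer()
buf.add(condition="img_001", sequence=[5, 8, 2], reward=0.71)
a, b = pad_pair([5, 8, 2], [5, 8, 2, 9, 4])
```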
[Figure 4: Demonstration of the nested-Wasserstein distance at the word level (left) and sentence level (right).]

[Figure 5: An example of image captioning. The generated sentence on the right is better but is given a lower CIDEr score.]

Demonstration of nested-Wasserstein  Figure 4 shows the optimal matching at the word level (T) and sentence level (Ts). It is interesting to see that similar words (e.g., bike and cycle) are matched with each other (higher weights), which cannot be achieved via exact hard-matching metrics. At the distribution level, we show an example from the captioning tasks, where we have five reference and five hypothesis sentences. Traditional methods match a hypothesis sentence to each of the references and average over them, whereas our method performs distributional semantic matching, i.e., matching only similar references instead of all of them. For example, the third hypothesis is almost entirely matched with the fifth reference, because they are more similar. This is reasonable: the references are usually very different, and matching equally with all of them is confusing for the generator. As shown in Figure 5, CIDEr focuses more on local fluency and equal matching with all references, while nested-Wasserstein performs distributional semantic matching. More examples are provided in the Appendix.

        Method                          Test-BLEU-2          3        4       5     Self-BLEU-2       3      4
        MLE [Caccia et al., 2018]                 0.902   0.706     0.470   0.392           0.787   0.646   0.485
        SeqGAN [Yu et al., 2017]                  0.820   0.604     0.361   0.211           0.807   0.577   0.278
        RankGAN [Lin et al., 2017]                0.852   0.637     0.389   0.248           0.822   0.592   0.230
        TextGAN [Zhang et al., 2017]              0.910   0.728     0.484   0.306           0.806   0.548   0.217
        FMGAN [Chen et al., 2018]                 0.911   0.782     0.584   0.382           0.834   0.643   0.405
        LeakGAN [Guo et al., 2017]                0.922   0.797     0.602   0.416           0.912   0.825   0.689
        WSIL-D (ours)                             0.917   0.774     0.576   0.393           0.797   0.569   0.284
        WSIL-I (ours)                             0.922   0.778     0.576   0.396           0.813   0.600   0.326
                       Table 2: Test-BLEU (↑) and Self-BLEU (↓) scores on Image COCO.
        Method                          Test-BLEU-2          3        4       5     Self-BLEU-2       3      4
        MLE [Caccia et al., 2018]                 0.905   0.701     0.464   0.278           0.764   0.522   0.295
        SeqGAN [Yu et al., 2017]                  0.630   0.354     0.164   0.087           0.728   0.411   0.139
        RankGAN [Lin et al., 2017]                0.723   0.440     0.210   0.107           0.672   0.346   0.119
        TextGAN [Zhang et al., 2017]              0.777   0.529     0.305   0.161           0.806   0.662   0.448
        FMGAN [Chen et al., 2018]                 0.913   0.751     0.512   0.315           0.830   0.682   0.427
        LeakGAN [Guo et al., 2017]                0.923   0.757     0.546   0.335           0.837   0.683   0.513
        SIL-D (ours)                              0.875   0.634     0.401   0.243           0.724   0.466   0.256
        SIL-I (ours)                              0.869   0.633     0.399   0.242           0.710   0.455   0.263
        WSIL-D (ours)                             0.931   0.736     0.503   0.317           0.795   0.553   0.299
        WSIL-I (ours)                             0.926   0.726     0.492   0.307           0.815   0.595   0.380
                Table 3: Test-BLEU (↑) and Self-BLEU (↓) scores on EMNLP2017 WMT News.

6.1   Unconditional Text Generation

We compare our approach with a number of related RL-based GAN models for unconditional text generation [Guo et al., 2017, Lin et al., 2017, Yu et al., 2017, Zhang et al., 2017]. Our implementation is developed on top of the LeakGAN model, by incorporating Wasserstein self-imitation learning. All baseline experiments are performed on the Texygen platform [Zhu et al., 2018]. The corpus-level BLEU score is employed to evaluate the generated sentences. Specifically, we follow the strategy of Yu et al. [2017], Guo et al. [2017] and compute BLEU with the test set as references (test-BLEU) and with the generated samples themselves as references (self-BLEU). Test-BLEU evaluates the quality of generated samples, and self-BLEU measures their diversity. The BLEU scores of 1,000 generated sentences are averaged to obtain the final score for each model. A good generator should achieve both a high test-BLEU score and a low self-BLEU score. Following previous work [Guo et al., 2017], we test the proposed method on short and long text generation with the Image COCO and EMNLP2017 WMT News datasets. The BLEU scores for the different methods are provided in Tables 2 and 3.

Analysis    Compared with other methods, LeakGAN, WSIL-D and WSIL-I achieve comparable test-BLEU scores, demonstrating high-quality generated sentences. However, LeakGAN tends to over-fit the training data, leading to much higher (worse) self-BLEU scores. Our proposed methods, by contrast, show good diversity of the generated text, with lower self-BLEU scores. The other baselines obtain both low self-BLEU and low test-BLEU scores, leading to more random generations.

Ablation Study    We conduct ablation studies on EMNLP2017 WMT News to investigate the improvement brought by each part of WSIL. First, we test the benefits of the two types of self-imitation schemes. We compare RL training with (i) plain self-imitation (SIL-D and SIL-I), where only a replay buffer and conventional matching (on features extracted from a neural network) are employed; and (ii) Wasserstein self-imitation (WSIL-D and WSIL-I). Results are shown in Table 3. We observe that the self-imitation strategy, with the specific replay-buffer construction, can alleviate the discrepancy between conventional rewards and the resulting reward-model bias (e.g., as reflected in self-BLEU). Without Wasserstein rewards, we achieve lower self-BLEU at the sacrifice of test-BLEU. When combined with Wasserstein rewards, WSIL-D and WSIL-I show superior performance relative to the baselines. The randomly selected generated samples in Appendix D and the human evaluations further validate this.

Sweeping the Temperature    To better evaluate the proposed method, we follow Caccia et al. [2018] and evaluate the trade-off between quality and diversity. We use the F1-BLEU score as a metric that considers both quality and diversity, defined as the harmonic mean of BLEU and 1 − Self-BLEU:

    F1-BLEU = 2 × BLEU × (1 − Self-BLEU) / (BLEU + (1 − Self-BLEU)).        (9)

Figure 6 indicates that WSIL is consistently better than the MLE model on the F1-BLEU-4 score.
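For concreteness, the following is a minimal sketch of this quality-diversity evaluation; it is not the exact evaluation script used in the experiments. It assumes tokenized sentences and uses NLTK's sentence-level BLEU, combining test-BLEU and self-BLEU as in Eq. (9).

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    def bleu4(hypothesis, references):
        # BLEU-4 for one tokenized hypothesis against a list of tokenized references.
        return sentence_bleu(references, hypothesis,
                             weights=(0.25, 0.25, 0.25, 0.25),
                             smoothing_function=SmoothingFunction().method1)

    def f1_bleu(test_bleu, self_bleu):
        # Harmonic mean of quality (test-BLEU) and diversity (1 - self-BLEU), as in Eq. (9).
        quality, diversity = test_bleu, 1.0 - self_bleu
        if quality + diversity == 0.0:
            return 0.0
        return 2.0 * quality * diversity / (quality + diversity)

    def evaluate(generated, test_refs):
        # generated / test_refs: lists of tokenized sentences (lists of tokens).
        test = sum(bleu4(g, test_refs) for g in generated) / len(generated)
        # Self-BLEU: each generated sentence scored against the other generated sentences.
        self_ = sum(bleu4(g, generated[:i] + generated[i + 1:])
                    for i, g in enumerate(generated)) / len(generated)
        return test, self_, f1_bleu(test, self_)

A single F1-BLEU number makes the quality-diversity trade-off directly comparable across temperature settings, which is how the curves in Figure 6 are produced.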
    Method                          BLEU-4   METEOR   ROUGE-L   CIDEr
    ED-LG [Yao et al., 2015]         35.2     25.2       -        -
    SA-LSTM [Xu et al., 2016]        36.6     25.9       -        -
    SCST [Pasunuru and Bansal, 2017] 40.5     28.4      61.4     51.7
    MBP [Wang et al., 2018b]         41.3     28.7      61.7     48.0
    Our Implementations
    MLE                              39.2     27.8      59.8     46.6
    MIXER [Ranzato et al., 2016]     40.2     27.9      60.8     50.3
    SCST [Rennie et al., 2016]       40.7     27.9      61.6     51.3
    WSIL-D                           42.5     29.0      62.4     52.1
    WSIL-I                           41.6     28.4      62.0     52.2

            Table 4: Video captioning results on MSR-VTT.

    Method                          BLEU-4   METEOR   ROUGE-L   CIDEr
    S & T [Vinyals et al., 2015]     27.7     23.7       -       85.5
    OT [Chen et al., 2019]           31.0     24.6       -       94.7
    Adaptive [Lu et al., 2017]       33.2     26.6       -      108.5
    TD [Anderson et al., 2017]       33.3     26.3      55.3    111.4
    Our Implementations
    MLE                              28.8     24.4      52.0     91.3
    MIXER [Ranzato et al., 2016]     30.8     24.7      52.9    101.2
    SCST [Rennie et al., 2016]       32.1     25.4      53.9    105.5
    WSIL-D                           31.8     25.7      54.0    107.4
    WSIL-I                           32.0     25.6      53.9    107.6

            Table 5: Image captioning results on COCO.
    Methods         MLE          LeakGAN      SIL-D        SIL-I
    Human scores    2.97±0.05    2.63±0.05    2.54±0.05    2.55±0.05

    Methods         Real         WSIL-D       WSIL-I       -
    Human scores    4.11±0.04    3.49±0.05    3.41±0.05    -

            Table 6: Results of human evaluation.

[Figure 6: F1-BLEU-4 on sweeping temperature on unconditional generation (x-axis: reverse temperature; curves: MLE, WSIL-D, WSIL-I); CIDEr scores of video captioning on the validation set (x-axis: epochs; curves: SCST, WSIL-D, WSIL-I).]

Human Evaluation    Simply relying on the above metrics is not sufficient to evaluate the proposed method [Caccia et al., 2018]. Following previous work [Guo et al., 2017], we performed an additional human evaluation on the EMNLP2017 WMT News dataset using Amazon Mechanical Turk. We required all workers to be native English speakers, with an approval rate higher than 95% and at least 100 assignments completed. Previous work has shown higher scores for LeakGAN compared with other baselines [Guo et al., 2017]; we therefore mainly focus on comparing our methods with LeakGAN. We randomly sampled 200 sentences from each model and asked 5 different workers to score each sentence on a scale of 1 to 5, considering its readability and meaning. Results are shown in Table 6, which indicates better performance of the proposed WSIL.

6.2   Conditional Text Generation

Video Captioning    We conduct experiments on the MSR-VTT dataset [Xu et al., 2016] for video captioning. MSR-VTT is a large-scale video dataset consisting of 20 video categories, split into 6,513 and 3,487 clips for the training and testing sets. Each video is annotated with about 20 captions. For each video, we sample frames at 3 fps and extract Inception-v4 [Szegedy et al., 2017] features from these sampled frames. We report BLEU-4 [Papineni et al., 2002], CIDEr [Vedantam et al., 2015], and METEOR [Banerjee and Lavie, 2005] scores. Results are summarized in Table 4. Consistent improvements are observed with the WSIL framework. WSIL-D performs slightly better than WSIL-I, and both yield much higher optimized CIDEr and METEOR scores than SCST. This indicates that Wasserstein self-imitation can improve the semantic matching between generated sentences and their references, while achieving reasonable exact-matching-based metric scores.

Image Captioning    We consider image captioning using the COCO dataset [Lin et al., 2014], which contains 123,287 images in total, each annotated with at least 5 captions. Following Karpathy's split [Karpathy and Fei-Fei, 2015], 113,287 images are used for training and 5,000 images are used for validation and testing. We follow the implementation of the SCST approach [Rennie et al., 2016], and use extracted image tags [Gan et al., 2017] as image features (encoder). We report BLEU-k (k from 1 to 4) [Papineni et al., 2002], CIDEr [Vedantam et al., 2015], and METEOR [Banerjee and Lavie, 2005] scores. Results are summarized in Table 5. Compared with the MLE baseline, RL-based methods significantly increase the overall performance under all evaluation metrics. We choose CIDEr as the optimizing metric, since it performs best [Rennie et al., 2016]. Our proposed WSIL shows improvement on most metrics compared with the SCST baseline. Examples of generated captions are provided in Appendix E.
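Both captioning setups optimize CIDEr with the self-critical baseline of SCST, as noted above. The snippet below is a minimal sketch of that reward construction, not the exact training code used here; it assumes the sampled captions' log-probabilities and the CIDEr scores of the sampled and greedy-decoded captions have already been computed.

    import torch

    def self_critical_loss(sampled_logprobs, sampled_cider, greedy_cider):
        # REINFORCE with a self-critical baseline (SCST-style), optimizing CIDEr.
        # sampled_logprobs : (batch,) sum of token log-probs of the sampled captions.
        # sampled_cider    : (batch,) CIDEr of sampled captions vs. their references.
        # greedy_cider     : (batch,) CIDEr of greedy-decoded captions (the baseline).
        advantage = sampled_cider - greedy_cider
        # Gradient flows only through the log-probabilities of the sampled captions.
        return -(advantage.detach() * sampled_logprobs).mean()

    # Example with dummy scores:
    loss = self_critical_loss(
        sampled_logprobs=torch.tensor([-12.3, -9.8], requires_grad=True),
        sampled_cider=torch.tensor([1.05, 0.92]),
        greedy_cider=torch.tensor([0.98, 0.95]),
    )
    loss.backward()

In WSIL, the self-imitation term of Eq. (11) is added on top of this standard RL objective, as detailed in Appendix A.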
7    Conclusions

We have proposed a novel Wasserstein self-imitation learning framework for sequence generation, to alleviate the sparse-reward problem of RL methods and the model-training bias imposed by conventional rewards. This is done by encouraging self-imitation and semantic matching in policy learning. Further, our method can be approximately interpreted as policy optimization with Wasserstein trust-regions. Experiments on unconditional and conditional text generation demonstrate consistent performance improvements over strong baselines. For future work, the proposed method has the potential to be applied to other interesting sequence-generation tasks such as program synthesis [Liang et al., 2018].
Acknowledgments    The authors would like to thank the anonymous reviewers for their insightful comments. The research was supported in part by DARPA, DOE, NIH, NSF and ONR.

References

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and VQA. In CVPR, 2017.
Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In ICML, 2017.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. An actor-critic algorithm for sequence prediction. In ICLR, 2017.
Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In ACL Workshop, 2005.
Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In NeurIPS, 2015.
Massimo Caccia, Lucas Caccia, William Fedus, Hugo Larochelle, Joelle Pineau, and Laurent Charlin. Language GANs falling short. arXiv:1811.02549, 2018.
Liqun Chen, Shuyang Dai, Chenyang Tao, Haichao Zhang, Zhe Gan, Dinghan Shen, Yizhe Zhang, Guoyin Wang, Ruiyi Zhang, and Lawrence Carin. Adversarial text generation via feature-mover's distance. In NeurIPS, 2018.
Liqun Chen, Yizhe Zhang, Ruiyi Zhang, Chenyang Tao, Zhe Gan, Haichao Zhang, Bai Li, Dinghan Shen, Changyou Chen, and Lawrence Carin. Improving sequence-to-sequence learning via optimal transport. In ICLR, 2019.
Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014.
Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In NeurIPS, 2013.
William Fedus, Ian Goodfellow, and Andrew M Dai. MaskGAN: Better text generation via filling in the _. In ICLR, 2018.
Zhe Gan, Chuang Gan, Xiaodong He, Yunchen Pu, Kenneth Tran, Jianfeng Gao, Lawrence Carin, and Li Deng. Semantic compositional networks for visual captioning. In CVPR, 2017.
Tanmay Gangwani, Qiang Liu, and Jian Peng. Learning self-imitating diverse policies. arXiv:1805.10309, 2018.
Aude Genevay, Gabriel Peyré, and Marco Cuturi. Learning generative models with Sinkhorn divergences. In AISTATS, 2018.
Xiaodong Gu, Kyunghyun Cho, Jungwoo Ha, and Sunghun Kim. DialogWAE: Multimodal response generation with conditional Wasserstein auto-encoder. In ICLR, 2019.
Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In NeurIPS, 2017.
Jiaxian Guo, Sidi Lu, Han Cai, Weinan Zhang, Yong Yu, and Jun Wang. Long text generation via adversarial training with leaked information. In AAAI, 2017.
Junxian He, Daniel Spokoyny, Graham Neubig, and Taylor Berg-Kirkpatrick. Lagging inference networks and posterior collapse in variational autoencoders. In ICLR, 2019.
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 1997.
Junjie Hu, Yu Cheng, Zhe Gan, Jingjing Liu, Jianfeng Gao, and Graham Neubig. What makes a good story? Designing composite rewards for visual storytelling. arXiv:1909.05316, 2019.
Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P Xing. Controllable text generation. In ICML, 2017.
Gao Huang, Chuan Guo, Matt J Kusner, Yu Sun, Fei Sha, and Kilian Q Weinberger. Supervised word mover's distance. In NeurIPS, 2016.
Qiuyuan Huang, Zhe Gan, Asli Celikyilmaz, Dapeng Wu, Jianfeng Wang, and Xiaodong He. Hierarchically structured reinforcement learning for topically coherent visual story generation. In AAAI, 2019.
Ferenc Huszár. How (not) to train your generative model: Scheduled sampling, likelihood, adversary? arXiv:1511.05101, 2015.
Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.
Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In NeurIPS, 2000.
Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. From word embeddings to document distances. In ICML, 2015.
Máté Lengyel and Peter Dayan. Hippocampal contributions to control: the third way. In NeurIPS, 2008.
Chen Liang, Mohammad Norouzi, Jonathan Berant, Quoc Le, and Ni Lao. Memory augmented policy optimization for program synthesis with generalization. In NeurIPS, 2018.
Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, et al. Continuous control with deep reinforcement learning. In ICLR, 2016.
Kevin Lin, Dianqi Li, Xiaodong He, Zhengyou Zhang, and Ming-Ting Sun. Adversarial ranking for language generation. In NeurIPS, 2017.
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
Siqi Liu, Zhenhai Zhu, Ning Ye, Sergio Guadarrama, and Kevin Murphy. Improved image captioning via policy gradient optimization of SPIDEr. In ICCV, 2017.
Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In CVPR, 2017.
Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. Advances in pre-training distributed word representations. In LREC, 2018.
Junhyuk Oh, Yijie Guo, Satinder Singh, and Honglak Lee. Self-imitation learning. In ICML, 2018.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In ACL, 2002.
Ramakanth Pasunuru and Mohit Bansal. Reinforced video captioning with entailment rewards. In NAACL, 2017.
Romain Paulus, Caiming Xiong, and Richard Socher. A deep reinforced model for abstractive summarization. In ICLR, 2017.
Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. In ICLR, 2016.
Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jarret Ross, and Vaibhava Goel. Self-critical sequence training for image captioning. In CVPR, 2016.
Alexander M Rush, Sumit Chopra, and Jason Weston. A neural attention model for abstractive sentence summarization. arXiv:1509.00685, 2015.
Tim Salimans, Han Zhang, Alec Radford, and Dimitris Metaxas. Improving GANs using optimal transport. In ICLR, 2018.
Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. In ICLR, 2015.
John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In ICML, 2015.
David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In ICML, 2014.
Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In NeurIPS, 2014.
Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, 2017.
Bowen Tan, Zhiting Hu, Zichao Yang, Ruslan Salakhutdinov, and Eric Xing. Connecting the dots between MLE and RL for sequence generation. arXiv:1811.09740, 2018.
Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-based image description evaluation. In CVPR, 2015.
Cédric Villani. Optimal Transport: Old and New. Springer Science & Business Media, 2008.
Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In CVPR, 2015.
Wenlin Wang, Zhe Gan, Hongteng Xu, Ruiyi Zhang, Guoyin Wang, Dinghan Shen, Changyou Chen, and Lawrence Carin. Topic-guided variational autoencoders for text generation. In NAACL, 2019.
Xin Wang, Wenhu Chen, Yuan-Fang Wang, and William Yang Wang. No metrics are perfect: Adversarial reward learning for visual storytelling. In ACL, 2018a.
Xin Wang, Wenhu Chen, Jiawei Wu, Yuan-Fang Wang, and William Yang Wang. Video captioning via hierarchical reinforcement learning. In CVPR, 2018b.
Sam Wiseman and Alexander M Rush. Sequence-to-sequence learning as beam-search optimization. In EMNLP, 2016.
Yujia Xie, Xiangfeng Wang, Ruijia Wang, and Hongyuan Zha. A fast proximal point method for Wasserstein distance. arXiv:1802.04307, 2018.
Jun Xu, Tao Mei, Ting Yao, and Yong Rui. MSR-VTT: A large video description dataset for bridging video and language. In CVPR, 2016.
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C Courville, Ruslan Salakhutdinov, Richard S Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.
Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, and Aaron Courville. Describing videos by exploiting temporal structure. In CVPR, 2015.
Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI, 2017.
Mikhail Yurochkin, Sebastian Claici, Edward Chien, Farzaneh Mirzazadeh, and Justin Solomon. Hierarchical optimal transport for document representation. In NeurIPS, 2019.
Ruiyi Zhang, Changyou Chen, Zhe Gan, Wenlin Wang, Liqun Chen, Dinghan Shen, Guoyin Wang, and Lawrence Carin. Sequence generation with guider network. arXiv:1811.00696, 2018a.
Ruiyi Zhang, Changyou Chen, Chunyuan Li, and Lawrence Carin. Policy optimization as Wasserstein gradient flows. In ICML, 2018b.
Yizhe Zhang, Zhe Gan, Kai Fan, Zhi Chen, Ricardo Henao, Dinghan Shen, and Lawrence Carin. Adversarial feature matching for text generation. In ICML, 2017.
Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. Texygen: A benchmarking platform for text generation models. In SIGIR, 2018.
       Supplementary Material of “Nested-Wasserstein Self-Imitation
                  Learning for Sequence Generation”

A     More Details about WSIL

Direct Nested-Wasserstein Self-Imitation Learning    Direct Wasserstein self-imitation learning (WSIL-D) weights the original rewards with outputs from the behavior policy for sequences in the replay buffer B. The sequences from the replay buffer are directly used as pseudo-samples to update the generator [Liang et al., 2018]. Similarly, define r_ns(Y^s, Y) ≜ Σ_j T_j^s r_s(Y^s, Y_j), with T* = {T_j^s} the optimal weights, to be the nested-Wasserstein reward between the sequence Y^s and the ground-truth references Y. The general objective (6) is then extended to the objective for WSIL-D, as

    J_D(π_θ) ≜ E_{Y^s ∼ π_{θ,X}} [ r(Y^s) ]                                      (10)
             + λ E_{Y^b ∼ π_{B,X}} [ r_ns(Y^b, Y) π_θ(Y^b) ],                     (11)

where r is the original RL reward and r_ns is the nested-Wasserstein reward. Based on the objective in (11), we update the generator with the standard RL loss and the self-imitation loss alternately, with a hyperparameter λ that controls the update frequency:

    ∇_θ J_D(π_θ) ≈ − Σ_{k=1}^{K} (r(Y^s_k) − b) ∇_θ log π_θ(Y^s_k)
                   − λ Σ_{j=1}^{K'} ( r_ns(Y^b_j, Y) − b_s )_+ ∇_θ log π_θ(Y^b_j),        (12)

where (·)_+ = max(·, 0), and b_s and b are baselines that reduce the variance of the gradient estimates. In practice, the (·)_+ operator means that WSIL-D only imitates the sequences in the replay buffer with higher rewards. Intuitively, direct self-imitation implicitly imposes larger weights on good simulated data during training, to exploit good historical explorations. The main difference between WSIL-D and its indirect counterpart is that sequences from the replay buffer are not used to compute the self-imitation rewards, but are used to evaluate the policy. Intuitively, WSIL-D changes the data distribution to explore the good history more efficiently.

Algorithm 2  IPOT for Wasserstein Rewards
 1: Input: feature vectors µ = {z_i}_{i=1}^{n}, ν = {z'_j}_{j=1}^{m}, and generalized stepsize 1/λ
 2: σ = (1/m) 1_m,  T^(1) = 1_n 1_m^⊤
 3: C_ij = c(z_i, z'_j),  A_ij = e^{−C_ij / λ}
 4: for t = 1, 2, 3, ... do
 5:     Q = A ⊙ T^(t)        // ⊙ is the Hadamard product
 6:     for k = 1, ..., K do
 7:         δ = 1 / (n Q σ),   σ = 1 / (m Q^⊤ δ)
 8:     end for
 9:     T^(t+1) = diag(δ) Q diag(σ)
10: end for
11: Return ⟨T, 1 − C⟩
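The following is a minimal NumPy sketch of Algorithm 2 for readers who want to reproduce the Wasserstein-reward computation; it is not the released implementation. The cosine cost and the default values of the proximal weight `lam` and iteration counts are illustrative assumptions, and any cost c(z, z') could be substituted.

    import numpy as np

    def ipot_wasserstein_reward(z_hyp, z_ref, lam=1.0, n_outer=50, n_inner=1):
        # z_hyp: (n, d) feature vectors of the hypothesis; z_ref: (m, d) feature
        # vectors of the references. Returns the transport plan T and <T, 1 - C>.
        n, m = len(z_hyp), len(z_ref)
        # Cosine cost c(z, z') = 1 - cos(z, z') as an example choice.
        zh = z_hyp / np.linalg.norm(z_hyp, axis=1, keepdims=True)
        zr = z_ref / np.linalg.norm(z_ref, axis=1, keepdims=True)
        C = 1.0 - zh @ zr.T
        A = np.exp(-C / lam)                      # A_ij = exp(-C_ij / lam)
        sigma = np.ones(m) / m
        T = np.ones((n, m))                       # T^(1) = 1_n 1_m^T
        for _ in range(n_outer):
            Q = A * T                             # Hadamard product
            for _ in range(n_inner):              # inner Sinkhorn-style updates
                delta = 1.0 / (n * (Q @ sigma))
                sigma = 1.0 / (m * (Q.T @ delta))
            T = np.diag(delta) @ Q @ np.diag(sigma)
        reward = np.sum(T * (1.0 - C))            # <T, 1 - C>
        return T, reward

In the captioning experiments the feature vectors would come from fixed fastText word embeddings, as described in Appendix B.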
B     Implementation Details

Replay Buffer Construction    In our algorithm, a metric is required to select the high-reward historical demonstrations that are stored in the replay buffer D. There are different ways of evaluating sentences:

i) For unconditional generation with synthetic data, following Chen et al. [2018], we adopt the negative log-likelihood (NLL) to measure model performance, as there exists an oracle data distribution. For this experiment, the replay buffer is constructed from generated sentences that achieved a higher reward from the learned discriminator.

ii) For unconditional generation with real data, since we use the test-BLEU and self-BLEU scores to evaluate generated sentences, we maintain a single large replay buffer with the F1-BLEU score as the selection criterion, to balance the quality-diversity trade-off [Gu et al., 2019]. The F1-BLEU score is defined as the harmonic mean of BLEU and 1 − Self-BLEU:

    F1-BLEU = 2 × BLEU × (1 − Self-BLEU) / (BLEU + (1 − Self-BLEU)).        (13)

iii) For conditional generation on the captioning tasks, we maintain a small replay buffer (K' = 5 sequences) for each conditional input; the overall buffer may seem large, but we only need to store sequences of indices, which is very efficient. Here we use the nested-Wasserstein reward as the metric.

iv) For conditional generation with non-parallel style transfer, we maintain a large replay buffer storing successfully transferred pairs, and we define a metric that considers both accuracy and content preservation: p(Right Style) × BLEU.
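A bounded, score-keyed buffer is enough to implement the selection rules above. The sketch below is one simple way to do it, assuming a `score_fn` standing in for whichever criterion applies (discriminator reward, F1-BLEU, or the nested-Wasserstein reward); it is illustrative rather than the actual data structure used in our code.

    import heapq
    import itertools

    class ReplayBuffer:
        # Keeps the top-`capacity` sequences according to a user-supplied score.
        def __init__(self, capacity, score_fn):
            self.capacity = capacity
            self.score_fn = score_fn
            self._heap = []                       # min-heap of (score, tie-breaker, sequence)
            self._counter = itertools.count()

        def add(self, sequence, *score_args):
            score = self.score_fn(sequence, *score_args)
            item = (score, next(self._counter), sequence)
            if len(self._heap) < self.capacity:
                heapq.heappush(self._heap, item)
            elif score > self._heap[0][0]:        # better than the current worst: replace it
                heapq.heapreplace(self._heap, item)

        def sample_all(self):
            return [seq for _, _, seq in self._heap]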
Balance between RL and Self-Imitation    According to the theory of Wasserstein policy gradients [Villani, 2008], 1/λ, as defined in (6), can be interpreted as a generalized decaying learning rate. With more exploration, λ becomes larger, and the algorithm should focus more on self-imitated learning. In practice, we perform one self-imitation update for every 10 RL training updates; as training proceeds, we increase the frequency of self-imitation, and finally update the generator with one step of self-imitation followed by one step of standard RL training.

The Trick of Soft-Argmax    Recall that in sequence generation, one first samples a token based on the policy, then feeds its token embedding into the RNN to compute the logits of the next token, and repeats this process until the stop token is generated. Instead of using the embedding of a sampled token, the soft-argmax trick feeds the RNN with a weighted average of the embeddings of the most-likely tokens. In particular, let E be the word-embedding matrix, g_t the logits under the current policy, and s_t the hidden state of the policy π_θ. With the soft-argmax trick, the state vector is updated by

    ỹ_t = E · softmax(g_t / β),            (14)
    s̃_t = h(s̃_{t−1}, e(ỹ_t)),              (15)

where 0 < β < 1 is the annealing factor; in practice, we set β = 0.01.
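A minimal PyTorch sketch of this soft-argmax update is given below, assuming an LSTM cell as the transition function h; the names (`embedding`, `lstm_cell`, `output_proj`) and dimensions are illustrative and do not refer to the released implementation.

    import torch
    import torch.nn as nn

    vocab_size, embed_dim, hidden_dim, beta = 5000, 512, 1024, 0.01

    embedding = nn.Embedding(vocab_size, embed_dim)       # rows of E
    lstm_cell = nn.LSTMCell(embed_dim, hidden_dim)        # transition function h
    output_proj = nn.Linear(hidden_dim, vocab_size)       # produces logits g_t

    def soft_argmax_step(logits, h, c):
        # Eq. (14): weighted average of token embeddings instead of a hard sample.
        probs = torch.softmax(logits / beta, dim=-1)      # (batch, vocab)
        y_tilde = probs @ embedding.weight                # (batch, embed_dim)
        # Eq. (15): feed the soft embedding back into the recurrent state update.
        h, c = lstm_cell(y_tilde, (h, c))
        next_logits = output_proj(h)
        return next_logits, h, c

Because the soft embedding is differentiable in the logits, gradients can flow through the whole generated sequence, which is why this trick can replace the policy-gradient update in the unconditional-generation setting described below.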
                                                              and 10,000 as the test set. For the COCO dataset, the
Discriminator implementation In unconditional                 learning rate of the generator is 0.0002, the learning
generation, instead of using policy gradient and the          rate of the manager is 0.0002 (we follow the LeakGAN
output of the discriminator as rewards, we use the soft-      work), and the maximum length of sequence is set to
argmax trick Hu et al. [2017]. Since the policy gradient      25.
is not stable enough and soft-argmax trick gives us
                                                              Following Zhu et al. [2018], we use the News section in
better performance (See our extensive experiments).
                                                              the EMNLP2017 WMT4 Dataset as our training data,
                                                              which consists of 646,459 words and 397,726 sentences.
Nested-Wasserstein rewards implementation                     After preprocessing, the training dataset contains 5,728
In conditional generation, the Wasserstein rewards is         words and 278,686 sentences. The learning rate of the
implemented based on COCO test tools, and we use              generator is 0.0002, the learning rate of the manager
the fasttext Mikolov et al. [2018] as the fixed word          is 0.0002, and the maximum length of sequence is set
embedding to compute the reward. In practice, we use          to 50. The number of hidden units used in both the
K = 5 with a hyper-parameter search from {3, 5, 8, 10}.       LSTM for the generator and the manager are set to
We will release this code, which is easy to use as other      128. The dimension of the word embedding is 300. The
metrics. For unconditional generation, we use the fixed       discriminator is a CNN with its structure specified in
learned word embedding via stop its gradient, where           Table 7.
the embedding and the Wasserstein trust region are
jointly optimized.
                                                              Settings of human evaluation We perform human
We conduct experiments on synthetic data similar to           evaluation using Amazon Mechanical Turk, evaluating
Yu et al. [2017], where our implementation is based           the text quality based on readability and meaning-
on LeakGAN. The result is shown in Figure 3, where            fulness (whether sentences make sense). We ask the
WSIL-I and WSIL-D show better performance than                worker to rate the input sentence with scores scaling
LeakGAN. Specifically, LeakGAN is not stable in the           from 1 to 5, with criterion listed in Table C. We re-
training and the Negative log-likelihood increases after      quire all the workers to be native English speakers,
150 epochs. Compared with LeakGAN, WSIL-I and                 with approval rate higher than 95% and at least 100
WSIL-D are more stable.                                       assignments completed.