OptAGAN: Entropy-based finetuning on text VAE-GAN


Paolo Tirotta
Department of Statistics, Alma Mater Studiorum University of Bologna
paolo.tirotta@gmail.com

Stefano Lodi
Department of Computer Science, Alma Mater Studiorum University of Bologna
stefano.lodi@unibo.it

arXiv:2109.00239v1 [cs.CL] 1 Sep 2021

                              Abstract

Transfer learning through large pre-trained models has changed the landscape of current applications in natural language processing (NLP). Recently Optimus, a variational autoencoder (VAE) which combines two pre-trained models, BERT and GPT-2, has been released, and its combination with generative adversarial networks (GANs) has been shown to produce novel, yet very human-looking text. The Optimus and GANs combination avoids the troublesome application of GANs to the discrete domain of text, and prevents the exposure bias of standard maximum likelihood methods. We combine the training of GANs in the latent space with the finetuning of the decoder of Optimus for single word generation. This approach lets us model both the high-level features of the sentences and the low-level word-by-word generation. We finetune using reinforcement learning (RL) by exploiting the structure of GPT-2 and by adding entropy-based intrinsically motivated rewards to balance between quality and diversity. We benchmark the results of the VAE-GAN model, and show the improvements brought by our RL finetuning on three widely used datasets for text generation, with results that greatly surpass the current state of the art for the quality of the generated texts.

1   Introduction

Unsupervised text generation finds its use in a plethora of real-world applications, ranging from machine translation (Wu et al., 2016), to summarization (Allahyari et al., 2017) and dialogue generation (Li et al., 2016). A general approach to modelling text sequences is to autoregressively generate the next token given the previous ones, and the most successful and widespread technique is to train a model using maximum likelihood estimation (MLE). This approach, however, is not without fault. At training time the model learns to generate a token given the ground truth, while at inference time it takes as input its own generated sequence of words. This dissimilarity leads to the so-called exposure bias (Bengio et al., 2015), where the accumulation of errors during inference can produce poor outputs. Furthermore, the loss function of MLE is very strict. For each sequence, only the token accounted for by the training sample is considered correct (Press et al., 2017), and the model learns precisely to mimic the given samples, often leading to quite dull and homogeneous outputs.

An alternative to MLE methods are generative adversarial networks (GANs) (Goodfellow et al., 2014), where a generator learns to create outputs that can fool a discriminator into believing they are real. Thus, GANs do not have the strict loss function of MLE, and do not suffer from exposure bias, as they learn to sample during training. Nonetheless, the application of GANs to the text realm has been rather complicated. Due to the discreteness of text, the sampling of each token results in a non-differentiable function, which does not allow backpropagating the loss of the discriminator. Countermeasures include the use of reinforcement learning (RL) (Yu et al., 2016; Lin et al., 2017; Guo et al., 2017; de Masson d'Autume et al., 2020), the use of the Gumbel-Softmax relaxation (Kusner and Hernández-Lobato, 2016; Nie et al., 2019), or avoiding the discrete space altogether and working with continuous embeddings using autoencoders (Zhao et al., 2017; Donahue and Rumshisky, 2018; Haidar et al., 2019). However, methods which utilize RL often rely on MLE pre-training, and usually do not improve over it (Caccia et al., 2020). Instead, for both the approaches using the Gumbel-Softmax distribution, and even more so for autoencoders, the discriminator considers a continuous representation of text, so it is not able to judge effectively the single word-by-word generation.
In the past few years, natural language processing (NLP) applications have found huge improvements with the introduction of the attention mechanism and the transformer architecture, with notable examples of BERT, GPT-2 and GPT-3 among others (Vaswani et al., 2017; Devlin et al., 2018; Radford et al., 2019; Brown et al., 2020). These kinds of language models are large deep neural networks that are able to understand the dependencies between words thanks to attention and are trained over huge amounts of unannotated data. As such, pre-trained language models provide better language understanding than recurrent neural networks, can be very easily finetuned on a downstream task, including text generation, and have reached state-of-the-art results in many areas. Recently Optimus, a text variational autoencoder (VAE), that is, an autoencoder which maps sentences to a meaningful latent space, has been proposed (Li et al., 2020). It combines both BERT and GPT-2, as encoder and decoder respectively, and can be employed both as a generative model and as a tool for language understanding tasks.

In this work, we aim to benchmark the results obtained from combining Optimus and GANs, similarly as indicated in the original paper. In doing so, we also investigate the GAN structure and compare the adaptive update strategy presented in (Ouyang and Agam, 2020) with the standard update strategy of GANs. Furthermore, we combine the training in the continuous space with the finetuning of the decoder of Optimus in the discrete text space, in a similar fashion as done in ConCreteGAN (Kim et al., 2020). However, differently from most approaches which use RL, we do not use REINFORCE, but add an additional value head to GPT-2, which outputs the intermediate rewards (Ziegler et al., 2019). Moreover, we modify the reward function by considering the entropy of the model when generating tokens, and favour diversity in the output by adding an intrinsic reward.

Thus, our model OptAGAN¹ is able to model the higher level sentence structure and has more control over single word generation, in a way that favours both quality and diversity for the generated sentences. We measure such criteria using standard automatic metrics: BLEU for quality, Backwards-BLEU for diversity, and Fréchet distance, on which we also present further analysis. We consider the image caption dataset COCO, the Stanford Natural Language Inference (SNLI) dataset, and the EMNLP News 2017 dataset for unconditional text generation, and also provide results for the conditional review dataset YELP.

¹Opt(imus) A(ugmented) GAN. Implementation can be found at https://github.com/Egojr/optagan

Results show that the base VAE-GAN model already improves over other GAN methods, especially with regards to quality. OptAGAN further improves over these results, and manages to handle the quality-diversity tradeoff very well. Moreover, we show a further experiment that helps understanding the strengths and weaknesses of our finetuning approach.

2   Background

In this section we introduce the mathematical notation and briefly describe the main theoretical tools which are used in OptAGAN. We also present an overview of the other methods of text generation.

2.1   Variational autoencoders

VAEs are generative models formed by two independent models, an encoder qφ and a decoder pθ. The encoder is tasked with mapping the input x to a latent space z that allows for interpolation. The decoder maps from z → x̃, providing an approximation of the original input. Thanks to the introduction of a local variation from sampling the encoder output, it is possible to induce a smooth latent representation of the inputs, which differs from the rigid space of autoencoders.

Optimus. Optimus combines the autoregressive nature of GPT-2 text generation with the latent produced by the encoder, such that text generation is done as:

    p_θ(x|z) = ∏_{t=1}^{n} p_θ(x_t | x_1, ..., x_{t−1}, z),    (1)

where the probability of each token is estimated conditionally on the latent embedding and the previous tokens. The latent vector z, which comes from the output of the BERT encoder, controls the high-level characteristics of the sentence, such as length, tense, style and topic, and allows for the guided generation of text.
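To make the factorization of Equation 1 concrete, the following is a minimal, self-contained sketch (not the actual Optimus code) of how a decoder conditioned on a latent vector z can generate a sentence token by token. The toy decoder_logits function, the vocabulary size and the random weights are assumptions for illustration only; in OptAGAN this role is played by the pre-trained GPT-2 decoder.

    import numpy as np

    rng = np.random.default_rng(0)
    VOCAB_SIZE, LATENT_DIM = 50, 16
    # Toy parameters standing in for the decoder; in Optimus these are pre-trained.
    W_token = rng.normal(size=(VOCAB_SIZE, VOCAB_SIZE)) * 0.1
    W_latent = rng.normal(size=(LATENT_DIM, VOCAB_SIZE)) * 0.1

    def decoder_logits(tokens, z):
        """Stand-in for p_theta(x_t | x_1..x_{t-1}, z): logits from the previous
        token plus a contribution of the latent vector z (Eq. 1)."""
        prev = np.zeros(VOCAB_SIZE)
        prev[tokens[-1]] = 1.0
        return prev @ W_token + z @ W_latent

    def sample_sentence(z, bos=0, eos=1, max_len=20):
        tokens = [bos]
        for _ in range(max_len):
            logits = decoder_logits(tokens, z)
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()
            nxt = int(rng.choice(VOCAB_SIZE, p=probs))  # sample x_t autoregressively
            tokens.append(nxt)
            if nxt == eos:
                break
        return tokens

    z = rng.normal(size=LATENT_DIM)  # in OptAGAN, z would come from BERT or the GAN generator
    print(sample_sentence(z))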
2.2   Generative adversarial networks

GANs are also generative models formed by two models: a generator and a discriminator. Differently from VAEs, the generator G samples from a random variable to produce outputs that can fool the discriminator D into believing they are real, while the discriminator is constantly learning to distinguish between the real and generated data. The objective of the two models can be formulated as:

    min_G max_D V(D, G) = E_{x∼p_D}[log(D(x))] + E_{ε∼p_ε}[log(1 − D(G(ε)))],    (2)

where p_D and p_ε are the distributions of the data and of the input noise of the generator, respectively. When combining GANs with autoregressive text generation, the operation of sampling the next tokens is non-differentiable, so the application of GANs relies on either the use of policy gradient algorithms, or the use of continuous approximations, such as the combination of GANs with autoencoders or VAEs.

2.3   Reinforcement learning

Approaches that use policy gradient algorithms to allow the training of GANs consider the generator as the policy π_θ to train, the sampling of a token as an action A from a state S, and the output of the discriminator as the reward R. For each sentence generated by the model a reward is calculated; however, as the discriminator only calculates the reward over finished sentences, the intermediate rewards, usually referring to the word-by-word generation, are obtained through the REINFORCE algorithm and Monte-Carlo rollout. Optimization of the parameters is performed through gradient ascent:

    θ_{t+1} = θ_t + α ∇_θ J(θ_t),    (3)

    ∇_θ J(θ_t) ∝ E_π[ G_t ∇_θ ln π(A_t | S_t, θ) ],    (4)

where the gradient of the REINFORCE objective, ∇_θ J(θ_t), is proportional to the discounted returns G_t = Σ_{j=0}^{t} γ^j R_j, so that higher return actions are favored, and is inversely proportional to the probability of being selected, so that higher probability actions are not at an advantage compared to low probability ones. Equation 4 also provides a value that can be sampled at each time step and only depends on the policy π.
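As a sketch of the update in Equations 3 and 4 (this is the baseline REINFORCE scheme, not the value-head approach OptAGAN uses in Section 3.2), the snippet below computes the discounted returns G_t and the resulting policy-gradient step for a toy categorical policy; the reward values and tensor sizes are illustrative placeholders.

    import torch

    torch.manual_seed(0)
    vocab_size, seq_len, gamma, alpha = 10, 5, 0.99, 1e-2
    logits = torch.zeros(seq_len, vocab_size, requires_grad=True)  # toy policy parameters theta

    # Sample a sequence of actions A_t and pretend the discriminator returned these rewards.
    dist = torch.distributions.Categorical(logits=logits)
    actions = dist.sample()
    rewards = torch.rand(seq_len)          # placeholder R_j, e.g. Monte-Carlo rollout scores

    # G_t = sum_{j=0..t} gamma^j R_j, the discounted returns of the text.
    discounts = gamma ** torch.arange(seq_len, dtype=torch.float32)
    returns = torch.cumsum(discounts * rewards, dim=0)

    # REINFORCE objective of Eq. 4: maximize E[ G_t * ln pi(A_t | S_t, theta) ].
    log_probs = dist.log_prob(actions)
    loss = -(returns.detach() * log_probs).mean()
    loss.backward()

    with torch.no_grad():                  # gradient ascent step, Eq. 3
        logits -= alpha * logits.grad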
2.4   Related work

Many works have dealt with the training of GANs in the discrete realm, starting with SeqGAN (Yu et al., 2016), LeakGAN (Guo et al., 2017) and RankGAN (Lin et al., 2017), all of which share a similar structure, mostly differing in the form of the discriminator, and require MLE pre-training followed by adversarial training with REINFORCE. ScratchGAN (de Masson d'Autume et al., 2020) is the first model to show that MLE pre-training can be avoided by carefully combining existing techniques. Other works that train GANs using continuous relaxations include ARAE (Zhao et al., 2017) and LATEXT-GAN (Haidar et al., 2019), which use autoencoders to learn a continuous latent representation. Models based on the Gumbel-Softmax distribution are RelGAN (Nie et al., 2019) and GSGAN (Kusner and Hernández-Lobato, 2016). On the comparison of these methods and the evaluation metrics, (Caccia et al., 2020; Semeniuta et al., 2019) have shown the inadequacy of current GANs when compared to MLE and the need for metrics that can better measure the quality and diversity of the models.

On the topic of exploration of text GAN models and RL, ColdGANs (Scialom et al., 2020) delve deeper into the effects of temperature on text generation. An approach similar to ours, which involves the use of large pre-trained models and RL, is TextGAIL (Wu et al., 2021), where GPT-2 as the generator and RoBERTa as the discriminator are used for the task of text generation.

Regarding text VAEs, Optimus (Li et al., 2020) is the first large pre-trained model of its kind, whereas previously researchers had tried developing VAEs using either recurrent neural networks (Bowman et al., 2015b) or semi-amortized inference (Kim et al., 2018).

3   OptAGAN

The architecture that we present in this work is composed of three main processes, as can be seen in Figure 1. Each process is independent of the others, so each part is trained sequentially.
[Figure 1: architecture diagram (Encoder, Decoder, Generator, Discriminator, BLEU reward). Legend: x = original text, x̃ = reconstructed text, x̂ = generated text, z = latent variable, ẑ = generated latent variable, ε = noise, R = rewards.]

Figure 1: Structure of our proposed model OptAGAN. In red is the main process to generate new text, while in black we show the other models taking part in the training of OptAGAN. Finally, in blue is the last bit of the RL finetuning process.

   • In order to fully utilize the strengths of Optimus, we finetune both the encoder and the decoder on the target dataset. The end results are a more separated and distinct latent space for each sentence, and a decoder which better reconstructs the original sentences.

   • Next, we train the GAN model composed of the generator and the discriminator. In the case of conditional generation, we also add a classifier, whose loss is then passed to the generator. Both the generator and the discriminator only consider the continuous latent embeddings, so they are much lighter and faster to train compared to other text GANs.

   • Finally, we finetune the decoder on discrete text using a value head, which estimates the reward of each single token in a sentence for the generated sequences. The estimated rewards are also augmented by considering the model entropy of each generated token. The gradient is then passed to the decoder through simple policy gradient.

The structure of both the generator and the discriminator is a simple feed-forward neural network. Current literature does not give clear answers about which loss function specification for GANs is best for continuous data such as text latent embeddings. Our experiments show that Wasserstein GANs with gradient penalty (WGAN-GP) perform the best over sliced Wasserstein distances, which produce very homogeneous outputs. Thus, the loss function to optimize for the generator and discriminator is:

    min_G max_D  E_{x∼p_D}[D(x)] − E_{ε∼p_ε}[D(G(ε))] + λ E_{x̂∼p(x̂)}[ (‖∇_{x̂} D(x̂)‖_2 − 1)^2 ].    (5)
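A minimal sketch of the WGAN-GP objective of Equation 5 for the latent-space generator and discriminator follows. The layer widths, the noise dimension and the single-step losses are illustrative assumptions and not the exact configuration of the released code; only the latent size of 768 and the gradient-penalty idea come from the text.

    import torch
    import torch.nn as nn

    latent_dim, noise_dim, gp_weight = 768, 100, 10.0
    G = nn.Sequential(nn.Linear(noise_dim, 512), nn.ReLU(), nn.Linear(512, latent_dim))
    D = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(), nn.Linear(512, 1))

    def d_loss(real_z):
        """Discriminator (critic) loss of Eq. 5 on a batch of real latent codes."""
        batch = real_z.size(0)
        fake_z = G(torch.randn(batch, noise_dim)).detach()
        # Gradient penalty, evaluated on random interpolations x_hat of real and fake codes.
        eps = torch.rand(batch, 1)
        x_hat = (eps * real_z + (1 - eps) * fake_z).requires_grad_(True)
        grad = torch.autograd.grad(D(x_hat).sum(), x_hat, create_graph=True)[0]
        penalty = ((grad.norm(2, dim=1) - 1) ** 2).mean()
        return -(D(real_z).mean() - D(fake_z).mean()) + gp_weight * penalty

    def g_loss(batch=32):
        """Generator loss: make D score the generated latent codes as real."""
        return -D(G(torch.randn(batch, noise_dim))).mean()

    print(d_loss(torch.randn(32, latent_dim)).item(), g_loss().item())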
3.1   Update strategy

When training GANs, the common update strategy is to have k update steps, usually in the range [5, 10], for the discriminator for each step of the generator, as this has been shown to give stable training for GANs. We indeed also use this update method, but we also experiment with an adaptive update strategy proposed in (Ouyang and Agam, 2020), where the choice to update the discriminator or the generator is given by a comparison of the loss change ratio of the two networks:

    r_G = |L^c_G − L^p_G| / (L^p_G + c),    r_D = |L^c_D − L^p_D| / (L^p_D + c),    (6)

where the relative change between the current loss L^c and previous loss L^p for both networks is used in determining which one gets updated. We also add an arbitrarily small constant c in case the losses are too close to 0. A weight λ can also be introduced, so that if r_D > λ r_G the discriminator is updated, and vice versa. Contrarily to the original paper, which suggests a value λ ≥ 1, we notice that at the beginning of training there is a stark imbalance in the number of updates between the networks, resulting in slower convergence. To better balance the training, we end up using a value of λ < 1, which converges to 1 with each passing epoch of training.
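A small sketch of the adaptive update decision of Equation 6: given the current and previous losses of the two networks, the loss change ratios decide which network is updated at this step. The numeric values and the default λ are placeholders.

    def choose_update(loss_g_curr, loss_g_prev, loss_d_curr, loss_d_prev,
                      lam=0.8, c=1e-8):
        """Return which network to update according to Eq. 6.
        lam < 1 at the start of training and is annealed towards 1 (see text)."""
        r_g = abs(loss_g_curr - loss_g_prev) / (loss_g_prev + c)
        r_d = abs(loss_d_curr - loss_d_prev) / (loss_d_prev + c)
        return "discriminator" if r_d > lam * r_g else "generator"

    # Example: the generator loss moved proportionally more than the discriminator loss,
    # so the generator receives the next update step.
    print(choose_update(1.10, 1.40, 0.52, 0.50))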
3.2   Value head

Due to the high computational costs of implementing a text discriminator with a vocabulary of size equal to that of GPT-2, we rely on an external value head, whose scalar output for each token corresponds to the intermediate reward. Let ρ be the hidden states of the decoder and ν the value head; the rewards calculated from an external metric R^ex and each state S_t are:

    R_t = ν(S_t | ρ),  with ν(S_t | ρ) minimizing  Σ_{t=0}^{T} ν(S_t | ρ) − R^ex.    (7)

The external head takes as input the frozen hidden states of GPT-2. Freezing the hidden parameters is necessary because we are not modelling the parameters of the VAE model during RL. The loss is then calculated according to an MAE objective and passed back to the value head. This process estimates how much each token contributes to the reward, and compared to a text discriminator it is faster to train and showed better results.
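The following is a minimal sketch of the value head described above: a linear layer on top of the frozen GPT-2 hidden states that predicts a per-token reward and is trained with an MAE objective against the sentence-level external reward R^ex (one possible reading of Equation 7). The hidden size of 768 comes from the text; the dummy hidden states and the example BLEU value are assumptions for illustration.

    import torch
    import torch.nn as nn

    hidden_size, seq_len = 768, 12
    value_head = nn.Linear(hidden_size, 1)          # nu(S_t | rho): scalar reward per token
    optimizer = torch.optim.Adam(value_head.parameters(), lr=1e-4)

    # Frozen hidden states rho of the decoder for one generated sentence (no gradient to GPT-2).
    hidden_states = torch.randn(seq_len, hidden_size)
    external_reward = torch.tensor(0.63)            # e.g. BLEU of the generated sentence

    per_token_rewards = value_head(hidden_states).squeeze(-1)   # R_t estimates
    # MAE objective: the summed per-token rewards should match the external reward.
    loss = torch.abs(per_token_rewards.sum() - external_reward)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()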
3.3   Entropy-based rewards

Our external rewards are calculated based on the quality metric BLEU, which we briefly describe in Section 4.1. Under many considered reward specifications, which included the addition of diversity metrics, maximum entropy RL, changes of temperature, or a combination of these, the increase in quality is counterbalanced by a drop in the diversity of the generated sentences. One approach that managed to balance the quality-diversity trade-off was the addition of an intrinsically motivated penalty based on the confidence of the model when generating the token, calculated by the entropy. If again we consider the hidden states ρ, the last layer calculating the logits as our policy π with parameters θ, and the entropy as H(π(·|ρ, S_t)), we calculate the intrinsic rewards and the performance objective as:

    R_t^in = log(clamp(H(π(·|ρ, S_t)), 0.2, 1)),    (8)

    ∇_θ J(θ_t) = E_π[ ( Σ_{k=0}^{t} γ^k R_k + R_t^in ) · ∇_θ ln π(A_t | ρ, S_t, θ) ].    (9)

This specification favours high-reward actions with high entropy, while low-entropy actions have to have a high enough reward to be able to keep their high probability, resulting in a more diverse generation. As a rule of thumb, we found that penalties should be lower than the maximum overall reward.
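As a sketch of Equations 8 and 9, the snippet below computes the entropy-based intrinsic reward for each generated token and the resulting policy-gradient loss used to finetune the decoder. The estimated per-token rewards would come from the value head of Section 3.2; here all tensors are illustrative placeholders and the raw logits stand in for the decoder parameters being finetuned.

    import torch

    torch.manual_seed(0)
    seq_len, vocab_size, gamma = 8, 100, 0.99
    logits = torch.randn(seq_len, vocab_size, requires_grad=True)   # policy pi(. | rho, S_t)

    dist = torch.distributions.Categorical(logits=logits)
    actions = dist.sample()                       # generated tokens A_t
    estimated_rewards = torch.rand(seq_len)       # R_k from the value head (placeholder)

    # Intrinsic reward, Eq. 8: over-confident (low-entropy) tokens get a negative bonus.
    entropy = dist.entropy()
    intrinsic = torch.log(torch.clamp(entropy, 0.2, 1.0))

    # Discounted returns plus the intrinsic term, Eq. 9.
    discounts = gamma ** torch.arange(seq_len, dtype=torch.float32)
    returns = torch.cumsum(discounts * estimated_rewards, dim=0) + intrinsic

    # Simple policy gradient; gradients flow into the policy parameters (here the raw logits).
    loss = -(returns.detach() * dist.log_prob(actions)).mean()
    loss.backward()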
4   Experimental settings

In this section, we introduce the automatic metrics and the datasets used for evaluation. For comparison we consider an MLE model, SeqGAN, RankGAN, as implemented by the benchmarking platform Texygen (Zhu et al., 2018), and ScratchGAN. We also provide further details on our RL finetuning and present issues with the current evaluation metrics.

4.1   Evaluation metrics

BLEU is a metric that measures the overlapping n-grams between a hypothesis text and all the reference texts. The final score is calculated as the average of the scores over all hypothesis sentences. Studies have shown (Semeniuta et al., 2019; Caccia et al., 2020) that BLEU can only detect small syntax problems, resulting in poor correlation with human evaluations; however, it still remains the standard when evaluating the quality of generated texts. To measure diversity we utilize Backwards-BLEU (BBLEU), where the generated texts are the reference and the test set becomes the hypothesis, giving a measure of how much the test set is represented. Additionally, we consider the Fréchet Distance with the InferSent embedding model (FID). It has been shown that FID responds better than BLEU at identifying mode collapse and changes in word usage. However, we show that it can be biased due to its distributional assumptions, mainly for differences in sentence length distribution. Nonetheless, it can be useful in identifying very homogeneous outputs, especially in conjunction with BLEU and BBLEU scores.
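As a rough sketch of how the two scores can be computed (using NLTK's corpus-level BLEU; the tiny corpora below are placeholders), BBLEU simply swaps the roles of the generated set and the test set.

    from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

    generated = [["a", "man", "riding", "a", "bike"], ["a", "dog", "on", "the", "beach"]]
    test_set = [["a", "man", "rides", "a", "bike"], ["two", "dogs", "run", "on", "the", "beach"]]
    smooth = SmoothingFunction().method1
    weights = (0.5, 0.5)  # BLEU-2

    # BLEU: every generated sentence is scored against the test sentences as references.
    bleu = corpus_bleu([test_set] * len(generated), generated, weights=weights,
                       smoothing_function=smooth)
    # Backwards-BLEU: the generated texts become the references, the test set the hypotheses.
    bbleu = corpus_bleu([generated] * len(test_set), test_set, weights=weights,
                        smoothing_function=smooth)
    print(round(bleu, 3), round(bbleu, 3))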
4.2   Datasets

We consider three of the most widely used datasets for unconditional text generation: image COCO (Chen et al., 2015), Stanford Natural Language Inference (SNLI) (Bowman et al., 2015a) and the EMNLP News 2017 dataset². Moreover, we consider the YELP review dataset for conditional text generation (Asghar, 2016). Each of the datasets presents different challenges when training: COCO and SNLI are a small and a medium-sized dataset with short sentences, respectively. EMNLP is a large dataset with longer sentences. Lastly, the YELP dataset is a conditional, medium-sized dataset with very long sentences. The preprocessing on the datasets is minimal and limited to the YELP dataset.

²http://www.statmt.org/wmt17/

                            COCO    SNLI    EMNLP    YELP
  Conditional                 no      no       no     yes
  Average sentence length   11.3     9.7     28.8    96.4
  Size of train set          10k    100k     270k    100k
  Size of dev set            10k     10k      10k     10k
  Size of test set           10k     10k      10k     10k

Table 1: Average length of the train set and number of sentences for each of the datasets used for evaluation.
Metrics        MLE      SeqGAN      RankGAN          ScratchGAN    Standard     Adaptive    OptAGAN
 BLEU-2 ↑        0.829     0.796        0.764             0.835       0.825        0.816        0.860
 BLEU-3 ↑        0.548     0.471        0.399             0.556       0.554        0.544        0.605
 BLEU-4 ↑        0.304     0.228        0.159             0.313       0.285        0.284        0.356
 BBLEU-2 ↑       0.840     0.762        0.728             0.824       0.765        0.805        0.841
 BBLEU-3 ↑       0.563     0.450        0.383             0.545       0.488        0.526        0.586
 BBLEU-4 ↑       0.317     0.221        0.163             0.303       0.261        0.278        0.350
   FID ↓         0.926     1.934        3.509             0.466       1.153        0.784        0.674

Table 2: EMNLP results of the automatic metrics for OptAGAN, the two base VAE-GAN models with standard
and adaptive updates and the other models implemented for comparison.

  Metrics        MLE      SeqGAN      RankGAN          ScratchGAN    Standard     Adaptive    OptAGAN
 BLEU-2 ↑        0.841     0.838        0.784             0.795       0.867        0.888        0.889
 BLEU-3 ↑        0.635     0.599        0.514             0.564       0.693        0.726        0.727
 BLEU-4 ↑        0.428     0.380        0.309             0.363       0.484        0.524        0.525
 BBLEU-2 ↑       0.843     0.768        0.771             0.800       0.786        0.764        0.764
 BBLEU-3 ↑       0.639     0.546        0.523             0.564       0.570        0.547        0.548
 BBLEU-4 ↑       0.433     0.347        0.347             0.362       0.374        0.361        0.362
   FID ↓         0.376     0.919        1.486             0.539       1.424        1.764        1.765

Table 3: SNLI results of the automatic metrics for OptAGAN, the two base VAE-GAN models with standard and
adaptive updates and the other models implemented for comparison.

  Metrics        MLE      SeqGAN      RankGAN          ScratchGAN    Standard     Adaptive    OptAGAN
 BLEU-2 ↑        0.854     0.866        0.834             0.862       0.917        0.919        0.920
 BLEU-3 ↑        0.664     0.667        0.641             0.678       0.792        0.792        0.794
 BLEU-4 ↑        0.459     0.452        0.423             0.478       0.627        0.617        0.620
 BBLEU-2 ↑       0.844     0.813        0.775             0.791       0.755        0.775        0.778
 BBLEU-3 ↑       0.656     0.616        0.553             0.573       0.550        0.579        0.584
 BBLEU-4 ↑       0.455     0.415        0.344             0.377       0.368        0.396        0.399
   FID ↓         0.588     0.756        2.347             1.001       2.297        1.908        1.903

Table 4: COCO results of the automatic metrics for OptAGAN, the two base VAE-GAN models with standard and
adaptive updates and the other models implemented for comparison.

4.3   Experimental results

Tables 2, 3 and 4 show the results for the quality and diversity of the models. The changes between standard and adaptive updates mostly favour the adaptive one, with larger gains in diversity, with the exception of the SNLI dataset. Additional considerations about the two updates can be found in Appendix B. Therefore, we only apply the RL finetuning on the adaptively trained model to obtain the OptAGAN results. Our approach shows improvements under all metrics, albeit small for the COCO and SNLI datasets. In comparison with the other GAN models, OptAGAN boasts the highest quality, and average or higher diversity. In comparison with the MLE model, the quality-diversity trade-off favours our model for quality, and the MLE approach for diversity. Notably, for the EMNLP dataset, the curiosity-driven finetuning allows OptAGAN to surpass all models for both BLEU and BBLEU. We believe that the difference in the magnitude of change between the EMNLP task and the COCO and SNLI ones is due to the starting quality of the model. In fact, as the RL finetuning slightly prioritizes quality, i.e. increases in BLEU score, over diversity, the actual changes in the word-by-word generation are very few for those two datasets. Compared to the other methods, ours is also better able to reproduce longer sequences of words, as the growing differences between the 2, 3 and 4 n-gram metrics show. Regarding the FID scores, they mostly show the same behaviour as BBLEU. However, we show in Section 4.5 that FID is biased by the sentence length distribution, which, in contrast with other methods such as ScratchGAN, we do not model. From the computational cost point of view, the full training of our model took at most 24 hours using a single Tesla V100 GPU. Additional details about the training can be found in Appendix A.

4.4   Conditional generation

For the conditional generation task we follow the same procedure as the unconditional one, with the only exception of the addition of a classifier network to better model the GAN generation depending on the label. The results of Table 5 are very similar to the ones of COCO and SNLI, where the entropy regularized finetuning performs slightly better than the base adaptive VAE-GAN model, with small gains in diversity and quality. We present some examples of the generated sentences in Appendix C.

  Metrics      Standard     Adaptive    OptAGAN
  BLEU-2        0.880        0.886        0.887
  BLEU-3        0.675        0.683        0.685
  BLEU-4        0.448        0.456        0.458
  BBLEU-2       0.860        0.854        0.854
  BBLEU-3       0.666        0.660        0.661
  BBLEU-4       0.453        0.449        0.451
    FID         2.598        2.539        2.562

Table 5: YELP results of the automatic metrics for the base VAE-GAN models and the entropy regularized OptAGAN.

4.5   Analysis on FID space

We discuss problems with the FID metric by showing a heatmap of the two-dimensional principal component analysis (PCA) representation and the length distribution of the sentences for the SNLI dataset. Previous works (de Masson d'Autume et al., 2020) already investigated the dependency of the FID scores on the length distribution, which can overshadow other problems with the generated samples.

We further reinforce those analyses, as a prime example of this issue can be seen in Figures 2 and 3, where the FID scores improve the closer the length distribution of the generated sentences is to the one of the test data, while ignoring the fact that the actual distribution may not match the test one. In fact, the score for the ScratchGAN model is much lower than the one for OptAGAN, although the distribution of the sentences of ScratchGAN in the space misses most of the distribution of the test sentences, as can be seen from the PCA representation. Although only 10-15% of the overall variability over the InferSent embedding dimensions is explained by PCA, there is a huge mismatch that is not addressed by the use of the Fréchet distance, which favours homogeneous length distributions over a correct representation of the space.
[Figure 2: four PCA scatter panels titled "COCO Test data", "COCO OptAGAN generated data", "COCO ScratchGAN generated data" and "COCO MLE generated data". Figure 3: length distribution plot titled "COCO length distribution of the generated sentences"; axes: Length, Density; legend: Test set, OptAGAN (1.765), ScratchGAN (0.539), MLE (0.376).]

Figure 2: PCA representation of the FID space of the SNLI test set (top left), OptAGAN (top right), ScratchGAN (bottom left) and MLE (bottom right) generated data.

Figure 3: Length distribution of the four datasets and the FID scores for the SNLI dataset.

4.6   RL finetuning experiment

In order to fully gauge the strengths and weaknesses of our entropy penalty approach, we set up an experiment where we train a GAN model using the pre-trained Optimus encoder and decoder that are not finetuned on the dataset, so that we finetune a lower quality model. We compare this model with two curiosity-driven regularized models, finetuned for 1000 and 5000 epochs, respectively. We also use a higher learning rate, as we are not interested in preserving the structure of the original sentences, and present the results in Table 6.

  Metrics      VAEGAN      RL 1000      RL 5000
  BLEU-2        0.641       0.707        0.829
  BLEU-3        0.370       0.444        0.624
  BLEU-4        0.198       0.243        0.402
  BBLEU-2       0.752       0.752        0.717
  BBLEU-3       0.478       0.497        0.479
  BBLEU-4       0.266       0.286        0.283
    FID         2.441       2.369        2.892

Table 6: Results of an experiment where we start with an unoptimized VAE-GAN model and finetune it with our RL strategy for 1000 and 5000 epochs on the COCO dataset.

The starting model achieves much worse results than the fully optimized OptAGAN, especially with regards to the BLEU score. After 1000 epochs of RL finetuning, the model improves over both quality and diversity. However, when finetuning for longer, the model can no longer balance between exploration and exploitation. We believe this behaviour is due to the bias that the model has in finding high reward tokens. The RL finetuning does not evaluate all the tokens in the vocabulary, so there might be multiple good scoring tokens which are never considered during the finetuning. This means that our approach is limited by the quality of the original model. A countermeasure that might prevent this kind of behaviour could be to increase the penalty for models with higher average entropy, and to tune its value depending on the use case, to further encourage heterogeneous generation.

5   Conclusions and discussion

In this work, we benchmark the combination of Optimus and GANs for a text VAE-GAN model, with results that already surpass current methods for the quality of generated texts. We further improve this baseline using entropy-based curiosity-driven rewards to improve both the quality and the diversity of the model. This novel approach could benefit many models utilizing RL for text generation, and supplementary research could be done into exploring advantage policy gradient, or proximal policy optimization with intrinsic rewards. This specification of the reward also allows researchers to prioritize quality, diversity, or to balance between both. Due to our limited computational resources, we utilize smaller batch sizes than we would otherwise have preferred, as larger batch sizes could help in reducing the high variance gradients of these approaches. Moreover, further research on the automatic metrics could be beneficial not only for evaluation, but also for better reward signals to improve the speed and quality of the finetuning.

References

Mehdi Allahyari, Seyed Amin Pouriyeh, Mehdi Assefi, Saeid Safaei, Elizabeth D. Trippe, Juan B. Gutierrez, and Krys J. Kochut. 2017. Text summarization techniques: A brief survey. CoRR, abs/1707.02268.

Nabiha Asghar. 2016. Yelp dataset challenge: Review rating prediction. CoRR, abs/1605.05362.

Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. CoRR, abs/1506.03099.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015a. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632-642, Lisbon, Portugal. Association for Computational Linguistics.

Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Józefowicz, and Samy Bengio. 2015b. Generating sentences from a continuous space. CoRR, abs/1511.06349.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners.

Massimo Caccia, Lucas Caccia, William Fedus, Hugo Larochelle, Joelle Pineau, and Laurent Charlin. 2020. Language gans falling short.
Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C. Lawrence Zitnick. 2015. Microsoft COCO captions: Data collection and evaluation server. CoRR, abs/1504.00325.

Cyprien de Masson d'Autume, Mihaela Rosca, Jack Rae, and Shakir Mohamed. 2020. Training language gans from scratch.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.

David Donahue and Anna Rumshisky. 2018. Adversarial text generation without reinforcement learning. CoRR, abs/1810.06640.

Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial networks.

Jiaxian Guo, Sidi Lu, Han Cai, Weinan Zhang, Yong Yu, and Jun Wang. 2017. Long text generation via adversarial training with leaked information.

Md. Akmal Haidar, Mehdi Rezagholizadeh, Alan Do-Omri, and Ahmad Rashid. 2019. Latent code and text-based generative adversarial networks for soft-text generation. CoRR, abs/1904.07293.

Yanghoon Kim, Seungpil Won, Seunghyun Yoon, and Kyomin Jung. 2020. Collaborative training of gans in continuous and discrete spaces for text generation.

Yoon Kim, Sam Wiseman, Andrew C. Miller, David Sontag, and Alexander M. Rush. 2018. Semi-amortized variational autoencoders.

Matt J. Kusner and José Miguel Hernández-Lobato. 2016. Gans for sequences of discrete elements with the gumbel-softmax distribution.

Chunyuan Li, Xiang Gao, Yuan Li, Baolin Peng, Xiujun Li, Yizhe Zhang, and Jianfeng Gao. 2020. Optimus: Organizing sentences via pre-trained modeling of a latent space.

Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky. 2016. Deep reinforcement learning for dialogue generation. CoRR, abs/1606.01541.

Kevin Lin, Dianqi Li, Xiaodong He, Zhengyou Zhang, and Ming-Ting Sun. 2017. Adversarial ranking for language generation. CoRR, abs/1705.11001.

Weili Nie, Nina Narodytska, and Ankit Patel. 2019. RelGAN: Relational generative adversarial networks for text generation. In International Conference on Learning Representations.

Xu Ouyang and Gady Agam. 2020. Accelerated wgan update strategy with loss change rate balancing.

Ofir Press, Amir Bar, Ben Bogin, Jonathan Berant, and Lior Wolf. 2017. Language generation with recurrent generative adversarial networks without pre-training. CoRR, abs/1706.01399.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Thomas Scialom, Paul-Alexis Dray, Sylvain Lamprier, Benjamin Piwowarski, and Jacopo Staiano. 2020. Coldgans: Taming language gans with cautious sampling strategies.

Stanislau Semeniuta, Aliaksei Severyn, and Sylvain Gelly. 2019. On accurate evaluation of gans for language generation.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. CoRR, abs/1706.03762.

Qingyang Wu, Lei Li, and Zhou Yu. 2021. Textgail: Generative adversarial imitation learning for text generation.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144.

Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. 2016. Seqgan: Sequence generative adversarial nets with policy gradient. CoRR, abs/1609.05473.

Junbo Jake Zhao, Yoon Kim, Kelly Zhang, Alexander M. Rush, and Yann LeCun. 2017. Adversarially regularized autoencoders for generating discrete structures. CoRR, abs/1706.04223.

Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. 2018. Texygen: A benchmarking platform for text generation models. CoRR, abs/1802.01886.

Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul F. Christiano, and Geoffrey Irving. 2019. Fine-tuning language models from human preferences. CoRR, abs/1909.08593.
A   Training details

For the finetuning of Optimus, we follow the original work and train for one epoch with the hyperparameters that give the best reconstruction quality, namely:

   • Pre-trained model epoch = 508523
   • Training epochs³ = 1
   • Learning rate = 5 · 10⁻⁵
   • Batch size = 5
   • Latent size = 768
   • β = 0
   • Annealing ratio = 0.5
   • Ratio increase = 0.25

³Due to the size of the COCO dataset, it is the only one where we finetune for 5 epochs.

We follow a similar approach for the training of the GAN part of the model, where we use standard hyperparameters for the training of WGAN-GP. The best performing epoch of the GAN, according to the sum of the BLEU and BBLEU partial scores with 500 texts for both reference and hypothesis, is saved.

   • Training epochs = 50
   • Learning rate = 10⁻⁴
   • Batch size = 256
   • Latent size = 768
   • Maximum sequence length = 100
   • Number of blocks of generator and discriminator = 10
   • Gradient penalty λ = 10
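As an illustration of how the hyperparameters above could translate into the feed-forward generator and discriminator of Section 3, the sketch below builds a simple stack of fully connected blocks; the exact block definition, activation and noise dimension are assumptions, and only the latent size of 768 and the number of blocks come from the list.

    import torch.nn as nn

    def feed_forward(in_dim, out_dim, n_blocks=10, width=768):
        """Stack of simple fully connected blocks, as assumed for both G and D."""
        layers, dim = [], in_dim
        for _ in range(n_blocks):
            layers += [nn.Linear(dim, width), nn.ReLU()]
            dim = width
        layers.append(nn.Linear(dim, out_dim))
        return nn.Sequential(*layers)

    generator = feed_forward(in_dim=100, out_dim=768)     # noise -> latent embedding
    discriminator = feed_forward(in_dim=768, out_dim=1)   # latent embedding -> critic score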
                                                               Figure 5: Differences in BLEU scores during training
Finally, these are the details for the entropy regularized finetuning. Empirical results showed next to no difference for the BLEU n-gram choice, so we choose 1-gram due to slightly faster computational times, which also translates into a clearer understanding of the intermediate values. Moreover, we use a small learning rate, in order to keep the same structure as the original sentences.

   • BLEU reward n-grams = 1
   • Finetuning epochs = 1000
   • Learning rate = 10⁻⁶
   • Batch size = 32
   • Epochs value head pre-training = 200
   • Learning rate value head pre-training = 10⁻⁴

B   Adaptive strategy details

Figures 4 and 5 show the validation BLEU and BBLEU for the COCO and YELP datasets. We show the results over these two datasets due to the stark difference between them.

[Figure 4: line plot "BLEU scores for different update strategies" over epochs 0-50; y-axis: BLEU scores; legend: BLEU - standard, B-BLEU - standard, Sum - standard, BLEU - adaptive, B-BLEU - adaptive, Sum - adaptive.]

Figure 4: Differences in BLEU scores during training for the standard and adaptive update in GANs, evaluated on the COCO validation set.

[Figure 5: same plot for the YELP validation set.]

Figure 5: Differences in BLEU scores during training for the standard and adaptive update in GANs, evaluated on the YELP validation set.

It is evident that the adaptive strategy is slightly slower to converge. Since the YELP dataset is 10 times larger than COCO, in both cases after about 200,000 samples the two models reach the same quality. The remainder of the training appears to be very stable for the standard update, with little to no changes in both scores. Meanwhile, the adaptive updates look slightly more volatile, with changes that impact the two scores both positively and negatively, usually following the quality-diversity trade-off principle.

C   Generated samples

We include several randomly sampled sentences from OptAGAN for each dataset:
Dataset      Sentences
               a man posing with a bike inside of a forest .
  COCO         a man sitting on a swinging chair pulled by some purple ducks .
               a woman holding a bird over some flowers on the beach .
               the man is sweating because they are blue .
   SNLI        the man is getting more thank with the dog .
               a man in white clothes stands next to a marketplace where he can store plastic .
               Republican presidential nominee , Donald Trump , has said that 20 to 30 years might
               be the way to try and narrow it out .
 EMNLP         Whether or not agenda minutes can deliver , it would , therefore , encourage the
               majority of Scottish MPs to think about that .
               Earlier in the day , Bell travelled to Sydney ’ s Supreme Court and was effectively
               blocking a vote of no - one that would produce the album .

Table 7: Examples of the generated unconditional sentences from OptAGAN trained on COCO, SNLI and EMNLP
dataset.

 Stars      Generated reviews
   1        horrible experience at this mcdonalds . i have no idea what they are trying to sell
            you , but if you want anything other than vanilla bean puree and no rice at all , do
            yourself a favor and go here . for the most part , the waiters do not give a shit .
   2        i have been to some great buffets , but this was mediocre at best . my husband ordered
            a turkey sandwich and it was just like anyone else’s sandwich . for the service , there
            wasn’t much seating . the food and the _UNK were stale , awful bread . something
            to do if you go to vegas for dinner and want to have some classic awesomeness ...
            maybe try a dim sum instead .
   3        this place is pretty good . i like the burgers , the selection is pretty good and they
            have a ginormous amount of steak . being a vegetarian , i took one bite of everything
            i ordered and went back again . for the friday afternoon rush - they brought me the
            french fries instead of the turkey , a mousse and mgr .
   4        loved this place . i had the corned beef burrito , which was very good , though a bit
            greasy and lacking . they have a good selection of veggie options and happy hour
            specials and are very attentive to your meal . it’s clean , spacious , and the ambiance
            is great . i have definitely come here when my friends visit vegas to see if they have
            any other option . the best part about going here is the sitting area outside where you
            can hang out while eating all you could eat .
   5        love this place . i have visited all of the great restaurants that offer all sorts of flavors
            , and to top it all off , all the facilities are extremely clean . _UNK is the guy . he
            came for a quick check up , took my dad to the bar and came out free of charge !!
            seriously !

Table 8: Examples of the generated sentences from OptAGAN trained on the YELP dataset conditioned on the
label.