Neural-Driven Search-Based Paraphrase Generation

Betty Fabre (Orange Labs, Lannion; Univ Rennes, IRISA) betty.fabre@orange.com
Jonathan Chevelu (Univ Rennes, IRISA, Lannion) jonathan.chevelu@irisa.fr
Tanguy Urvoy (Orange Labs, Lannion) tanguy.urvoy@orange.com
Damien Lolive (Univ Rennes, IRISA, Lannion) damien.lolive@irisa.fr

Abstract

We study a search-based paraphrase generation scheme where candidate paraphrases are generated by iterated transformations from the original sentence and evaluated in terms of syntax quality, semantic distance, and lexical distance. The semantic distance is derived from BERT, and the syntax quality is based on GPT2 perplexity. To solve this multi-objective search problem, we propose two algorithms: Monte-Carlo Tree Search for Paraphrase Generation (MCPG) and Pareto Tree Search (PTS). We provide an extensive set of experiments on 5 datasets with a rigorous reproduction and validation of several state-of-the-art paraphrase generation algorithms. These experiments show that, although not explicitly supervised, our algorithms perform well against these baselines.

1 Introduction

Paraphrase generation, i.e. the transformation of a sentence into a well-formed but lexically different one while preserving its original meaning, is a fundamental task of NLP. Its ability to provide diversity and coverage finds applications in several domains like question answering (McKeown, 1979; Harabagiu and Hickl, 2006), machine translation (Callison-Burch et al., 2006), dialog systems (Yan et al., 2016), privacy (Gröndahl and Asokan, 2019a) or adversarial learning (Iyyer et al., 2018).

The formal definition of paraphrase may vary according to the targeted application and the tolerance we set along several axes, including the semantic distance from the source, which we want to minimize, the quality of the syntax, and the lexical distance from the source, which we want to maximize to ensure diversity.

The available aligned paraphrase corpora are often biased toward specific problems like question answering or image captioning. For instance, a transformer or a seq2seq model trained on a question answering corpus will typically turn any input sentence into a question. With the lack of generic aligned datasets, it remains challenging to train generic paraphrase models in a supervised manner.

On the other hand, with the availability of large-scale non-supervised language models like BERT and GPT2, the assessment of a given candidate paraphrase in terms of semantic distance from its source and lexical quality has become much more tractable.

Leveraging these metrics, we propose to cast the paraphrase generation task as a multi-criteria search problem. We use PPDB 2.0 (Pavlick et al., 2015), a large-scale database of rewriting rules derived from bilingual corpora, to potentially generate billions of 'naive' candidate paraphrases by edition from the source sentence. To sort the good candidates efficiently from the others, we experiment with two search algorithms. The first one, called Monte-Carlo Paraphrase Generation (MCPG), is a variant of the Monte-Carlo Tree Search algorithm (MCTS) (Kocsis and Szepesvári, 2006; Gelly and Silver, 2007; Chevelu et al., 2009). The MCTS algorithm is famous for its successes in mastering the highly combinatorial game of Go (Gelly and Silver, 2007; Silver et al., 2016).

The second one is a novel search algorithm that we call Pareto Tree Search (PTS). In contrast to MCTS, which is a single-criterion search algorithm, we designed PTS to retrieve an approximation of the whole Pareto optimal set. This allows for more flexibility in paraphrase generation, where the balance between semantic distance, syntax quality, and lexical distance is hard to tune a priori. Another difference between MCPG and PTS is that PTS uses a randomized breadth-first exploration policy, which proves to be more efficient on this problem.

The main contribution of this article is a study on search-based paraphrase generation through the sieve of three criteria: semantic similarity, syntax quality, and lexical distance. We propose and evaluate two search algorithms: MCPG and PTS. We also provide an extensive set of experiments on English datasets with a rigorous reproduction and validation methodology for several state-of-the-art paraphrase generation algorithms.

This article is organized as follows: The candidate paraphrase generation scheme is presented in Section 2. In Section 3, we develop the different criteria we use to qualify correct paraphrases. The two search algorithms (MCPG and PTS) are described in Section 4. Section 5 gives a survey of other state-of-the-art paraphrase generation algorithms. The comparisons with non-supervised and supervised baselines are presented in Section 6, and the methodology is discussed in Section 6.3. We conclude with a data-augmentation experiment in Section 6.6.
2 Paraphrase generation scheme

We model paraphrase generation as a sequence of editions and transformations from a source sentence into its paraphrase. In this work, we only consider local transformations, i.e. the replacement of certain words or groups of words by others that have the same or similar meanings, but the method we propose should work with more sophisticated transformations as well.

The Paraphrase Database (PPDB) (Ganitkevitch et al., 2013; Pavlick et al., 2015) is a large collection of scored paraphrase rules that was automatically constructed from various bilingual corpora using a pivot alignment method (Callison-Burch, 2008). The database is divided into increasingly large and decreasingly accurate subsets. PPDB is available on http://paraphrase.org, and a Python interface is available at https://github.com/erickrf/ppdb. We used the XL subset, and we removed the rules labeled as "Independent". This left us with a set of 5.5 million rewriting rules. We give some examples of these rules in Table 1.

Source sentence: he is speaking on june 14 .

  PPDB rule                   Edited sentence
  is → is found               he is found speaking on june 14 .
  is speaking → 's talking    he 's talking on june 14 .
  speaking → speak now        he is speak now on june 14 .
  14 → 14th                   he is speaking on june 14th .

Table 1: PPDB rules applied to a source sentence

By iteratively applying the rules from a source sentence like the one in Table 1, we obtain a vast lattice of candidate paraphrases. Some of these candidates, like "he 's talking on june 14", are well-formed, but many are syntactically broken, like "he is speak now on june 14".

The number of rules that apply depends on the source sentence's size and the words it contains. For instance, on the MSRPARAPHRASE dataset (see Section 6.2), sentences are quite long and the median number of PPDB-XL rules that apply is around 450. After two rewriting steps, the median number of candidates is around 10^5, and by iterative rewriting, we quickly reach a number of paraphrase candidates greater than 10^8.
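For illustration, here is a minimal Python sketch of one rewriting step over a PPDB-like rule set. The toy rule list and the whitespace tokenization are simplifying assumptions; the actual system draws on the 5.5 million PPDB-XL rules.

    # Minimal sketch of one rewriting step; the rule list is a toy stand-in
    # for PPDB-XL and the whitespace tokenization is a simplification.
    def one_step_candidates(sentence, rules):
        """Apply every matching substitution rule once and collect the edits."""
        tokens = sentence.split()
        candidates = set()
        for lhs, rhs in rules:
            pattern = lhs.split()
            for i in range(len(tokens) - len(pattern) + 1):
                if tokens[i:i + len(pattern)] == pattern:
                    edited = tokens[:i] + rhs.split() + tokens[i + len(pattern):]
                    candidates.add(" ".join(edited))
        return candidates

    rules = [("is", "is found"), ("is speaking", "'s talking"),
             ("speaking", "speak now"), ("14", "14th")]
    layer = one_step_candidates("he is speaking on june 14 .", rules)

Re-applying this step to every candidate of the previous layer reproduces the combinatorial growth described above: each layer multiplies the number of reachable sentences.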

3 Paraphrase selection criteria

Because it depends on the type of text we consider, which may be spoken or written, casual or formal, it is not easy to define a universal semantic distance or a universal scale of well-formed syntax. However, recent advances in NLP with neural networks like BERT and GPT2, trained on huge corpora, have led to the development of metrics that can act as good proxies for these ideal notions.

For the semantic distance, a quick experiment confirms that the BERT score (Zhang et al., 2019a) performs well on difficult paraphrase identification tasks. The BERT score is an F1-measure over an alignment of the BERT contextual word embeddings of the two sentences. To assess the sensitivity of this score, we computed the Area Under the ROC curve (AUC) on QQP and PAWS, two difficult paraphrase identification corpora (Kornél Csernai, 2017; Zhang et al., 2019b). We obtained 75.2% on QQP and 67.0% on PAWS. The PAWS corpus being designed to trick paraphrase identification classifiers, 67.0% is a reasonable performance. We hence opted for the BERT score between the source sentence and the paraphrase candidate (denoted BERTS) as our semantic score.

Regarding the syntax quality, the perplexity of GPT2 (Radford et al., 2019) is a good ranking criterion, although, as illustrated in Table 2, in some cases a rule-based spell-checker may detect errors that GPT2 would miss (and the reverse is also true). We hence opted for GPT2 as the primary criterion for syntax quality, combined with the LANGUAGETOOL spell-checker (Naber, 2003), which we only used in a second stage for performance reasons.

The lexical distance is important to ensure the diversity of the produced paraphrases. It is, however, simple to handle. Some authors use the BLEU surface metric (Miao et al., 2018a); we opted here for the normalized character-level Levenshtein edit distance.
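As an illustration of how the three criteria can be computed, here is a rough Python sketch. It is a sketch under assumptions: the bert-score and transformers packages stand in for the exact checkpoints and scripts of the paper, and the GPT2 criterion is taken as the mean per-token negative log-likelihood rather than a specific perplexity normalization.

    # Sketch of the three selection criteria; the library choices are
    # assumptions, not the paper's exact setup.
    import torch
    from bert_score import score as bert_score           # pip install bert-score
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    gpt2 = GPT2LMHeadModel.from_pretrained("gpt2").eval()
    gpt2_tok = GPT2TokenizerFast.from_pretrained("gpt2")

    def semantic_score(source, candidate):
        """BERTS: F1 BERT score of the candidate against the source."""
        _, _, f1 = bert_score([candidate], [source], lang="en")
        return f1.item()

    def syntax_score(sentence):
        """GPT2 criterion: mean negative log-likelihood per token (lower is better)."""
        ids = gpt2_tok(sentence, return_tensors="pt").input_ids
        with torch.no_grad():
            return gpt2(ids, labels=ids).loss.item()

    def lexical_distance(source, candidate):
        """LevS: character-level Levenshtein distance, normalized by the longer string."""
        a, b = source, candidate
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                  # deletion
                               cur[j - 1] + 1,               # insertion
                               prev[j - 1] + (ca != cb)))    # substitution
            prev = cur
        return prev[-1] / max(len(a), len(b), 1)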
The balance between these criteria is difficult to obtain. Table 2 illustrates their impact on a sentence taken from the train set of MSRPARAPHRASE. The candidate examples in this table underline the tough dilemma between maximizing the semantic similarity (a safe and conservative policy) and maximizing the lexical distance (a risk-prone one). The third and fourth examples underline the utility of the spell-checker: some low-perplexity candidates are ill-formed. In the second part of the table, we print the sentences chosen by our two models, MCPG and PTS.

4 Searching algorithms

Searching for good paraphrases in the large lattice of candidates generated by PPDB is a costly task. We propose two algorithms that share a similar structure: an outer loop explores the lattice at different depths, while an inner loop explores the candidates at each depth. Both algorithms are anytime: they return the best solutions found so far when the time or space budget is depleted.
4.1 MCPG: Monte-Carlo tree search for Paraphrase Generation

Following the idea of Chevelu et al. (2009), we used Monte-Carlo Tree Search (MCTS) to explore the PPDB lattice. The three key ingredients of MCTS are: a bandit policy at each node of a search tree to select the most promising paths, randomized roll-outs to estimate the quality of these paths, and back-propagation of rewards along the paths to update the bandits. We opted here for a randomized bandit policy called EXP3 (Auer et al., 2002). The MCTS algorithm not being designed for multi-objective problems, we needed to combine the semantic similarity BERTS, the syntax correctness GPT2, and the surface diversity LevS into a single criterion. We opted for the following polynomial:

    α · BERTS + β · LevS · BERTS − γ · GPT2    (1)

where the product LevS · BERTS is intended to avoid a trivial maximization of the score by applying a lot of editions to the source sentence. After a few experiments on train sets, we empirically tuned the weights to α = 3, β = 0.5 and γ = 0.025 in order to obtain a balance like the one described in Table 2.
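For illustration, the combined criterion (1) and a minimal EXP3 bandit, as one might plug into each node of the search tree, could look as follows. This is a sketch: the arm indexing and the rescaling of rewards to [0, 1], which EXP3 requires, are left to the caller and are our assumptions.

    import math, random

    ALPHA, BETA, GAMMA = 3.0, 0.5, 0.025   # empirically tuned weights from Section 4.1

    def combined_score(bert_s, lev_s, gpt2):
        """Single-criterion reward (1)."""
        return ALPHA * bert_s + BETA * lev_s * bert_s - GAMMA * gpt2

    class Exp3:
        """Minimal EXP3 bandit (Auer et al., 2002); expects rewards rescaled to [0, 1]."""
        def __init__(self, n_arms, gamma=0.1):
            self.w = [1.0] * n_arms
            self.gamma = gamma

        def probs(self):
            total, k = sum(self.w), len(self.w)
            return [(1 - self.gamma) * wi / total + self.gamma / k for wi in self.w]

        def draw(self):
            return random.choices(range(len(self.w)), weights=self.probs())[0]

        def update(self, arm, reward):
            # importance-weighted reward estimate, exponential weight update
            p = self.probs()[arm]
            self.w[arm] *= math.exp(self.gamma * reward / (p * len(self.w)))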
4.2 PTS: Pareto Tree Search in the paraphrase lattice

We observed two drawbacks of MCTS. First, it was designed for combinatorial problems like Go where the evaluation is only possible on the leaves of the search tree. This is not the case for paraphrase generation, where the neural models can evaluate any rewriting step and where rewriting from good candidates is more likely to provide good paraphrases than rewriting from bad ones. Secondly, it was designed for single-criterion search, which requires fixing the balance between criteria definitively before any paraphrase search begins. This is not very flexible, and it becomes painful when we want to generate sets of candidates.

By plotting the distributions of the scores as in Figure 1, we noticed that most of the candidates were dominated in the Pareto sense: it was possible to eliminate most of the candidates without any hyper-parameter tuning. Hence, we adapted MCPG to explore the paraphrase lattice and recover an approximation of the Pareto front, postponing the balance between criteria to a quick post-optimization stage. This led us to the PTS algorithm described as pseudo-code in Table 3.

[Figure 1: scatter plot of the candidates in the (Levenshtein distance, BERT score) plane, colored by GPT2 perplexity, with the Pareto front and the isolines of the combined score (1) highlighted.]

Figure 1: The cloud of candidates generated from the sample of Table 2. The optima of any positive combination of BERT score, normalized Levenshtein distance, and GPT2 perplexity belong to the Pareto front (orange dots). We plotted the projections of the MCPG combined score (1) with dashed isolines. The BERT score and Levenshtein distance being clearly anti-correlated, the balance between these two criteria is difficult to tune.
Source: the agency could not say when the tape was made , though the voice says he is speaking on june 14 .

Policy: high BERT, conservative (BERTS 0.99, GPT2 4.23, LevS 0.04, Errs 0)
  the agency could not say when the tape was made , though the voice says he is talking on june 14 .

Policy: high Lev, risk-prone (BERTS 0.60, GPT2 7.83, LevS 1.0, Errs 0)
  the organizations and agencies could just 'm saying whenever the tape-based was provided presentation there are , however the express opinions informed that he was found talking pertaining end-june fourteen .

Policy: high GPT2, bad syntax (BERTS 0.82, GPT2 7.10, LevS 0.55, Errs 2)
  the agencies was not able say when the tape been given currently undertaking though the voice indicated he talking on june 14 .

Policy: spell & grammar errors (BERTS 0.73, GPT2 5.49, LevS 0.79, Errs 2)
  the organisation are incapable of 're saying when the tape is set out , despite the fact that the voice just said he have a conversation on june 14 .

Policy: balanced (BERTS 0.95, GPT2 4.34, LevS 0.28, Errs 0)
  the organization could not say when the tape was made , although the voice indicates that he is talking on june 14 .

MCPG output (BERTS 0.96, GPT2 4.36, LevS 0.26, Errs 0)
  the organization could not say when the tape was made , though the voice indicates that he is talking on june 14 .

PTS output (BERTS 1.00, GPT2 4.17, LevS 0.02, Errs 0)
  the agency could not say when the tape was made , although the voice says he is speaking on june 14 .

Target (BERTS 0.87, GPT2 4.77, LevS 0.22, Errs 0)
  the agency could put no exact date on the tape , though the voice says he is speaking on june 14 .

Table 2: An example of a source sentence sampled from the MSRPARAPHRASE train set with some representative candidates from its PPDB rewriting graph. For each candidate, we computed the BERT score with respect to the source (BERTS), the normalized GPT2 perplexity (GPT2), the Levenshtein distance from the source sentence (LevS), and the number of spell and grammar errors detected (Errs). The first candidate maximizes the semantic similarity (BERT score) and is very conservative. The second one maximizes the surface diversity (LevS) and takes a lot of risks. The third one shows an ill-formed paraphrase candidate. The fourth candidate emphasizes the utility of the spell-checker. The fifth candidate achieves our equilibrium goal. In the second part, we show the paraphrases generated by our models MCPG and PTS, and in the third part, we give the reference paraphrase from the dataset.

function PTS(input sentence)
    candidates ← REWRITE(input sentence)
    depth ← 1
    while time/space < budget do
        while time/space < layer-budget do
            batch ← SAMPLE(candidates)
            scored ← scored ∪ NN.SCORES(batch)
        end while
        layer front set ← PARETO-FRONT(scored)
        candidates ← REWRITE(layer front set)
        depth ← depth + 1
    end while
    return PARETO-FRONT(all scored nodes)
end function

Table 3: Pareto Tree Search (PTS) algorithm
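To make the PARETO-FRONT step concrete, here is a minimal Python sketch of the non-dominated filtering it performs. Representing each candidate by a tuple in which every coordinate is to be maximized, for instance (BERTS, LevS, −GPT2), is our convention here, not notation from the paper.

    def pareto_front(scored):
        """Keep the candidates that no other candidate dominates.

        `scored` maps each sentence to a tuple of criteria where every
        coordinate is to be maximized, e.g. (BERT_S, Lev_S, -GPT2).
        """
        def dominates(u, v):
            return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))

        front = []
        for cand, u in scored.items():
            if not any(dominates(v, u) for _, v in front):
                # drop previously kept points that the new one dominates
                front = [(c, v) for c, v in front if not dominates(u, v)]
                front.append((cand, u))
        return [c for c, _ in front]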
5 Related work

Rule-based and statistical approaches. Following the path of machine translation, the paraphrase generation literature first evolved from laboriously handcrafted linguistic rules (McKeown, 1979; Meteer and Shaked, 1988; Chandrasekar and Srinivas, 1997; Carroll et al., 1999) to more automatic, data-driven rule extraction methods (Callison-Burch et al., 2006; Madnani and Dorr, 2010). As in machine translation, phrase-level substitution rules can be extracted by sub-sentence alignment algorithms from parallel corpora (Brown et al., 1993). Building such a dedicated corpus being a long and costly task, one usually transforms other corpora through a "pivot representation" (Barzilay and McKeown, 2001; Callison-Burch et al., 2006; Ganitkevitch et al., 2013; Chen et al., 2015). The weakness of these approaches is that phrase-level rewriting rules alone are not able to build coherent sentences. A typical data-driven paraphrase generator used to be a mixture of potentially noisy handcrafted and data-driven rewriting rules coupled with a score that had to be optimized in real time through dynamic programming. However, dynamic programming methods like Viterbi are constrained by the requirement of a score that decomposes into a sum of word-level or phrase-level criteria (Xu et al., 2016). Some attempts were made to relax this constraint with search-based approaches (Chevelu et al., 2009; Daumé et al., 2009), but the globally optimized criteria were simplistic, and the obtained solutions were not suitable for practical deployment.
Supervised encoder-decoder approaches. Like machine translation, paraphrase generation benefited from deep neural networks and evolved to efficient end-to-end architectures that can both learn to align and translate (Bahdanau et al., 2016; Vaswani et al., 2017). Several papers (Prakash et al., 2016a; Cao et al., 2017) cast the paraphrase generation task as a supervised sequence-to-sequence problem. As confirmed by our experiments in Section 6.4, this approach is efficient for specific types of paraphrases. It is also able to produce relatively long-range transformations, but it requires huge and high-quality sentence-level aligned datasets for training.

The paraphrase generation literature mostly reports results on the MSCOCO (Chen et al., 2015) and QQP (Kornél Csernai, 2017) datasets, which are built respectively from image captions and semi-duplicate questions. These datasets are very specific: MSCOCO is strongly biased toward image description sentences, and QQP is dedicated to questions. The Transformer model we trained on QQP typically transforms any input into a question. For instance, from "He is speaking on june 14." it gives "Who is speaking on june 14?".
Generative and hybrid approaches. Recent approaches rely on conditional generative models like CVAEs to allow the generation of paraphrase sets and palliate the lack of supervision data (Gupta et al., 2018; Yang et al., 2019a; Roy and Grangier, 2019; An and Liu, 2019). Others make use of CGANs combined with reinforcement learning (Li et al., 2018; Wang and Lee, 2018). Witteveen and Andrews (2019) propose to simply fine-tune a large non-supervised model like GPT2.

Recent text-generation architectures like the ones proposed by Moryossef et al. (2019) and Fu et al. (2019) make a clear separation between a rule-based planning phase and the neural realization. A similar idea was tested by Huang et al. (2019), where PPDB rules are used to control the decoder of a CVAE.

Search-based approaches. Search-based methods regained interest in the text generation community for several reasons, including the need for flexibility and the fact that, with deep neural networks, the search evaluation criteria have become more reliable. These methods are often slower than auto-regressive text generation methods, but it is always possible to distill the models into faster ones, as we do in Section 6.6. Gröndahl and Asokan (2019b) deployed a search-based policy for style imitation; Schwartz and Wolter (2018) and Kumagai et al. (2016) both used MCTS for text generation. Following the same trend, Miao et al. (2018a) proposed to use Metropolis-Hastings sampling (Metropolis et al., 2004) for constrained sentence generation in an algorithm called CGMH.

Starting from the source sentence, the CGMH algorithm samples a sequence of sentences by using local editions: word replacement, deletion, and insertion. For paraphrase generation, CGMH constrains the sentence generation using a matching function that combines a measure of semantic similarity and a measure of English fluency. This model is therefore directly comparable with our MCPG and PTS approaches.
6 Experiments

The evaluation metrics and datasets are described in Sections 6.1 and 6.2 respectively. We paid attention to setting up a rigorous validation protocol for our experiments. The reproducibility and methodology issues that we faced are discussed in Section 6.3. We compare our models with state-of-the-art supervised methods and with CGMH, another search-based algorithm. The technical details of these algorithms are developed in Section 6.4. The results are detailed in Section 6.5.

6.1 Evaluation metrics

We rely on standard machine translation metrics that compare the generated paraphrase to one or several ground-truth references (Olive et al., 2011). We report the surface metrics BLEU and TER and the semantic metrics METEOR and BERT, the average BERT score of the generated sentence with respect to the reference sentences. This usage of the BERT score against the reference must not be confused with the BERT score against the source (BERTS) that we use in (1). We used the scripts from github.com/jhclark/multeval and github.com/Tiiiger/bert_score; for the BERT score we used 'bert-base-uncased L8 no-idf version=0.1.2'.

6.2 Evaluation datasets

Table 4 gives a summary of the datasets we used. On one hand, we have large datasets like MSCOCO or OPUSPARCUS (OPUSPAR.) that are noisy and very specific. On the other hand, we have MSRPARAPHRASE (MSRPARA.), a high-quality but small human-labeled dataset.

[Table 4: summary of the evaluation datasets (Corpus, Size, Len, B>0.75, ...)]
6.4 Baseline systems implementation

In the next subsections, we present the re-implemented encoder-decoder neural network architectures and the weakly-supervised paraphrase generator used as baselines in our experiments.

Supervised paraphrase generators. As supervised baselines, we trained three neural network architectures that were previously reported to achieve good results on MSCOCO and QQP: a Seq2Seq architecture, a Residual LSTM architecture (Prakash et al., 2016b), and a TRANSFORMER model (Egonmwan and Chali, 2019a). We extended the experiments to the other aligned corpora: MSRPARAPHRASE, OPUSPARCUS and PAWS.

To be more precise, we trained a 4-layer LSTM Seq2Seq with a bidirectional encoder and a decoder using attention. This architecture is reported as SEQ2SEQ in the results. We trained a 4-layer Residual LSTM Seq2Seq as introduced by Prakash et al. (2016b) and reproduced by Fu et al. (2019). This architecture is reported as RESIDUAL LSTM in the results; the results we obtained with this model are close to the ones reported by Fu et al. (2019). Finally, we trained a TRANSFORMER using the transformer base hyper-parameter set from Vaswani et al. (2017). This architecture is reported as TRANSFORMER BASE in the results.

For all the encoder-decoder experiments, we used the fairseq framework (Ott et al., 2019), which implements the SEQ2SEQ and TRANSFORMER architectures; we added our own implementation of the RESIDUAL LSTM architecture. For preprocessing, we used the Moses tokenizer and subword segmentation following Sennrich et al. (2016b), using the subword-nmt library. The maximum sentence length is set to 1024 tokens, which is the default setting in fairseq. For decoding, we did a beam search with a beam of size 5.

Weakly-supervised paraphrase generator. For the weakly-supervised strategy CGMH introduced by Miao et al. (2018b), we used the official code. We managed to reproduce their results on QQP: on our test set we achieve a BLEU score of 22.5, while they reported 18.8. We then extended the experiment to other datasets and metrics.
6.5 Results

Table 6 summarizes the comparison of our models, the supervised encoder-decoder neural networks, and the weakly-supervised method CGMH. Overall, these results are mixed; it is however important to keep in mind that, contrary to the supervised baselines, which are retrained for each dataset, the parameters of the CGMH, MCPG and PTS models are left unchanged.

On the MSCOCO and QQP datasets, the supervised baselines achieve clearly better results, but MCPG and PTS achieve better results on OPUSPARCUS and PAWS, except for the BERT score, on which the TRANSFORMER model achieves similar results. On MSRPARAPHRASE, the encoder-decoder neural network models perform poorly. This result can be explained by the small number of training examples available in this corpus (see Table 4). On the weakly-supervised side, the MCPG and PTS models outperform the CGMH baseline on all corpora except the MSCOCO dataset, where the results are similar.

These results show that even without specialized training sets, generic search-based methods are competitive for paraphrase generation. However, encoder-decoder networks have excellent performance for text generation and have the potential to generate more complex paraphrases than those obtained by simple local transformations as in our models. Training a general, all-purpose paraphrase generation network would require a huge volume of data, and there are still far fewer aligned corpora available for paraphrasing than for translation.

6.6 Data-augmentation experiment

In order to get the best of both worlds, one option is to enrich the training set of a TRANSFORMER with the results of a search-based method. To test this idea, we used our models to augment the MSRPARAPHRASE training set. For that purpose, we created new pairs of paraphrases from unused sentences of MSRPARAPHRASE (the pairs labeled as "not paraphrases") using MCPG and PTS. We then trained new supervised TRANSFORMER models on the augmented training sets.

Having no guarantee that our models generate syntactically perfect sentences, we inverted the pairs, thus taking the generated paraphrases as input and the dataset's sentences as output. This trick, called back-translation, forces the target model to generate correct sentences (Sennrich et al., 2016a; Edunov et al., 2018). We report the results of this experiment in Table 7: the models trained with the augmented training sets achieved a significant performance gain on BLEU, TER and METEOR.
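A minimal sketch of this augmentation step, where generate() is a hypothetical stand-in for an MCPG or PTS run, is given below.

    # Building augmented training pairs in the back-translation direction:
    # generated paraphrase as input, original human sentence as output.
    # `generate` is a hypothetical stand-in for an MCPG or PTS run.
    def augment(unused_sentences, generate):
        pairs = []
        for sentence in unused_sentences:
            for paraphrase in generate(sentence):
                # inverted pair: the model learns to map noisy candidates
                # back to a well-formed, human-written sentence
                pairs.append((paraphrase, sentence))
        return pairs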

Corpus            Model               BLEU ↑   TER ↓   METEOR ↑   BERT ↑
MSCOCO            SEQ2SEQ              27.5     62.3     24.3      0.76
                  RESIDUAL LSTM        26.9     63.3     24.2      0.76
                  TRANSFORMER BASE     26.9     63.3     24.2      0.76
                  CGMH                 17.3     72.6     21.9      0.70
                  MCPG                 16.5     73.5     23.2      0.71
                  PTS                  17.0     69.9     22.8      0.64
QQP               SEQ2SEQ              29.2     60.3     30.7      0.80
                  RESIDUAL LSTM        28.4     59.1     30.2      0.80
                  TRANSFORMER BASE     29.1     59.5     30.5      0.80
                  CGMH                 22.5     65.0     27.0      0.72
                  MCPG                 24.1     64.5     31.8      0.78
                  PTS                  25.6     58.7     31.4      0.78
OPUSPARCUS        SEQ2SEQ               8.4     79.3     13.8      0.69
                  RESIDUAL LSTM         8.1     78.6     14.3      0.70
                  TRANSFORMER BASE      8.1     84.3     13.9      0.70
                  CGMH                  7.6     78.9     16.8      0.58
                  MCPG                  9.6     78.6     23.3      0.67
                  PTS                   9.1     70.2     22.1      0.66
PAWS              SEQ2SEQ              44.0     36.8     39.2      0.92
                  RESIDUAL LSTM        43.6     37.1     38.9      0.92
                  TRANSFORMER BASE     42.4     37.6     38.9      0.92
                  CGMH                 15.4     58.1     20.7      0.61
                  MCPG                 55.5     24.3     49.2      0.93
                  PTS                  57.9     21.9     48.5      0.92
MSRPARAPHRASE     SEQ2SEQ              11.6     89.5     12.6      0.53
                  RESIDUAL LSTM        10.5     93.7     11.2      0.52
                  TRANSFORMER BASE     20.7     76.7     21.6      0.65
                  CGMH                  9.7     72.9     15.4      0.48
                  MCPG                 39.3     52.4     37.2      0.81
                  PTS                  40.3     48.4     36.1      0.80

Table 6: Experiments summary. The symbol '↑' means that a higher value is better, and '↓' that a lower value is better. MCPG and PTS outperform the state-of-the-art models on OPUSPAR., PAWS and MSRPARA. Contrary to the supervised baselines, the parameters of the MCPG and PTS models are left unchanged across datasets.

Train set        BLEU     TER    METEOR   BERT
ORIG.            20.7     76.7    21.6    0.65
ORIG. + MCPG     25.3     69.6    24.5    0.63
ORIG. + PTS      25.3     71.1    24.6    0.63

Table 7: Data-augmentation experiment summary. We trained a TRANSFORMER base model on three versions of the MSRPARAPHRASE train set: the original train set (ORIG.) and the original train set extended by paraphrasing other sentences from the same distribution with the MCPG (ORIG. + MCPG) and PTS (ORIG. + PTS) models.

7 Conclusion

We experimented with two search-based approaches for paraphrase generation. These approaches are pragmatic and flexible. Being generic, our approaches did not overfit on small datasets. We performed extensive experiments with a rigorous evaluation methodology that we applied both to our algorithms and to the other related methods, which we tried to reproduce faithfully. These experiments confirm that our two methods, namely MCPG and PTS, are comparable to supervised state-of-the-art baselines despite being less tightly supervised. When compared with CGMH, another search-based and weakly-supervised method, our algorithms proved to be faster and more efficient.

We plan to refine the scoring with deep reinforcement learning techniques and to enrich the edition rules with more sophisticated patterns like phrase permutations. Our search algorithms remain too slow for real-time deployment: the current versions are better suited as offline models for data augmentation. The experiment of Section 6.6 confirms that this application of search-based methods is a promising research avenue. A planning-then-realization hybridization like the one proposed by Moryossef et al. (2019) and Fu et al. (2019) could also be considered in future work.

Acknowledgment

We want to thank the reviewers for their constructive comments. This work was funded by the ANR Cifre convention N°2018/1559 and Orange Labs Research.

References                                                Jonathan Chevelu, Thomas Lavergne, Yves Lepage,
                                                            and Thierry Moudenc. 2009. Introduction of a new
Zhecheng An and Sicong Liu. 2019. Towards Diverse           paraphrase generation tool based on Monte-Carlo
  Paraphrase Generation Using Multi-Class Wasser-           sampling. In Proceedings of the ACL-IJCNLP 2009
  stein GAN. CoRR, abs/1909.13827.                          Conference Short Papers, pages 249–252, Suntec,
Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and          Singapore. Association for Computational Linguis-
  Robert E. Schapire. 2002. The Nonstochastic Mul-          tics.
  tiarmed Bandit Problem. SIAM Journal on Comput-
  ing, 32(1):48–77.                                       Mathias Creutz. 2018. Open Subtitles Paraphrase Cor-
                                                           pus for Six Languages. In Proceedings of the
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben-           Eleventh International Conference on Language Re-
  gio. 2016. Neural Machine Translation by Jointly         sources and Evaluation (LREC 2018), Miyazaki,
  Learning to Align and Translate. arXiv:1409.0473         Japan. European Language Resources Association
  [cs, stat]. ArXiv: 1409.0473.                            (ELRA).
Regina Barzilay and Kathleen R. McKeown. 2001. Ex-        Hal Daumé, John Langford, and Daniel Marcu. 2009.
  tracting paraphrases from a parallel corpus. In Pro-      Search-based structured prediction. Machine Learn-
  ceedings of the 39th Annual Meeting on Associa-           ing, 75(3):297–325.
  tion for Computational Linguistics - ACL ’01, pages
  50–57, Toulouse, France. Association for Computa-
                                                          Bill Dolan and Chris Brockett. 2005. Automatically
  tional Linguistics.
                                                             Constructing a Corpus of Sentential Paraphrases.
Peter E Brown, Vincent J Della Pietra, Stephen A Della       In Third International Workshop on Paraphrasing
  Pietra, and Robert L Mercer. 1993. The Mathemat-          (IWP2005). Asia Federation of Natural Language
  ics of Statistical Machine Translation: Parameter Es-      Processing.
  timation. Computational Linguistics, 19(2):50.
                                                          Sergey Edunov, Myle Ott, Michael Auli, and David
Chris Callison-Burch. 2008. Syntactic Constraints on        Grangier. 2018. Understanding Back-Translation at
  Paraphrases Extracted from Parallel Corpora. In           Scale. In Proceedings of the 2018 Conference on
  Proceedings of the 2008 Conference on Empirical           Empirical Methods in Natural Language Processing,
  Methods in Natural Language Processing, pages             pages 489–500, Brussels, Belgium. Association for
  196–205, Honolulu, Hawaii. Association for Com-           Computational Linguistics.
  putational Linguistics.
                                                          Elozino Egonmwan and Yllias Chali. 2019a. Trans-
Chris Callison-Burch, Philipp Koehn, and Miles Os-
                                                            former and seq2seq model for Paraphrase Genera-
  borne. 2006. Improved Statistical Machine Trans-
                                                            tion. In Proceedings of the 3rd Workshop on Neural
  lation Using Paraphrases. In Proceedings of the
                                                            Generation and Translation, pages 249–255, Hong
  Human Language Technology Conference of the
                                                            Kong. Association for Computational Linguistics.
  NAACL, Main Conference, pages 17–24, New York
  City, USA. Association for Computational Linguis-
  tics.                                                   Elozino Egonmwan and Yllias Chali. 2019b. Trans-
                                                            former and seq2seq model for Paraphrase Genera-
Ziqiang Cao, Chuwei Luo, Wenjie Li, and Sujian Li.          tion. In Proceedings of the 3rd Workshop on Neural
  2017. Joint Copying and Restricted Generation for         Generation and Translation, pages 249–255, Hong
  Paraphrase. In Proceedings of the Thirty-First AAAI       Kong. Association for Computational Linguistics.
  Conference on Artificial Intelligence, February 4-9,
  2017, San Francisco, California, USA, pages 3152–       Yao Fu, Yansong Feng, and John P Cunningham.
  3158.                                                     2019. Paraphrase Generation with Latent Bag of
                                                            Words. In H. Wallach, H. Larochelle, A. Beygelz-
John Carroll, Guido Minnen, Darren Pearce, Yvonne           imer, F. d\textquotesingle Alché-Buc, E. Fox, and
  Canning, Siobhan Devlin, and John Tait. 1999. Sim-        R. Garnett, editors, Advances in Neural Information
  plifying text for language-impaired readers. In           Processing Systems 32, pages 13645–13656. Curran
  Ninth Conference of the European Chapter of the           Associates, Inc.
  Association for Computational Linguistics, pages
  269–270, Bergen, Norway. Association for Compu-         Yao Fu, Yansong Feng, and John P. Cunningham. 2020.
  tational Linguistics.                                     Paraphrase Generation with Latent Bag of Words.
R. Chandrasekar and B. Srinivas. 1997. Automatic in-        arXiv:2001.01941 [cs]. ArXiv: 2001.01941.
   duction of rules for text simplification. Knowledge-
   Based Systems, 10(3):183–190.                          Juri Ganitkevitch, Benjamin Van Durme, and Chris
                                                             Callison-Burch. 2013.     PPDB: The Paraphrase
Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakr-                 Database. In Proceedings of the 2013 Conference of
  ishna Vedantam, Saurabh Gupta, Piotr Dollar, and           the North American Chapter of the Association for
  C. Lawrence Zitnick. 2015. Microsoft COCO                  Computational Linguistics: Human Language Tech-
  Captions: Data Collection and Evaluation Server.           nologies, pages 758–764, Atlanta, Georgia. Associa-
  arXiv:1504.00325 [cs]. ArXiv: 1504.00325.                  tion for Computational Linguistics.

Sylvain Gelly and David Silver. 2007. Combining on-         Processing, pages 3865–3878, Brussels, Belgium.
  line and offline knowledge in uct. In Proceedings         Association for Computational Linguistics.
  of the 24th International Conference on Machine
  Learning, ICML ’07, page 273–280, New York, NY,        Nitin Madnani and Bonnie J. Dorr. 2010. Generat-
  USA. Association for Computing Machinery.                ing Phrasal and Sentential Paraphrases: A Survey of
                                                           Data-Driven Methods. Computational Linguistics,
Tommi Gröndahl and N. Asokan. 2019a. Effective            36(3):341–387.
  writing style imitation via combinatorial paraphras-
  ing. CoRR, abs/1905.13464.                             Kathleen R. McKeown. 1979. Paraphrasing Using
                                                           Given and New Information in a Question-Answer
Tommi Gröndahl and N. Asokan. 2019b. Effective            System. In 17th Annual Meeting of the Associa-
  writing style imitation via combinatorial paraphras-     tion for Computational Linguistics, pages 67–72, La
  ing. CoRR, abs/1905.13464.                               Jolla, California, USA. Association for Computa-
                                                           tional Linguistics.
Ankush Gupta, Arvind Agarwal, Prawaan Singh, and
  Piyush Rai. 2017. A Deep Generative Framework          Marie Meteer and Varda Shaked. 1988. STRATEGIES
  for Paraphrase Generation. arXiv:1709.05074 [cs].       FOR EFFECTIVE PARAPHRASING. In Coling
  ArXiv: 1709.05074.                                      Budapest 1988 Volume 2: International Conference
                                                          on Computational Linguistics.
Ankush Gupta, Arvind Agarwal, Prawaan Singh, and
  Piyush Rai. 2018. A Deep Generative Framework          Nicholas Metropolis, Arianna W. Rosenbluth, Mar-
  for Paraphrase Generation. In Thirty-Second AAAI         shall N. Rosenbluth, Augusta H. Teller, and Edward
  Conference on Artificial Intelligence.                   Teller. 2004. Equation of State Calculations by Fast
Sanda Harabagiu and Andrew Hickl. 2006. Methods            Computing Machines. The Journal of Chemical
  for Using Textual Entailment in Open-Domain Ques-        Physics, 21(6):1087. Publisher: American Institute
  tion Answering. In Proceedings of the 21st Interna-      of PhysicsAIP.
  tional Conference on Computational Linguistics and     Ning Miao, Hao Zhou, Lili Mou, Rui Yan, and
  44th Annual Meeting of the Association for Compu-        Lei Li. 2018a.       CGMH: Constrained Sen-
  tational Linguistics, pages 905–912, Sydney, Aus-        tence Generation by Metropolis-Hastings Sam-
  tralia. Association for Computational Linguistics.       pling. arXiv:1811.10996 [cs, math, stat]. ArXiv:
Shaohan Huang, Yu Wu, Furu Wei, and Zhongzhi               1811.10996.
  Luan. 2019. Dictionary-Guided Editing Networks
                                                         Ning Miao, Hao Zhou, Lili Mou, Rui Yan, and
  for Paraphrase Generation. Proceedings of the
  AAAI Conference on Artificial Intelligence, 33:6546–     Lei Li. 2018b.       CGMH: Constrained Sen-
  6553.                                                    tence Generation by Metropolis-Hastings Sam-
                                                           pling. arXiv:1811.10996 [cs, math, stat]. ArXiv:
Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke          1811.10996.
 Zettlemoyer. 2018.      Adversarial Example Gen-
 eration with Syntactically Controlled Paraphrase        Amit Moryossef, Yoav Goldberg, and Ido Dagan.
 Networks.       arXiv:1804.06059 [cs].     ArXiv:        2019. Step-by-Step: Separating Planning from
 1804.06059.                                              Realization in Neural Data-to-Text Generation.
                                                          arXiv:1904.03396 [cs]. ArXiv: 1904.03396.
Levente Kocsis and Csaba Szepesvári. 2006. Bandit
  Based Monte-carlo Planning. In Proceedings of          Daniel Naber. 2003. LanguageTool: A Rule-Based
  the 17th European Conference on Machine Learn-           Style and Grammar Checker.
  ing, ECML’06, pages 282–293, Berlin, Heidelberg.
  Springer-Verlag. Event-place: Berlin, Germany.         Joseph Olive, Caitlin Christianson, and John McCary,
                                                            editors. 2011. Handbook of Natural Language Pro-
Kornél Csernai. 2017. First Quora Dataset Release:         cessing and Machine Translation: DARPA Global
  Question Pairs - Data @ Quora - Quora.                   Autonomous Language Exploitation.         Springer-
                                                           Verlag, New York.
Kaori Kumagai, Ichiro Kobayashi, Daichi Mochihashi,
  Hideki Asoh, Tomoaki Nakamura, and Takayuki Na-        Myle Ott, Sergey Edunov, Alexei Baevski, Angela
  gai. 2016. Human-like Natural Language Genera-          Fan, Sam Gross, Nathan Ng, David Grangier, and
  tion Using Monte Carlo Tree Search. In Proceed-         Michael Auli. 2019. fairseq: A fast, extensible
  ings of the INLG 2016 Workshop on Computational         toolkit for sequence modeling. In Proceedings of
  Creativity in Natural Language Generation, pages        NAACL-HLT 2019: Demonstrations.
  11–18, Edinburgh, UK. Association for Computa-
  tional Linguistics.                                    Ellie Pavlick, Pushpendre Rastogi, Juri Ganitkevitch,
                                                            Benjamin Van Durme, and Chris Callison-Burch.
Zichao Li, Xin Jiang, Lifeng Shang, and Hang Li.            2015. PPDB 2.0: Better paraphrase ranking, fine-
  2018. Paraphrase Generation with Deep Reinforce-          grained entailment relations, word embeddings, and
  ment Learning. In Proceedings of the 2018 Con-            style classification. In Proceedings of the 53rd An-
  ference on Empirical Methods in Natural Language          nual Meeting of the Association for Computational

Linguistics and the 7th International Joint Confer-      deep neural networks and tree search.       Nature,
  ence on Natural Language Processing (Volume 2:           529(7587):484–489.
  Short Papers), pages 425–430, Beijing, China. As-
  sociation for Computational Linguistics.               Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
                                                           Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz
Matt Post. 2018. A Call for Clarity in Reporting BLEU      Kaiser, and Illia Polosukhin. 2017. Attention Is
 Scores. In Proceedings of the Third Conference on         All You Need. arXiv:1706.03762 [cs]. ArXiv:
 Machine Translation: Research Papers, pages 186–          1706.03762.
 191, Belgium, Brussels. Association for Computa-
 tional Linguistics.                                     Yaushian Wang and Hung-Yi Lee. 2018. Learning
                                                           to Encode Text as Human-Readable Summaries us-
Aaditya Prakash, Sadid A. Hasan, Kathy Lee, Vivek          ing Generative Adversarial Networks. In Proceed-
  Datla, Ashequl Qadir, Joey Liu, and Oladimeji Farri.     ings of the 2018 Conference on Empirical Methods
  2016a. Neural paraphrase generation with stacked         in Natural Language Processing, pages 4187–4195,
  residual LSTM networks. In Proceedings of COL-           Brussels, Belgium. Association for Computational
  ING 2016, the 26th International Conference on           Linguistics.
  Computational Linguistics: Technical Papers, pages
                                                         Sam Witteveen and Martin Andrews. 2019. Paraphras-
  2923–2934, Osaka, Japan. The COLING 2016 Orga-
                                                           ing with Large Language Models. In Proceedings of
  nizing Committee.
                                                           the 3rd Workshop on Neural Generation and Trans-
Aaditya Prakash, Sadid A Hasan, Kathy Lee, Vivek           lation, pages 215–220, Hong Kong. Association for
  Datla, Ashequl Qadir, Joey Liu, and Oladimeji Farri.     Computational Linguistics.
  2016b. Neural Paraphrase Generation with Stacked       Wei Xu, Courtney Napoles, Ellie Pavlick, Quanze
  Residual LSTM Networks. COLING, page 12.                 Chen, and Chris Callison-Burch. 2016. Optimizing
Alec Radford, Jeffrey Wu, Rewon Child, David Luan,         Statistical Machine Translation for Text Simplifica-
  Dario Amodei, and Ilya Sutskever. 2019. Language         tion. Transactions of the Association for Computa-
  models are unsupervised multitask learners. OpenAI       tional Linguistics, 4:401–415.
  Blog, 1(8):9.                                          Zhao Yan, Nan Duan, Junwei Bao, Peng Chen, Ming
                                                           Zhou, Zhoujun Li, and Jianshe Zhou. 2016. Doc-
Aurko Roy and David Grangier. 2019. Unsupervised           Chat: An Information Retrieval Approach for Chat-
  Paraphrasing without Translation. In Proceedings of      bot Engines Using Unstructured Documents. In Pro-
  the 57th Annual Meeting of the Association for Com-      ceedings of the 54th Annual Meeting of the Associa-
  putational Linguistics, pages 6033–6039, Florence,       tion for Computational Linguistics (Volume 1: Long
  Italy. Association for Computational Linguistics.        Papers), pages 516–525, Berlin, Germany. Associa-
Tobias Schwartz and Diedrich Wolter. 2018. A Variant       tion for Computational Linguistics.
  of Monte-Carlo Tree Search for Referring Expres-       Qian Yang, Dinghan Shen, Yong Cheng, Wenlin Wang,
  sion Generation. In KI 2018: Advances in Artificial      Guoyin Wang, Lawrence Carin, et al. 2019a. An
  Intelligence, pages 315–326. Springer, Cham.             end-to-end generative architecture for paraphrase
                                                           generation. In Proceedings of the 2019 Conference
Rico Sennrich, Barry Haddow, and Alexandra Birch.
                                                           on Empirical Methods in Natural Language Process-
  2016a. Improving neural machine translation mod-
                                                           ing and the 9th International Joint Conference on
  els with monolingual data. In Proceedings of the
                                                           Natural Language Processing (EMNLP-IJCNLP),
  54th Annual Meeting of the Association for Compu-
                                                           pages 3123–3133.
  tational Linguistics (Volume 1: Long Papers), pages
  86–96, Berlin, Germany. Association for Computa-       Yinfei Yang, Yuan Zhang, Chris Tar, and Jason
  tional Linguistics.                                      Baldridge. 2019b. PAWS-X: A Cross-lingual Ad-
                                                           versarial Dataset for Paraphrase Identification. In
Rico Sennrich, Barry Haddow, and Alexandra Birch.          Proceedings of the 2019 Conference on Empirical
  2016b. Neural Machine Translation of Rare Words          Methods in Natural Language Processing and the
  with Subword Units. In Proceedings of the 54th An-       9th International Joint Conference on Natural Lan-
  nual Meeting of the Association for Computational        guage Processing (EMNLP-IJCNLP), pages 3678–
  Linguistics (Volume 1: Long Papers), pages 1715–         3683, Hong Kong, China. Association for Computa-
  1725, Berlin, Germany. Association for Computa-          tional Linguistics.
  tional Linguistics.
                                                         Tianyi Zhang, Varsha Kishore, Felix Wu, Kil-
David Silver, Aja Huang, Chris J. Maddison, Arthur          ian Q. Weinberger, and Yoav Artzi. 2019a.
  Guez, Laurent Sifre, George van den Driessche, Ju-        BERTScore:     Evaluating Text Generation
  lian Schrittwieser, Ioannis Antonoglou, Veda Pan-        with BERT.   arXiv:1904.09675 [cs]. ArXiv:
  neershelvam, Marc Lanctot, Sander Dieleman, Do-          1904.09675.
  minik Grewe, John Nham, Nal Kalchbrenner, Ilya
  Sutskever, Timothy Lillicrap, Madeleine Leach, Ko-     Yuan Zhang, Jason Baldridge, and Luheng He. 2019b.
  ray Kavukcuoglu, Thore Graepel, and Demis Has-           PAWS: Paraphrase Adversaries from Word Scram-
  sabis. 2016. Mastering the game of Go with               bling. In Proceedings of the 2019 Conference of

the North American Chapter of the Association for
Computational Linguistics: Human Language Tech-
nologies, Volume 1 (Long and Short Papers), pages
1298–1308, Minneapolis, Minnesota. Association
for Computational Linguistics.
