Neural-Driven Search-Based Paraphrase Generation
Betty Fabre (Orange Labs, Lannion; Univ Rennes, IRISA), betty.fabre@orange.com
Jonathan Chevelu (Univ Rennes, IRISA, Lannion), jonathan.chevelu@irisa.fr
Tanguy Urvoy (Orange Labs, Lannion), tanguy.urvoy@orange.com
Damien Lolive (Univ Rennes, IRISA, Lannion), damien.lolive@irisa.fr

Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics, pages 2100-2111, April 19-23, 2021. ©2021 Association for Computational Linguistics

Abstract

We study a search-based paraphrase generation scheme where candidate paraphrases are generated by iterated transformations from the original sentence and evaluated in terms of syntax quality, semantic distance, and lexical distance. The semantic distance is derived from BERT, and the syntax quality is based on GPT2 perplexity. To solve this multi-objective search problem, we propose two algorithms: Monte-Carlo Tree Search for Paraphrase Generation (MCPG) and Pareto Tree Search (PTS). We provide an extensive set of experiments on 5 datasets with a rigorous reproduction and validation methodology for several state-of-the-art paraphrase generation algorithms. These experiments show that, although not explicitly supervised, our algorithms perform well against these baselines.

1 Introduction

Paraphrase generation, i.e. the transformation of a sentence into a well-formed but lexically different one while preserving its original meaning, is a fundamental task of NLP. Its ability to provide diversity and coverage finds applications in several domains like question answering (McKeown, 1979; Harabagiu and Hickl, 2006), machine translation (Callison-Burch et al., 2006), dialog systems (Yan et al., 2016), privacy (Gröndahl and Asokan, 2019a) or adversarial learning (Iyyer et al., 2018).

The formal definition of paraphrase may vary according to the targeted application and the tolerance we set along several axes, including the semantic distance from the source that we want to minimize, the quality of the syntax, and the lexical distance from the source that we want to maximize to ensure diversity.

The available aligned paraphrase corpora are often biased toward specific problems like question answering or image captioning. For instance, a transformer or a seq2seq model trained on a question-answering corpus will typically turn any input sentence into a question. With the lack of generic aligned datasets, it remains challenging to train generic paraphrase models in a supervised manner. On the other hand, with the availability of large-scale non-supervised language models like BERT and GPT2, the assessment of a given candidate paraphrase in terms of semantic distance from its source and lexical quality has become much more tractable.

Leveraging these metrics, we propose to cast the paraphrase generation task as a multi-criteria search problem. We use PPDB 2.0 (Pavlick et al., 2015), a large-scale database of rewriting rules derived from bilingual corpora, to potentially generate billions of "naive" candidate paraphrases by edition from the source sentence. To sort the good candidates efficiently from the others, we experiment with two search algorithms. The first one, called Monte-Carlo Paraphrase Generation (MCPG), is a variant of the Monte-Carlo Tree Search algorithm (MCTS) (Kocsis and Szepesvári, 2006; Gelly and Silver, 2007; Chevelu et al., 2009). The MCTS algorithm is famous for its successes on mastering the highly combinatorial game of Go (Gelly and Silver, 2007; Silver et al., 2016).

The second one is a novel search algorithm that we call Pareto Tree Search (PTS). In contrast to MCTS, which is a single-criterion search algorithm, we designed PTS to retrieve an approximation of the whole Pareto optimal set. This allows for more flexibility in paraphrase generation, where the balance between semantic distance, syntax quality, and lexical distance is hard to tune a priori. Another difference between MCPG and PTS is that PTS uses a randomized breadth-first exploration policy which proves to be more efficient on this problem.

The main contribution of this article is a study of search-based paraphrase generation through the
Table 1: PPDB rules applied to a source sentence.

Source sentence: he is speaking on june 14 .
  is → is found             :  he is found speaking on june 14 .
  is speaking → 's talking  :  he 's talking on june 14 .
  speaking → speak now      :  he is speak now on june 14 .
  14 → 14th                 :  he is speaking on june 14th .

sieve of three criteria: semantic similarity, syntax quality, and lexical distance. We propose and evaluate two search algorithms: MCPG and PTS. We also provide an extensive set of experiments on English datasets with a rigorous reproduction and validation methodology for several state-of-the-art paraphrase generation algorithms.

This article is organized as follows. The candidate paraphrase generation scheme is presented in Section 2. In Section 3, we develop the different criteria we use to qualify correct paraphrases. The two search algorithms (MCPG and PTS) are described in Section 4. Section 5 gives a survey of other state-of-the-art paraphrase generation algorithms. The comparisons with non-supervised and supervised baselines are presented in Section 6, where the methodology is discussed in Section 6.3. We conclude with a data-augmentation experiment in Section 6.6.

2 Paraphrase generation scheme

We model paraphrase generation as a sequence of editions and transformations from a source sentence into its paraphrase. In this work, we only consider local transformations, i.e. replacement of certain words or groups of words by others that have the same or similar meanings, but the method we propose should work with more sophisticated transformations as well.

The Paraphrase Database (PPDB) (Ganitkevitch et al., 2013; Pavlick et al., 2015) is a large collection of scored paraphrase rules that was automatically constructed from various bilingual corpora using a pivot alignment method (Callison-Burch, 2008). The database is divided into increasingly large and decreasingly accurate subsets¹. We used the XL subset, and we removed the rules labeled as "Independent". This left us with a set of 5.5 million rewriting rules. We give some examples of these rules in Table 1.

¹ PPDB is available on http://paraphrase.org, and a python interface is available at https://github.com/erickrf/ppdb

By iteratively applying the rules from a source sentence like the one in Table 1, we obtain a vast lattice of candidate paraphrases. Some of these candidates, like "he 's talking on june 14", are well-formed, but many are syntactically broken, like "he is speak now on june 14". The number of rules that apply depends on the source sentence's size and the words it contains. For instance, on the MSRPARAPHRASE dataset (see Section 6.2), sentences are quite long and the median number of PPDB-XL rules that apply is around 450. After two rewriting steps, the median number of candidates is around 10^5, and by iterative rewriting, we quickly reach a number of paraphrase candidates that is greater than 10^8.

3 Paraphrase selection criteria

As it depends on the type of text we consider, which may be spoken or written, casual or formal, it is not easy to define a universal semantic distance or a universal scale of well-formed syntax. However, recent advances in NLP with neural networks like BERT and GPT2, trained on huge corpora, have led to the development of metrics that can act as good proxies for these ideal notions.

For the semantic distance, a quick experiment confirms that the BERT score (Zhang et al., 2019a) performs well on difficult paraphrase identification tasks. The BERT score is an F1-measure over an alignment of the BERT contextual word embeddings of each of the sentences. To assess the sensitivity of this score, we computed the Area Under the ROC curve (AUC) on QQP and PAWS, two difficult paraphrase identification corpora (Kornél Csernai, 2017; Zhang et al., 2019b). On QQP, we obtained 75.2%, and on PAWS, we obtained 67.0%. The PAWS corpus being designed to trick paraphrase identification classifiers, 67.0% is a reasonable performance. We hence opted for the BERT score between the source sentence and the paraphrase candidate (denoted BERT_S) as our semantic score.

Regarding the syntax quality, the perplexity of GPT2 (Radford et al., 2019) is a good ranking criterion, although, as illustrated in Table 2, in some cases a rule-based spell-checker may detect errors that GPT2 would miss (but the reverse is also true). We hence opted for GPT2 as a primary criterion for syntax quality, combined with the LANGUAGETOOL spell-checker (Naber, 2003) that we only used at a second stage for performance reasons.

The lexical distance is important to ensure the
diversity of the produced paraphrases. It is, however, simple to handle. Some authors use the BLEU surface metric (Miao et al., 2018a); we opted here for the normalized character-level Levenshtein edition distance.

The balance between these criteria is difficult to obtain. Table 2 illustrates their impact on a sentence taken from the train set of MSRPARAPHRASE. The candidate examples in this table underline the tough dilemma between maximizing the semantic similarity (safe and conservative policy) and maximizing the lexical distance (risk-prone). The third and fourth examples underline the utility of the spell-checker: some low-perplexity examples are ill-formed. In the second part of the table, we printed the sentences chosen by our two models: MCPG and PTS.

4 Searching algorithms

Searching for good paraphrases in the large lattice of candidates generated by PPDB is a costly task. We propose two algorithms that share a similar structure: an outer loop explores the lattice at different depths, while an inner loop explores the candidates at each depth. Both algorithms are anytime: they return the best solutions found so far when the time or space budget is depleted.

4.1 MCPG: Monte-Carlo tree search for Paraphrase Generation

Following the idea of Chevelu et al. (2009), we used Monte-Carlo Tree Search (MCTS) to explore the PPDB lattice. The three key ingredients of MCTS are: a bandit policy at each node of a search tree to select the most promising paths, randomized roll-outs to estimate the quality of these paths, and back-propagation of rewards along the paths to update the bandit. We opted here for a randomized bandit policy called EXP3 (Auer et al., 2002). The MCTS algorithm being not designed for multi-objective problems, we needed to combine semantic similarity BERT_S, syntax correctness GPT2, and surface diversity Lev_S into a single criterion. We opted for the following polynomial:

    α · BERT_S + β · Lev_S · BERT_S − γ · GPT2    (1)

where the product Lev_S · BERT_S is intended to avoid a trivial maximization of the score by applying a lot of editions to the source sentence. After a few experiments on train sets, we tuned empirically the weights to α = 3, β = 0.5 and γ = 0.025 in order to obtain a balance such as the one described in Table 2.

4.2 PTS: Pareto Tree Search in the paraphrase lattice

We observed two drawbacks of MCTS. First, it was designed for combinatorial problems like Go where the evaluation is only possible on the leaves of the search tree. This is not the case for paraphrase generation, where the neural models can evaluate any rewriting step and where rewriting from good candidates is more likely to provide good paraphrases than rewriting from bad ones. Secondly, it has been designed for single-criterion search, which requires fixing the balance between criteria definitively before any paraphrase search begins. This is not very flexible, and it becomes painful when we want to generate sets of candidates.

By plotting the distributions of the scores as in Figure 1, we noticed that most of the candidates were dominated in the Pareto sense: it was possible to eliminate most of the candidates without any hyper-parameter tuning. Hence, we adapted MCPG to explore the paraphrase lattice and recover an approximation of the Pareto front, postponing the balance between criteria to a quick post-optimization stage. This led us to the PTS algorithm described as pseudo-code in Table 3.

[Figure 1 plot: x-axis Levenshtein distance, y-axis BERT score, color scale −GPT2; dashed score isolines and the Pareto front are shown.]

Figure 1: The cloud of candidates generated from the sample of Table 2. The optima of any positive combination of BERT score, normalized Levenshtein distance, and GPT2 perplexity belong to the Pareto front (orange dots). We plotted the projections of the MCPG combined score (1) with dashed isolines. The BERT score and Levenshtein distance being clearly anti-correlated, the balance between these two criteria is difficult to tune.
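To make the three criteria and their combination concrete, here is a minimal Python sketch of the scoring machinery: a normalized character-level Levenshtein distance, the MCPG combined score (1) with the weights reported above, and a Pareto filter over (BERT_S, Lev_S, GPT2) triples. The function names and the example triples are illustrative, not the paper's implementation.

```python
from typing import List, Tuple

def normalized_levenshtein(a: str, b: str) -> float:
    """Character-level edit distance divided by the longer length (0 = identical)."""
    n, m = len(a), len(b)
    if max(n, m) == 0:
        return 0.0
    prev = list(range(m + 1))
    for i in range(1, n + 1):
        cur = [i] + [0] * m
        for j in range(1, m + 1):
            cur[j] = min(prev[j] + 1,                            # deletion
                         cur[j - 1] + 1,                         # insertion
                         prev[j - 1] + (a[i - 1] != b[j - 1]))   # substitution
        prev = cur
    return prev[m] / max(n, m)

def combined_score(bert_s: float, lev_s: float, gpt2: float,
                   alpha: float = 3.0, beta: float = 0.5, gamma: float = 0.025) -> float:
    """Single-criterion MCPG objective (1): alpha*BERT_S + beta*Lev_S*BERT_S - gamma*GPT2."""
    return alpha * bert_s + beta * lev_s * bert_s - gamma * gpt2

def pareto_front(points: List[Tuple[float, float, float]]) -> List[Tuple[float, float, float]]:
    """Keep candidates that are not dominated: BERT_S and Lev_S maximized, GPT2 minimized."""
    def dominates(p, q):
        return p != q and p[0] >= q[0] and p[1] >= q[1] and p[2] <= q[2]
    return [q for q in points if not any(dominates(p, q) for p in points)]
```

On a toy cloud of (BERT_S, Lev_S, GPT2) triples, `pareto_front` discards the dominated candidates exactly as in Figure 1, without committing to any weighting of the criteria.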
Source: the agency could not say when the tape was made , though the voice says he is speaking on june 14 .

Policy: high BERT (conservative)
  Candidate: the agency could not say when the tape was made , though the voice says he is talking on june 14 .
  BERT_S 0.99 | GPT2 4.23 | Lev_S 0.04 | Errs 0

Policy: high Lev (risk prone)
  Candidate: the organizations and agencies could just 'm saying whenever the tape-based was provided presentation there are , however the express opinions informed that he was found talking pertaining end-june fourteen .
  BERT_S 0.60 | GPT2 7.83 | Lev_S 1.0 | Errs 0

Policy: high GPT2 (bad syntax)
  Candidate: the agencies was not able say when the tape been given currently undertaking though the voice indicated he talking on june 14 .
  BERT_S 0.82 | GPT2 7.10 | Lev_S 0.55 | Errs 2

Policy: spell & grammar errors
  Candidate: the organisation are incapable of 're saying when the tape is set out , despite the fact that the voice just said he have a conversation on june 14 .
  BERT_S 0.73 | GPT2 5.49 | Lev_S 0.79 | Errs 2

Policy: balanced
  Candidate: the organization could not say when the tape was made , although the voice indicates that he is talking on june 14 .
  BERT_S 0.95 | GPT2 4.34 | Lev_S 0.28 | Errs 0

MCPG output: the organization could not say when the tape was made , though the voice indicates that he is talking on june 14 .
  BERT_S 0.96 | GPT2 4.36 | Lev_S 0.26 | Errs 0

PTS output: the agency could not say when the tape was made , although the voice says he is speaking on june 14 .
  BERT_S 1.00 | GPT2 4.17 | Lev_S 0.02 | Errs 0

Target: the agency could put no exact date on the tape , though the voice says he is speaking on june 14 .
  BERT_S 0.87 | GPT2 4.77 | Lev_S 0.22 | Errs 0

Table 2: An example of a source sentence sampled from the MSRPARAPHRASE train set with some representative candidates from its PPDB rewriting graph. For each of the candidates, we computed the BERT score with respect to the source (BERT_S), the normalized GPT2 perplexity (GPT2), the Levenshtein distance from the source sentence (Lev_S), and the number of spell and grammar errors detected (Errs). The PPDB editions are highlighted in green. The detected spell and grammar errors are highlighted in purple. The first candidate maximizes the semantic similarity (BERT score) and is very conservative. The second one maximizes the surface diversity (Lev_S) and takes a lot of risks. The third one shows an ill-formed paraphrase candidate. The fourth candidate emphasizes the utility of the spell-checker. The last candidate, on the fifth row, achieves our equilibrium goal. In the second part, we show the paraphrases generated by our models MCPG and PTS, and in the third part, we give the reference paraphrase from the dataset.

    function PTS(input sentence)
        candidates ← REWRITE(input sentence)
        depth ← 1
        while time/space < budget do
            while time/space < layer-budget do
                batch ← SAMPLE(candidates)
                scored ← scored ∪ NN.SCORES(batch)
            end while
            layer front set ← PARETO-FRONT(scored)
            candidates ← REWRITE(layer front set)
            depth ← depth + 1
        end while
        return PARETO-FRONT(all scored nodes)
    end function

Table 3: Pareto Tree Search (PTS) algorithm.

5 Related work

Rule-based and statistical approaches. Following the path of machine translation, the paraphrase generation literature first evolved from laboriously handcrafted linguistic rules (McKeown, 1979; Meteer and Shaked, 1988; Chandrasekar and Srinivas, 1997; Carroll et al., 1999) to more automatic and data-driven rule extraction methods (Callison-Burch et al., 2006; Madnani and Dorr, 2010). As in machine translation, phrase-level substitution rules can be extracted by sub-sentence alignment algorithms from parallel corpora (Brown et al., 1993). Building such a dedicated corpus being a long and costly task, one usually transforms other corpora through a "pivot representation" (Barzilay and McKeown, 2001; Callison-Burch et al., 2006; Ganitkevitch et al., 2013; Chen et al., 2015). The weakness of these approaches is that phrase-level rewriting rules alone are not able to build coherent sentences. A typical data-driven paraphrase generator used to be a mixture of potentially noisy handcrafted and data-driven rewriting rules coupled with a score that had to be optimized in real-time through dynamic programming. However, dynamic programming methods like Viterbi are constrained by the requirement of a score that decomposes into a sum of word-level or phrase-level criteria (Xu et al., 2016). Some attempts were made to relax this constraint with search-based approaches (Chevelu et al., 2009; Daumé et al., 2009), but the globally optimized criteria were simplistic, and the obtained solutions were not suitable for practical deployment.
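The Table 3 pseudo-code can be made runnable with a small Python sketch. Here `rewrite` and `score` are hypothetical stand-ins for the PPDB rule application and the batched neural scorers, and fixed depth/layer budgets replace the time/space budget; this is a sketch of the breadth-first Pareto search, not the authors' code.

```python
import random
from typing import Callable, Dict, List, Tuple

Score = Tuple[float, float, float]  # (BERT_S, Lev_S, -GPT2): higher is better on every axis

def pareto_front(scored: Dict[str, Score]) -> List[str]:
    """Sentences whose score triple is not dominated by another scored sentence."""
    def dominates(p: Score, q: Score) -> bool:
        return p != q and all(a >= b for a, b in zip(p, q))
    return [s for s, q in scored.items()
            if not any(dominates(p, q) for p in scored.values())]

def pts(source: str,
        rewrite: Callable[[str], List[str]],
        score: Callable[[str], Score],
        depth_budget: int = 3,
        layer_budget: int = 50,
        seed: int = 0) -> List[str]:
    """Breadth-first Pareto Tree Search over the rewriting lattice (sketch of Table 3)."""
    rng = random.Random(seed)
    candidates = rewrite(source)
    scored: Dict[str, Score] = {}
    for _depth in range(depth_budget):
        # Inner loop: sample and score a batch of candidates at the current depth.
        batch = rng.sample(candidates, min(layer_budget, len(candidates)))
        for sentence in batch:
            scored.setdefault(sentence, score(sentence))
        # Rewrite only from the layer's Pareto front.
        layer_front = pareto_front({s: scored[s] for s in batch})
        candidates = [c for s in layer_front for c in rewrite(s)] or candidates
    return pareto_front(scored)

# Toy usage with two hypothetical rewrite rules and an arbitrary scorer.
rules = {"speaking": ["talking"], "june": ["June"]}

def toy_rewrite(s: str) -> List[str]:
    return [s.replace(src, tgt, 1)
            for src, tgts in rules.items() if src in s
            for tgt in tgts] or [s]

def toy_score(s: str) -> Score:
    # Favors short sentences and the capitalized month; purely illustrative.
    return (1.0 - 0.01 * len(s), s.count("J") / 10.0, -len(s) / 100.0)

front = pts("he is speaking on june 14 .", toy_rewrite, toy_score)
```

The anytime behavior of the real algorithm would replace the fixed budgets with wall-clock or memory checks, returning the current front when the budget is depleted.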
Supervised encoder-decoder approaches. Like machine translation, paraphrase generation benefited from deep neural networks and evolved to efficient end-to-end architectures that can both learn to align and translate (Bahdanau et al., 2016; Vaswani et al., 2017). Several papers like (Prakash et al., 2016a; Cao et al., 2017) set the paraphrase generation task as a supervised sequence-to-sequence problem. As confirmed by our experiments in Section 6.4, this approach is efficient for specific types of paraphrases. It is also able to produce relatively long-range transformations, but it requires huge and high-quality sentence-level aligned datasets for training.

The paraphrase generation literature mostly reports results on the MSCOCO (Chen et al., 2015) and QQP (Kornél Csernai, 2017) datasets, which are built respectively from image captions and semi-duplicate questions. These datasets are very specific: MSCOCO is strongly biased toward image description sentences, and QQP is dedicated to questions. The Transformer model we trained on QQP typically transforms any input into a question. For instance, from "He is speaking on june 14." it gives "Who is speaking on june 14?".

Generative and hybrid approaches. Recent approaches rely on conditional generative models like the CVAE to allow the generation of paraphrase sets and palliate the lack of supervision data (Gupta et al., 2018; Yang et al., 2019a; Roy and Grangier, 2019; An and Liu, 2019). Others make use of CGANs combined with reinforcement learning (Li et al., 2018; Wang and Lee, 2018). Witteveen and Andrews (2019) propose to simply fine-tune a large non-supervised model like GPT2.

Recent text-generation architectures like the ones proposed in (Moryossef et al., 2019; Fu et al., 2019) make a clear separation between a rule-based planning phase and the neural realization. A similar idea was tested by Huang et al. (2019), where PPDB rules are used to control the decoder of a CVAE.

Search-based approaches. Search-based methods regained interest in the text generation community for several reasons, including the need for flexibility and the fact that, with deep neural networks, the search evaluation criteria have become more reliable. These methods are often slower than auto-regressive text generation methods, but it is always possible to distill the models into faster ones, as we do in Section 6.6. Gröndahl and Asokan (2019b) deployed a search-based policy for style imitation, and Schwartz and Wolter (2018) and Kumagai et al. (2016) both used MCTS for text generation. Following the same trend, Miao et al. (2018a) proposed to use Metropolis-Hastings sampling (Metropolis et al., 2004) for constrained sentence generation in an algorithm called CGMH.

Starting from the source sentence, the CGMH algorithm samples a sequence of sentences by using local editions: word replacement, deletion, and insertion. For paraphrase generation, CGMH constrains the sentence generation using a matching function that combines a measure of semantic similarity and a measure of English fluency. This model is therefore directly comparable with our MCPG and PTS approaches.

6 Experiments

The evaluation metrics and datasets are described respectively in Sections 6.1 and 6.2. We paid attention to set up a rigorous validation protocol for our experiments. The reproducibility and methodology issues that we faced are discussed in Section 6.3. We compare our models with state-of-the-art supervised methods and with CGMH, another search-based algorithm. The technical details on these algorithms are developed in Section 6.4. The results are detailed in Section 6.5.

6.1 Evaluation metrics

We rely on standard machine translation metrics that compare the generated paraphrase to one or several ground-truth references (Olive et al., 2011). We report the surface metrics BLEU and TER and the semantic metrics METEOR and BERT², the average BERT score of the generated sentence with respect to the reference sentences³.

6.2 Evaluation datasets

Table 4 gives a summary of the datasets we used. On one hand, we have large datasets like MSCOCO or OPUSPARCUS (OPUSPAR.) that are noisy and very specific. On the other hand, we have MSRPARAPHRASE (MSRPARA.), a high-quality but small human-labeled dataset.

² This is the same metric, but this usage of the BERT score against the reference must not be confused with the BERT score against the source (BERT_S) that we use in (1).
³ We used the scripts from github.com/jhclark/multeval and github.com/Tiiiger/bert_score. For the BERT score we used 'bert-base-uncased L8 no-idf version=0.1.2'.
[Table 4: dataset summary; columns Corpus, Size, Len, B>0.75, ... (table body not recovered in this transcription).]
6.4 Baseline systems implementation

In the next subsections, we present the re-implemented encoder-decoder neural network architectures and the weakly-supervised paraphrase generator used as baselines in our experiments.

Supervised paraphrase generators. As supervised baselines, we trained three neural network architectures that were previously reported to achieve good results on MSCOCO and QQP: the Seq2Seq architecture, a Residual LSTM architecture (Prakash et al., 2016b), and a TRANSFORMER model (Egonmwan and Chali, 2019a). We extended the experiments to the other aligned corpora: MSRPARAPHRASE, OPUSPARCUS and PAWS.

To be more precise, we trained a 4-layer LSTM Seq2Seq with a bidirectional encoder and a decoder using attention. This architecture is reported as SEQ2SEQ in the results. We trained a 4-layer Residual LSTM Seq2Seq as introduced by Prakash et al. (2016b) and reproduced by Fu et al. (2019). This architecture is reported as RESIDUAL LSTM in the results. The results we obtained with this model are close to the ones reported by Fu et al. (2019). Finally, we trained a TRANSFORMER using the transformer base hyper-parameter set from (Vaswani et al., 2017). This architecture is reported as TRANSFORMER BASE in the results.

For all the encoder-decoder experiments, we used the fairseq framework (Ott et al., 2019), which implements the SEQ2SEQ and TRANSFORMER architectures. We added our own implementation of the RESIDUAL LSTM architecture.

For preprocessing, we used the Moses tokenizer and subword segmentation following Sennrich et al. (2016b), using the subword-nmt library. The maximum sentence length is set to 1024 tokens, which is the default setting in fairseq. For decoding, we did a beam search with a beam of size 5.

Weakly-supervised paraphrase generator. For the weakly-supervised strategy CGMH introduced by Miao et al. (2018b), we used the official code. We managed to reproduce their results on QQP: on our test set we achieve a BLEU score of 22.5 while they reported 18.8. We then extended the experiment to other datasets and metrics.

6.5 Results

Table 6 summarizes the results of the comparison of our models, the supervised encoder-decoder neural networks, and the weakly-supervised method CGMH. Overall, these results are mixed; it is however important to keep in mind that, contrary to the supervised baselines, which are retrained for each dataset, the parameters of the CGMH, MCPG and PTS models are left unchanged.

On the MSCOCO and QQP datasets, the supervised baselines achieve clearly better results, but MCPG and PTS achieve better results on OPUSPARCUS and PAWS, except with the BERT score, for which the TRANSFORMER model achieves similar results. On MSRPARAPHRASE, the encoder-decoder neural network models perform poorly. This result can be explained by the small number of training examples available on this corpus (see Table 4). On the weakly-supervised side, the MCPG and PTS models outperform the CGMH baseline on all corpora except on the MSCOCO dataset, where the results are similar.

These results prove that even without specialized training sets, generic search-based methods are competitive for paraphrase generation. However, it is a fact that encoder-decoder networks have excellent performance for text generation and have the potential to generate more complex paraphrases than those obtained by simple local transformations as in our models.

Training a general, all-purpose paraphrase generation network would require a huge volume of data, and there are yet much fewer aligned corpora available for paraphrase than for translation.

6.6 Data-augmentation experiment

In order to get the best of both worlds, one option is to enrich the training set of a TRANSFORMER with the results of a search-based method.

To test this idea, we used our models to augment the MSRPARAPHRASE training set. For that purpose, we created new pairs of paraphrases from unused sentences of MSRPARAPHRASE (the pairs labeled as "not paraphrases") using MCPG and PTS. We then trained new supervised TRANSFORMER models on the augmented training sets.

Having no guarantee that our models generate syntactically perfect sentences, we inverted the pairs, thus taking the generated paraphrases as input and the dataset's sentences as output. This trick, called back-translation, forces the target model to generate correct sentences (Sennrich et al., 2016a; Edunov et al., 2018).

We report the results of this experiment in Table 7. The models trained with the augmented
training sets achieved a significant performance gain on BLEU, TER and METEOR.

Corpus          Model             BLEU↑  TER↓  METEOR↑  BERT↑
MSCOCO          SEQ2SEQ           27.5   62.3  24.3     0.76
                RESIDUAL LSTM     26.9   63.3  24.2     0.76
                TRANSFORMER BASE  26.9   63.3  24.2     0.76
                CGMH              17.3   72.6  21.9     0.7
                MCPG              16.5   73.5  23.2     0.71
                PTS               17.0   69.9  22.8     0.64
QQP             SEQ2SEQ           29.2   60.3  30.7     0.8
                RESIDUAL LSTM     28.4   59.1  30.2     0.8
                TRANSFORMER BASE  29.1   59.5  30.5     0.8
                CGMH              22.5   65.0  27.0     0.72
                MCPG              24.1   64.5  31.8     0.78
                PTS               25.6   58.7  31.4     0.78
OPUSPARCUS      SEQ2SEQ            8.4   79.3  13.8     0.69
                RESIDUAL LSTM      8.1   78.6  14.3     0.7
                TRANSFORMER BASE   8.1   84.3  13.9     0.7
                CGMH               7.6   78.9  16.8     0.58
                MCPG               9.6   78.6  23.3     0.67
                PTS                9.1   70.2  22.1     0.66
PAWS            SEQ2SEQ           44.0   36.8  39.2     0.92
                RESIDUAL LSTM     43.6   37.1  38.9     0.92
                TRANSFORMER BASE  42.4   37.6  38.9     0.92
                CGMH              15.4   58.1  20.7     0.61
                MCPG              55.5   24.3  49.2     0.93
                PTS               57.9   21.9  48.5     0.92
MSRPARAPHRASE   SEQ2SEQ           11.6   89.5  12.6     0.53
                RESIDUAL LSTM     10.5   93.7  11.2     0.52
                TRANSFORMER BASE  20.7   76.7  21.6     0.65
                CGMH               9.7   72.9  15.4     0.48
                MCPG              39.3   52.4  37.2     0.81
                PTS               40.3   48.4  36.1     0.80

Table 6: Experiments summary. Symbol '↑' means that a higher value is better. Significantly best values are marked in bold. MCPG and PTS outperform state-of-the-art models on OPUSPAR., PAWS and MSRPARA. Contrary to the supervised baselines, the parameters of the MCPG and PTS models are left unchanged for each dataset.

Train set       BLEU   TER   METEOR  BERT
ORIG.           20.7   76.7  21.6    0.65
ORIG. + MCPG    25.3   69.6  24.5    0.63
ORIG. + PTS     25.3   71.1  24.6    0.63

Table 7: Data-augmentation experiment summary. We trained a TRANSFORMER base model on three versions of the MSRPARAPHRASE train set: the original train set (ORIG.) and the original train set extended by paraphrasing other sentences from the same distribution with the MCPG (ORIG. + MCPG) and PTS (ORIG. + PTS) models.

7 Conclusion

We experimented with two search-based approaches for paraphrase generation. These approaches are pragmatic and flexible. Being generic, our approaches did not overfit on small datasets. We performed extensive experiments with a rigorous evaluation methodology that we applied both to our algorithms and to the other related methods that we tried to reproduce faithfully. These experiments confirm that our two methods, namely MCPG and PTS, are comparable to supervised state-of-the-art baselines despite being less tightly supervised. When compared with CGMH, another search-based and weakly-supervised method, our algorithms proved to be faster and more efficient.

We plan to refine the scoring with deep reinforcement learning techniques and to enrich the edition rules with more sophisticated patterns like phrase permutations. Our search algorithms remain slow for real-time deployment: the current versions are better suited as an offline model for data augmentation. The experiment of Section 6.6 confirms that this application of search-based methods is a promising research avenue. A planning-then-realization hybridization like the ones proposed by Moryossef et al. (2019) and Fu et al. (2019) could also be considered for future work.

Acknowledgment

We want to thank the reviewers for their constructive comments. This work was funded by the ANR Cifre convention N° 2018/1559 and Orange Labs Research.
References

Zhecheng An and Sicong Liu. 2019. Towards Diverse Paraphrase Generation Using Multi-Class Wasserstein GAN. CoRR, abs/1909.13827.

Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. 2002. The Nonstochastic Multiarmed Bandit Problem. SIAM Journal on Computing, 32(1):48–77.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2016. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv:1409.0473 [cs, stat].

Regina Barzilay and Kathleen R. McKeown. 2001. Extracting paraphrases from a parallel corpus. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (ACL '01), pages 50–57, Toulouse, France. Association for Computational Linguistics.

Peter E. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2).

Chris Callison-Burch. 2008. Syntactic Constraints on Paraphrases Extracted from Parallel Corpora. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 196–205, Honolulu, Hawaii. Association for Computational Linguistics.

Chris Callison-Burch, Philipp Koehn, and Miles Osborne. 2006. Improved Statistical Machine Translation Using Paraphrases. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pages 17–24, New York City, USA. Association for Computational Linguistics.

Ziqiang Cao, Chuwei Luo, Wenjie Li, and Sujian Li. 2017. Joint Copying and Restricted Generation for Paraphrase. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pages 3152–3158, San Francisco, California, USA.

John Carroll, Guido Minnen, Darren Pearce, Yvonne Canning, Siobhan Devlin, and John Tait. 1999. Simplifying text for language-impaired readers. In Ninth Conference of the European Chapter of the Association for Computational Linguistics, pages 269–270, Bergen, Norway. Association for Computational Linguistics.

R. Chandrasekar and B. Srinivas. 1997. Automatic induction of rules for text simplification. Knowledge-Based Systems, 10(3):183–190.

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C. Lawrence Zitnick. 2015. Microsoft COCO Captions: Data Collection and Evaluation Server. arXiv:1504.00325 [cs].

Jonathan Chevelu, Thomas Lavergne, Yves Lepage, and Thierry Moudenc. 2009. Introduction of a new paraphrase generation tool based on Monte-Carlo sampling. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 249–252, Suntec, Singapore. Association for Computational Linguistics.

Mathias Creutz. 2018. Open Subtitles Paraphrase Corpus for Six Languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Hal Daumé, John Langford, and Daniel Marcu. 2009. Search-based structured prediction. Machine Learning, 75(3):297–325.

Bill Dolan and Chris Brockett. 2005. Automatically Constructing a Corpus of Sentential Paraphrases. In Third International Workshop on Paraphrasing (IWP2005). Asia Federation of Natural Language Processing.

Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding Back-Translation at Scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 489–500, Brussels, Belgium. Association for Computational Linguistics.

Elozino Egonmwan and Yllias Chali. 2019a. Transformer and seq2seq model for Paraphrase Generation. In Proceedings of the 3rd Workshop on Neural Generation and Translation, pages 249–255, Hong Kong. Association for Computational Linguistics.

Elozino Egonmwan and Yllias Chali. 2019b. Transformer and seq2seq model for Paraphrase Generation. In Proceedings of the 3rd Workshop on Neural Generation and Translation, pages 249–255, Hong Kong. Association for Computational Linguistics.

Yao Fu, Yansong Feng, and John P. Cunningham. 2019. Paraphrase Generation with Latent Bag of Words. In Advances in Neural Information Processing Systems 32, pages 13645–13656. Curran Associates, Inc.

Yao Fu, Yansong Feng, and John P. Cunningham. 2020. Paraphrase Generation with Latent Bag of Words. arXiv:2001.01941 [cs].

Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The Paraphrase Database. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 758–764, Atlanta, Georgia. Association for Computational Linguistics.

Sylvain Gelly and David Silver. 2007. Combining online and offline knowledge in UCT. In Proceedings of the 24th International Conference on Machine Learning (ICML '07), pages 273–280, New York, NY, USA. Association for Computing Machinery.

Tommi Gröndahl and N. Asokan. 2019a. Effective writing style imitation via combinatorial paraphrasing. CoRR, abs/1905.13464.

Tommi Gröndahl and N. Asokan. 2019b. Effective writing style imitation via combinatorial paraphrasing. CoRR, abs/1905.13464.

Ankush Gupta, Arvind Agarwal, Prawaan Singh, and Piyush Rai. 2017. A Deep Generative Framework for Paraphrase Generation. arXiv:1709.05074 [cs].

Ankush Gupta, Arvind Agarwal, Prawaan Singh, and Piyush Rai. 2018. A Deep Generative Framework for Paraphrase Generation. In Thirty-Second AAAI Conference on Artificial Intelligence.

Sanda Harabagiu and Andrew Hickl. 2006. Methods for Using Textual Entailment in Open-Domain Question Answering. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 905–912, Sydney, Australia. Association for Computational Linguistics.

Shaohan Huang, Yu Wu, Furu Wei, and Zhongzhi Luan. 2019. Dictionary-Guided Editing Networks for Paraphrase Generation. Proceedings of the AAAI Conference on Artificial Intelligence, 33:6546–6553.

Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke Zettlemoyer. 2018. Adversarial Example Generation with Syntactically Controlled Paraphrase Networks. arXiv:1804.06059 [cs].

Levente Kocsis and Csaba Szepesvári. 2006. Bandit Based Monte-Carlo Planning. In Proceedings of the 17th European Conference on Machine Learning (ECML '06), pages 282–293, Berlin, Heidelberg. Springer-Verlag.

Kornél Csernai. 2017. First Quora Dataset Release: Question Pairs. Data @ Quora.

Kaori Kumagai, Ichiro Kobayashi, Daichi Mochihashi, Hideki Asoh, Tomoaki Nakamura, and Takayuki Nagai. 2016. Human-like Natural Language Generation Using Monte Carlo Tree Search. In Proceedings of the INLG 2016 Workshop on Computational Creativity in Natural Language Generation, pages 11–18, Edinburgh, UK. Association for Computational Linguistics.

Zichao Li, Xin Jiang, Lifeng Shang, and Hang Li. 2018. Paraphrase Generation with Deep Reinforcement Learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3865–3878, Brussels, Belgium. Association for Computational Linguistics.

Nitin Madnani and Bonnie J. Dorr. 2010. Generating Phrasal and Sentential Paraphrases: A Survey of Data-Driven Methods. Computational Linguistics, 36(3):341–387.

Kathleen R. McKeown. 1979. Paraphrasing Using Given and New Information in a Question-Answer System. In 17th Annual Meeting of the Association for Computational Linguistics, pages 67–72, La Jolla, California, USA. Association for Computational Linguistics.

Marie Meteer and Varda Shaked. 1988. Strategies for Effective Paraphrasing. In Coling Budapest 1988 Volume 2: International Conference on Computational Linguistics.

Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller. 2004. Equation of State Calculations by Fast Computing Machines. The Journal of Chemical Physics, 21(6):1087. American Institute of Physics.

Ning Miao, Hao Zhou, Lili Mou, Rui Yan, and Lei Li. 2018a. CGMH: Constrained Sentence Generation by Metropolis-Hastings Sampling. arXiv:1811.10996 [cs, math, stat].

Ning Miao, Hao Zhou, Lili Mou, Rui Yan, and Lei Li. 2018b. CGMH: Constrained Sentence Generation by Metropolis-Hastings Sampling. arXiv:1811.10996 [cs, math, stat].

Amit Moryossef, Yoav Goldberg, and Ido Dagan. 2019. Step-by-Step: Separating Planning from Realization in Neural Data-to-Text Generation. arXiv:1904.03396 [cs].

Daniel Naber. 2003. LanguageTool: A Rule-Based Style and Grammar Checker.

Joseph Olive, Caitlin Christianson, and John McCary, editors. 2011. Handbook of Natural Language Processing and Machine Translation: DARPA Global Autonomous Language Exploitation. Springer-Verlag, New York.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A Fast, Extensible Toolkit for Sequence Modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.

Ellie Pavlick, Pushpendre Rastogi, Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2015. PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 425–430, Beijing, China. Association for Computational Linguistics.

Matt Post. 2018. A Call for Clarity in Reporting BLEU Scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.

Aaditya Prakash, Sadid A. Hasan, Kathy Lee, Vivek Datla, Ashequl Qadir, Joey Liu, and Oladimeji Farri. 2016a. Neural Paraphrase Generation with Stacked Residual LSTM Networks. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 2923–2934, Osaka, Japan. The COLING 2016 Organizing Committee.

Aaditya Prakash, Sadid A. Hasan, Kathy Lee, Vivek Datla, Ashequl Qadir, Joey Liu, and Oladimeji Farri. 2016b. Neural Paraphrase Generation with Stacked Residual LSTM Networks. COLING, page 12.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language Models are Unsupervised Multitask Learners. OpenAI Blog, 1(8):9.

Aurko Roy and David Grangier. 2019. Unsupervised Paraphrasing without Translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6033–6039, Florence, Italy. Association for Computational Linguistics.

Tobias Schwartz and Diedrich Wolter. 2018. A Variant of Monte-Carlo Tree Search for Referring Expression Generation. In KI 2018: Advances in Artificial Intelligence, pages 315–326. Springer, Cham.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Improving Neural Machine Translation Models with Monolingual Data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96, Berlin, Germany. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. 2016. Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature, 529(7587):484–489.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. arXiv:1706.03762 [cs].

Yaushian Wang and Hung-Yi Lee. 2018. Learning to Encode Text as Human-Readable Summaries using Generative Adversarial Networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4187–4195, Brussels, Belgium. Association for Computational Linguistics.

Sam Witteveen and Martin Andrews. 2019. Paraphrasing with Large Language Models. In Proceedings of the 3rd Workshop on Neural Generation and Translation, pages 215–220, Hong Kong. Association for Computational Linguistics.

Wei Xu, Courtney Napoles, Ellie Pavlick, Quanze Chen, and Chris Callison-Burch. 2016. Optimizing Statistical Machine Translation for Text Simplification. Transactions of the Association for Computational Linguistics, 4:401–415.

Zhao Yan, Nan Duan, Junwei Bao, Peng Chen, Ming Zhou, Zhoujun Li, and Jianshe Zhou. 2016. DocChat: An Information Retrieval Approach for Chatbot Engines Using Unstructured Documents. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 516–525, Berlin, Germany. Association for Computational Linguistics.

Qian Yang, Dinghan Shen, Yong Cheng, Wenlin Wang, Guoyin Wang, Lawrence Carin, et al. 2019a. An End-to-End Generative Architecture for Paraphrase Generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3123–3133.

Yinfei Yang, Yuan Zhang, Chris Tar, and Jason Baldridge. 2019b. PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3678–3683, Hong Kong, China. Association for Computational Linguistics.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2019a. BERTScore: Evaluating Text Generation with BERT. arXiv:1904.09675 [cs].

Yuan Zhang, Jason Baldridge, and Luheng He. 2019b. PAWS: Paraphrase Adversaries from Word Scrambling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1298–1308, Minneapolis, Minnesota. Association for Computational Linguistics.