Summary Grounded Conversation Generation

Chulaka Gunasekara, Guy Feigenblat, Benjamin Sznajder, Sachindra Joshi, David Konopnicki∗
IBM Research AI
chulaka.gunasekara@ibm.com, {guyf, benjams}@il.ibm.com, {jsachind@in, davidko@il}.ibm.com

∗ Current address: david.konopnicki@booking.com

Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 3748–3756. August 1–6, 2021. ©2021 Association for Computational Linguistics

Abstract

Many conversation datasets have been constructed in recent years using crowd-sourcing. However, the data collection process can be time consuming and presents many challenges to ensure data quality. Since language generation has improved immensely in recent years with the advancement of pre-trained language models, we investigate how such models can be utilized to generate entire conversations, given only the summary of a conversation as input. We explore three approaches to generate summary grounded conversations, and evaluate the generated conversations using automatic measures and human judgements. We also show that the accuracy of conversation summarization can be improved by augmenting a conversation summarization dataset with generated conversations.

1 Introduction

Automatic conversation systems require large quantities of data to learn task-specific language patterns and underlying conversation policies. Such data either come from human-to-human conversation logs (Lowe et al., 2015; Hardalov et al., 2018) or are collected in crowd-sourced environments, where two or more crowd-workers play specific roles under some guidelines (Zhang et al., 2018; Budzianowski et al., 2018). Since real human-to-human conversation logs are scarce, many datasets have been created using the latter approach. However, crowd-sourced conversation data collection is time consuming, costly, and presents multiple challenges to ensure data quality (Kang et al., 2018).

Conversation summarization is an emerging research area that has been ill-studied due to the lack of large-scale datasets. Most existing public datasets in this domain are small; for example, the AMI meeting corpus (McCowan et al., 2005) contains 137 summary transcripts, and CRD3 (Rameshkumar and Bailey, 2020) is a spoken conversation dataset that consists of 159 conversations and summaries. Samsum (Gliwa et al., 2019), the only large-scale dataset for conversation summarization, contains over 16,000 open-domain conversations and summaries created artificially by humans.

Large-scale pre-trained language models (PLMs) (Lewis et al., 2020; Brown et al., 2020; Raffel et al., 2020) have been used in various text generation tasks (Budzianowski and Vulić, 2019; Min et al., 2020; Cachola et al., 2020). In recent studies, PLMs have also been used to generate training data for natural language processing (NLP) applications. For example, Anaby-Tavor et al. (2020) and Yang et al. (2020) use PLMs to create paraphrases for intent classifiers in conversation systems, and show that performance improves when the original datasets are augmented with the generated data. More recently, Mohapatra et al. (2020) generated entire conversations grounded on the instructions provided to crowd-workers, using a modular approach in which different PLMs are trained for different roles.

Our Contributions: We investigate how PLMs can be utilized to generate entire conversations that are grounded on a given summary. We explore three approaches: (1) Supervised Learning (SL) based conversation generation (SL-Gen), where a PLM is trained to generate an entire conversation, taking the summary of a conversation as input; (2) Reinforcement Learning (RL) based conversation generation (RL-Gen), where we further improve the SL-Gen method using the quality of the generated conversations as a reward; and (3) Controlled turn-by-turn conversation generation (CN-Gen), which allows us to generate conversations turn-by-turn, constrained on the summary and a set of pre-defined control parameters. We evaluate the quality of the generated conversations by conducting automatic and human evaluations. We also show that once a conversation summarization dataset is augmented with the generated conversations, the performance of the downstream summarization task is improved.
Figure 1: The RL based conversation generation framework

2 Summary grounded conversation generation

In the conversation summarization task, a model takes a conversation as input and learns to generate a summary. We study the inverse of that problem: the input to our model is a summary, and the model generates a conversation. In this section we propose three models for this task; the hyper-parameters used in training the models are given in Section A of the appendix.

2.1 SL based generation (SL-Gen)

A seq2seq model can be trained for this task by providing a summary as the input and generating a conversation token-by-token. As PLMs have shown significant improvement over the traditional seq2seq architecture for text generation, we use a GPT-2 model and fine-tune it to generate a conversation given a summary as the input. The input to the model concatenates the summary text followed by the conversation text, delimited by special tokens. We also use different token-type-ids to indicate the summary and the conversation text. The model is trained to optimize a cross-entropy loss.
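To make this concrete, the following is a minimal sketch of such a fine-tuning step with HuggingFace transformers. The delimiter tokens (<summary>, <conversation>) are placeholders of our own, since the paper's exact special tokens are not reproduced in this copy.

# Minimal sketch of SL-Gen style input construction and one training step.
# The special-token spellings are our placeholders, not the paper's.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.add_special_tokens({"additional_special_tokens": ["<summary>", "<conversation>"]})
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))

summary = "person0 will be late. person1 will order pasta for her."
conversation = "person0: I'll be late person1: I'll order some pasta for you."

sum_ids = tokenizer.encode("<summary> " + summary)
conv_ids = tokenizer.encode("<conversation> " + conversation)
input_ids = torch.tensor([sum_ids + conv_ids])

# token-type-ids mark summary positions (0) vs. conversation positions (1).
token_type_ids = torch.tensor([[0] * len(sum_ids) + [1] * len(conv_ids)])

# Standard causal-LM fine-tuning: labels = input_ids yields the cross-entropy loss.
loss = model(input_ids=input_ids, token_type_ids=token_type_ids, labels=input_ids).loss
loss.backward()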
2.2 RL based generation (RL-Gen)

Many studies train text generation models with RL (Paulus et al., 2018; Li et al., 2016), where the generator network is optimized with a task-specific reward. We investigate how the quality of the generated conversation can be used as a reward to improve the generation network. To this end, we train a summary generator network, which generates a summary given a conversation. We measure the quality of the generated conversation by identifying the similarity between the summary of the generated conversation (generated, in turn, by the summary generator network) and the ground truth summary. The similarity score is used as a reward to train the conversation generation model. Our RL based generation framework is shown in Figure 1, and the critical components are described below.

Conversation Generator: A trained SL-Gen model is used as the conversation generator, which, given a summary, can generate a conversation.

Summary Generator: We use a lightweight variant of BART (Lewis et al., 2019), named DistilBART, which is fine-tuned on the extreme summarization task (Narayan et al., 2018). We further fine-tune this instance on the conversation summarization data by providing the conversations as the input and training the model to output summaries.

Reward Model: Once the Summary Generator generates an output summary for the generated conversation, the reward model compares it with the ground truth summary, which was used to ground the conversation generation. As in Paulus et al. (2018), we use the ROUGE-2 F1-score as the reward.

Policy training: We use proximal policy optimization (Schulman et al., 2017) as the optimizer for the policy training, as it prevents the generator from deviating far away from the pretrained LM (Wu et al., 2020).
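To make the reward concrete, here is a minimal sketch of the reward computation as we read it; the pipeline call and the rouge_score utility are our tooling choices, not necessarily the authors' exact stack.

# Sketch of the RL-Gen reward: summarize the generated conversation, then
# score it against the grounding summary with ROUGE-2 F1.
from transformers import pipeline
from rouge_score import rouge_scorer

# Summary Generator: DistilBART fine-tuned for XSum (the further Samsum
# fine-tuning described above is omitted in this sketch).
summarizer = pipeline("summarization", model="sshleifer/distilbart-xsum-12-6")
scorer = rouge_scorer.RougeScorer(["rouge2"], use_stemmer=True)

def reward(generated_conversation, grounding_summary):
    predicted = summarizer(generated_conversation, max_length=80)[0]["summary_text"]
    # The ROUGE-2 F1 between the predicted and grounding summaries is the reward.
    return scorer.score(grounding_summary, predicted)["rouge2"].fmeasure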
2.3 Controlled conversation generation

We propose another approach, CN-Gen, for conversation generation, which grants more control over the properties of the generated conversations. Here, we generate one utterance of the conversation at a time, as opposed to RL-Gen, where we generate the whole conversation at once. The properties of the generated conversations are controlled by adding several components to the input sequence of the model. The following three variables are used as the control parameters: (1) Number of remaining turns to generate in the conversation (Num turns): during the generation of a turn, we indicate the remaining number of turns in the conversation; when generating an n-turn conversation, this starts at n for the first turn and reduces by 1 after the generation of each turn. (2) The speaker of the next turn (Speaker): this indicates to the model the speaker of the next turn. (3) The length of the next turn (Turn length): we define 3 categories of lengths: Short (≤ 3 tokens), Long (> 10 tokens) and Medium (otherwise).

We fine-tune a GPT-2 model on an input representation that concatenates the summary text, the dialog context, the Num turns value, the Speaker, and the Turn length, followed by the next utterance, with each field delimited by special tokens.
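A minimal sketch of what this turn-by-turn decoding could look like follows; the control-token spellings (<num_turns> etc.) are placeholders we invented, since the exact special tokens are not legible in this copy, and in practice the model would be a fine-tuned checkpoint with these tokens registered.

# Sketch of CN-Gen style turn-by-turn generation with control parameters.
import random
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")  # a fine-tuned checkpoint in practice
model = GPT2LMHeadModel.from_pretrained("gpt2")

def generate_conversation(summary, num_turns):
    turns = []
    for remaining in range(num_turns, 0, -1):                 # Num turns counts down to 1
        speaker = random.choice(["person0", "person1"])       # next speaker, sampled
        length = random.choice(["short", "medium", "long"])   # next-turn length bucket
        prompt = (f"<summary> {summary} <context> {' '.join(turns)} "
                  f"<num_turns> {remaining} <speaker> {speaker} "
                  f"<length> {length} <utterance>")
        ids = tokenizer.encode(prompt, return_tensors="pt")
        out = model.generate(ids, do_sample=True, top_p=0.95, max_new_tokens=40,
                             pad_token_id=tokenizer.eos_token_id)
        utterance = tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
        turns.append(f"{speaker}: {utterance.strip()}")
    return turns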
Changing these parameters allows us to generate different variants of conversations which are grounded on the same summary. During training, we obtain the values for the control parameters from the ground truth conversations; at inference we randomly select the next speaker, the number of turns of the conversation to be generated (in a range of 4-15 turns), and the next turn length. In Table 1 we show conversations of different lengths that were generated by the CN-Gen approach grounded on the same summary by changing the control parameters.

Summary: person0 will be late. person1 will order pasta with salmon and basil for her.

2-turn conversation:
- I'll be late
- I'll order some pasta with salmon and basil for you.

3-turn conversation:
- I'll be late.
- I'll order some pasta with salmon and basil for you.
- Thanks a lot!

6-turn conversation:
- Hello, I am going to be late.
- Ok
- I'll order some pasta with salmon and basil
- Ok, sounds good!
- Thank you!
- No problem

10-turn conversation:
- I'll be late
- ok
- do you want me to order something for you?
- pasta?
- Yes
- with salmon?
- Yes
- Ok
- how about basil?
- Yes please!

Table 1: Multiple conversations generated by the CN-Gen approach grounded on the same summary

A summary and a conversation from the Samsum dataset (Gliwa et al., 2019), along with the conversations generated by the three aforementioned algorithms, are shown in Figure 2. More examples are provided in Section B of the Appendix.
                                                                                                erated conversations using the three models. Three
3 Experiments

We experiment on the Samsum (Gliwa et al., 2019) dataset, which, to the best of our knowledge, is the only public large-scale conversation summarization dataset. We pre-process the dataset by replacing the personal names (ex: John) with unique tags (ex: <person0>). First, we evaluate the quality of the generated conversations using automatic measures and human judgments, and then assess the performance of the generated conversations in a downstream summarization task after augmentation.
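A small sketch of this kind of pre-processing follows; the name-detection step is a naive stand-in (the paper does not specify how names are detected), and only the tag format follows the text above.

# Sketch: replace personal names with unique <personN> tags, consistently
# within each dialog. Name detection here is a naive stand-in.
def anonymize(dialog, names):
    mapping = {}
    for name in names:  # names would come from the speaker list or an NER step
        mapping.setdefault(name, "<person%d>" % len(mapping))
    for name, tag in mapping.items():
        dialog = dialog.replace(name, tag)
    return dialog

print(anonymize("John: hi! Amanda: hey John", ["John", "Amanda"]))
# -> <person0>: hi! <person1>: hey <person0>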

Figure 2: Examples of conversations grounded on the same summary. The key terms are highlighted in colors.

3.1 Quality of the generated conversations

We evaluate the quality of the conversations generated by the three approaches introduced in Section 2. In Table 2 we show the properties of the generated conversations and of the ground truth conversations on the test set of the Samsum dataset.

Model        | Ave. Turns   | Ave. Tokens/Turn
Ground truth | 11.55 ± 6.48 | 7.10 ± 6.29
SL-Conv-Gen  | 10.54 ± 6.80 | 5.69 ± 4.40
RL-Conv-Gen  |  8.40 ± 4.78 | 5.14 ± 3.64
CN-Conv-Gen  |  9.70 ± 5.67 | 5.62 ± 4.05

Table 2: Properties of the generated conversations.

Automatic Evaluation: We trained the conversation generation models on the Samsum training set and generated conversations for the test set. We compare the generated conversations with the ground truth conversations using the measures used by Sharma et al. (2017) to evaluate conversation system responses. The results, shown in Table 3, suggest that CN-Gen outperforms SL-Gen and RL-Gen on all measures.

Model  | BLEU-4 | METEOR | ROUGE-L
SL-Gen | 2.81   | 12.06  | 21.53
RL-Gen | 3.53   | 12.29  | 25.40
CN-Gen | 4.94   | 15.64  | 26.22

Table 3: Evaluation of generated conversations against ground truth conversations

We also compare the summaries of the generated conversations (generated by the Summary Generator) with the ground truth summaries; the results are shown in Table 4. We believe that this is a semantic evaluation of the conversations, as the summaries capture the crux of the conversations. According to the results, CN-Gen outperforms the other two methods. This, along with the previous result, suggests that the conversations produced by CN-Gen are the most similar to the ground truth conversations.

Model  | ROUGE-1 | ROUGE-2 | ROUGE-L
SL-Gen | 46.85   | 25.29   | 45.97
RL-Gen | 52.51   | 31.23   | 51.68
CN-Gen | 53.46   | 32.52   | 52.93

Table 4: ROUGE F1 evaluation of summaries of conversations against the ground truth summaries
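For reference, here is a hedged sketch of computing two of these measures with off-the-shelf tooling (nltk and rouge_score are our choices; the paper follows the evaluation scripts of Sharma et al. (2017), which may differ in tokenization and smoothing).

# Sketch: BLEU-4 and ROUGE-L between a generated and a ground-truth conversation.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "person0: I'll be late person1: I'll order some pasta for you"
hypothesis = "person0: I'm running late person1: I'll order pasta"

bleu4 = sentence_bleu([reference.split()], hypothesis.split(),
                      smoothing_function=SmoothingFunction().method1)
rougeL = rouge_scorer.RougeScorer(["rougeL"]).score(reference, hypothesis)["rougeL"].fmeasure
print("BLEU-4: %.3f  ROUGE-L: %.3f" % (bleu4, rougeL))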

Human Evaluation: To evaluate the quality of the generated conversations, we randomly selected 50 summaries from the Samsum test set and generated conversations using the three models. Three NLP experts were then asked to read the ground truth summary and rate the four conversations (3 generated and the ground truth conversation) on a [1-5] scale according to Grammaticality, Coherency, and Informativeness with respect to the ground truth summary. Results are shown in Table 5. As expected, the ground-truth conversations obtained the highest scores on all three aspects and can be considered as an upper bound for this task. RL-Gen and CN-Gen obtained higher scores than SL-Gen, and relatively good scores compared to the ground truth conversations. This corroborates the assumption that our proposed models generate high quality conversations. A Welch two-sample t-test (Welch, 1947) shows that both the RL-Gen and CN-Gen models outperform the SL-Gen model statistically significantly, with p < 0.0001. However, there is no statistically significant difference between the results obtained from RL-Gen and CN-Gen. We report in Table 6 the average quadratic Cohen's Kappa calculated over the three possible combinations of two judges (Toledo et al., 2019).

Model        | Info | Gram | Cohe
Ground-Truth | 4.56 | 4.46 | 4.47
SL-Gen       | 2.22 | 2.85 | 2.37
RL-Gen       | 3.20 | 3.50 | 3.14
CN-Gen       | 3.10 | 3.43 | 3.09

Table 5: Human evaluation of generated conversations

Model        | Info | Gram | Cohe
Ground-Truth | 0.04 | 0.22 | 0.25
SL-Gen       | 0.35 | 0.26 | 0.42
RL-Gen       | 0.47 | 0.35 | 0.45
CN-Gen       | 0.60 | 0.40 | 0.60

Table 6: Average Cohen's Kappa for human evaluation of generated conversations

CN-Gen obtained the best scores in the automatic evaluation, while RL-Gen got the best scores in the human evaluation. The CN-Gen conversations are longer than the RL-Gen conversations by 1.3 turns on average (see Table 2), and hence contain more word overlap with the ground truth. This results in better automatic evaluation scores for CN-Gen, while the human judges prefer the short, targeted conversations generated by RL-Gen.

3.2 Evaluation on the summarization task

To further evaluate the quality of the generated conversations, we augmented a conversation summarization dataset with generated conversations and evaluated the resulting summarization model. We followed this process: (1) we randomly selected x% of the summaries of the dataset and trained our conversation generation models; (2) the trained models were applied to the other (y = 100 - x)% of the summaries to generate conversations; (3) those generated conversations, along with their original summaries, were added to the data, so that this approach adds an extra y% of (summary, conversation) pairs to the training data; (4) the conversation summarization model (discussed in Section 2 under 'Summary Generator') was trained on the augmented data. We compare the performance of the conversation summarization model trained on the original dataset and with augmentation; a sketch of this pipeline follows.
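Here is a schematic sketch of this augmentation loop under our assumptions about the data structures; train_generator and train_summarizer are hypothetical stand-ins for the training procedures described in Section 2.

# Sketch of the 4-step augmentation procedure; the helpers are stand-ins.
import random

def train_generator(pairs):
    # Stand-in for SL/RL/CN-Gen training; returns a trivial "generator".
    return lambda summary: "person0: ... person1: ..."

def train_summarizer(pairs):
    # Stand-in for fine-tuning the DistilBART summarizer on the pairs.
    return pairs

def augment(dataset, x=0.7):
    # dataset: list of (summary, conversation) pairs; x: generator-training share.
    random.shuffle(dataset)
    cut = int(len(dataset) * x)
    gen_train, held_out = dataset[:cut], dataset[cut:]
    generator = train_generator(gen_train)                 # step 1: train on x%
    synthetic = [(s, generator(s)) for s, _ in held_out]   # step 2: generate for the rest
    augmented = dataset + synthetic                        # step 3: add extra y% pairs
    return train_summarizer(augmented)                     # step 4: retrain summarizer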

Automatic Evaluation: We compare the three conversation generation methods at different augmentation percentages; the results are shown in Table 7. At all augmentation levels, the summarization models trained with augmented data outperform the summarization model trained on the original dataset (without augmentation). CN-Gen based augmentation produces the best accuracy compared to the other two methods. One prevalent pattern is that, as the augmentation data increases, the accuracy increases up to a certain point and then starts to decrease; the best accuracies were found around 30% data augmentation. We believe that more augmentation leads performance to drop for the following reason: to augment with more data, we are left with less data to train the conversation generation models (for 10% augmentation, the conversation generation models are trained on 90% of the data, while for 50% augmentation, the models are trained on only 50% of the data). Therefore, as the augmentation increases, the quality of the generated conversations goes down, which leads to overall smaller gains in the summarization task after some point.

Method | Augmentation % | ROUGE-1 | ROUGE-2 | ROUGE-L
-      | 0% (Original)  | 51.84   | 30.98   | 43.98
SL-Gen | 10%            | 52.82   | 31.99   | 44.89
SL-Gen | 20%            | 52.90   | 32.01   | 44.97
SL-Gen | 30%            | 52.88   | 32.02   | 45.01
SL-Gen | 40%            | 52.61   | 31.98   | 44.96
SL-Gen | 50%            | 52.55   | 31.98   | 44.80
RL-Gen | 10%            | 52.93   | 32.05   | 44.92
RL-Gen | 20%            | 53.30   | 32.15   | 45.20
RL-Gen | 30%            | 53.81   | 32.21   | 45.77
RL-Gen | 40%            | 52.86   | 32.06   | 44.99
RL-Gen | 50%            | 52.64   | 32.07   | 44.88
CN-Gen | 10%            | 53.29   | 32.36   | 45.08
CN-Gen | 20%            | 53.36   | 32.53   | 45.27
CN-Gen | 30%            | 54.02   | 33.28   | 46.06
CN-Gen | 40%            | 52.14   | 31.76   | 44.14
CN-Gen | 50%            | 52.36   | 31.75   | 44.85

Table 7: ROUGE F-1 evaluation on the Samsum test set.

To neutralize the effect of increasing the number of data points during augmentation, we experimented with a baseline which over-samples the original training data at different percentages to obtain the same number of training instances as the augmented datasets. While the ROUGE-2 obtained with the original training data is 30.98, oversampling at 10%, 20%, 30%, 40% and 50% only changes the ROUGE-2 to 30.55, 30.38, 30.74, 30.99 and 30.27 respectively. Hence, oversampling hardly changes the ROUGE scores obtained by training with the original dataset, while augmentation according to our algorithms shows significantly improved scores (as shown in Table 7).

Human Evaluation: We recruited 3 NLP experts to evaluate 50 instances of summaries generated with data augmentation (RL-Gen, CN-Gen) and the respective summaries generated without augmentation (No-Aug). Here we consider two aspects with respect to a ground-truth summary: Coherency (whether the summary is easy to read) and Focus (whether the summary represents the ground-truth summary). Following Amplayo and Lapata (2020), we use the Best-Worst Scaling method: the score of each system is computed as the percentage of times it was chosen as Best minus the percentage of times it was chosen as Worst. On the Coherency question, RL-Gen, CN-Gen and No-Aug obtained scores of 12.6, 6.6 and -4.0 respectively. On the Focus question, RL-Gen, CN-Gen and No-Aug obtained scores of 14.6, 6.0 and -2.6 respectively. These results confirm that the use of augmentation improves the quality of the summaries.
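In other words, the Best-Worst score reduces to a percentage difference over judgements; a one-function sketch with illustrative counts:

# Best-Worst Scaling: percent chosen Best minus percent chosen Worst.
def bws_score(best_count, worst_count, total_judgements):
    return 100.0 * (best_count - worst_count) / total_judgements

# Illustrative counts only, not the actual tallies behind the scores above.
print(bws_score(30, 11, 150))  # -> 12.666...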
4 Conclusion

We investigated how PLMs can be utilized to generate entire conversations that are grounded on a summary. We proposed three approaches for conversation generation, SL-Gen, RL-Gen and CN-Gen, and conducted multiple automatic and human evaluations to assess the quality of the generated conversations. Both the automatic and the human evaluations show that, when compared to the ground truth conversations, RL-Gen and CN-Gen obtain high scores, suggesting that the proposed models generate high quality conversations. When a conversation summarization dataset is augmented with the generated conversations, the performance of conversation summarization is improved (over 7% relative improvement in ROUGE-2 F-1), which also suggests that the proposed methods generate high quality conversations.

5 Ethics

We have used the publicly available Samsum dataset (https://huggingface.co/datasets/samsum). For the human evaluation of both conversations and summaries, we recruited 3 NLP researchers, who have graduate degrees in NLP and Machine Learning. The annotation task itself was executed on the Appen.com platform. Before the official annotation, we sampled 10 tasks to get an estimate of the duration of the task, and to make sure the instructions were clear enough.
References

Reinald Kim Amplayo and Mirella Lapata. 2020. Unsupervised opinion summarization with noising and denoising. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1934–1945.

Ateret Anaby-Tavor, Boaz Carmeli, Esther Goldbraich, Amir Kantor, George Kour, Segev Shlomov, Naama Tepper, and Naama Zwerdling. 2020. Do not have enough data? Deep learning to the rescue! In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7383–7390.

Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.

Paweł Budzianowski and Ivan Vulić. 2019. Hello, it's GPT-2 - how can I help you? Towards the use of pre-trained language models for task-oriented dialogue systems. In Proceedings of the 3rd Workshop on Neural Generation and Translation, pages 15–22.

Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gasic. 2018. MultiWOZ - a large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5016–5026.

Isabel Cachola, Kyle Lo, Arman Cohan, and Daniel S Weld. 2020. TLDR: Extreme summarization of scientific documents. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 4766–4777.

Bogdan Gliwa, Iwona Mochol, Maciej Biesek, and Aleksander Wawer. 2019. SAMSum corpus: A human-annotated dialogue dataset for abstractive summarization. In Proceedings of the 2nd Workshop on New Frontiers in Summarization, pages 70–79.

Momchil Hardalov, Ivan Koychev, and Preslav Nakov. 2018. Towards automated customer support. In International Conference on Artificial Intelligence: Methodology, Systems, and Applications, pages 48–59. Springer.

Yiping Kang, Yunqi Zhang, Jonathan K Kummerfeld, Lingjia Tang, and Jason Mars. 2018. Data collection for dialogue system: A startup perspective. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers), pages 33–40.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880.

Jiwei Li, Will Monroe, Alan Ritter, Dan Jurafsky, Michel Galley, and Jianfeng Gao. 2016. Deep reinforcement learning for dialogue generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1192–1202.

Ryan Lowe, Nissan Pow, Iulian Vlad Serban, and Joelle Pineau. 2015. The Ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 285–294.

Iain McCowan, Jean Carletta, Wessel Kraaij, Simone Ashby, S Bourban, M Flynn, M Guillemot, Thomas Hain, J Kadlec, Vasilis Karaiskos, et al. 2005. The AMI meeting corpus. In Proceedings of the 5th International Conference on Methods and Techniques in Behavioral Research, volume 88, page 100. Citeseer.

Sewon Min, Julian Michael, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2020. AmbigQA: Answering ambiguous open-domain questions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5783–5797.

Biswesh Mohapatra, Gaurav Pandey, Danish Contractor, and Sachindra Joshi. 2020. Simulated chats for task-oriented dialog: Learning to generate conversations from instructions. arXiv preprint arXiv:2010.10216.

Shashi Narayan, Shay B Cohen, and Mirella Lapata. 2018. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1797–1807.

Romain Paulus, Caiming Xiong, and Richard Socher. 2018. A deep reinforced model for abstractive summarization. In International Conference on Learning Representations.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Revanth Rameshkumar and Peter Bailey. 2020. Storytelling with dialogue: A Critical Role Dungeons and Dragons dataset. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5121–5134.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

Shikhar Sharma, Layla El Asri, Hannes Schulz, and Jeremie Zumer. 2017. Relevance of unsupervised metrics in task-oriented dialogue for evaluating natural language generation. CoRR, abs/1706.09799.

Assaf Toledo, Shai Gretz, Edo Cohen-Karlik, Roni Friedman, Elad Venezian, Dan Lahav, Michal Jacovi, Ranit Aharonov, and Noam Slonim. 2019. Automatic argument quality assessment - new datasets and methods. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5629–5639.

Bernard L Welch. 1947. The generalization of 'Student's' problem when several different population variances are involved. Biometrika, 34(1/2):28–35.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2019. Huggingface's transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.

Qingyang Wu, Lei Li, and Zhou Yu. 2020. TextGAIL: Generative adversarial imitation learning for text generation. arXiv preprint arXiv:2004.13796.

Yiben Yang, Chaitanya Malaviya, Jared Fernandez, Swabha Swayamdipta, Ronan Le Bras, Ji-Ping Wang, Chandra Bhagavatula, Yejin Choi, and Doug Downey. 2020. G-DAug: Generative data augmentation for commonsense reasoning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 1008–1025.

Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2204–2213.

A Model Training and Hyperparameter Details

A.1 Supervised Conversation Generation (SL-Conv-Gen)

We fine-tune a GPT-2 language model using the implementation available at HuggingFace (Wolf et al., 2019). The hyper-parameters used during training and inference are shown below. The model takes around 6 hours to train on 2 V100 GPUs (single machine).

model_name_or_path: gpt2
per_gpu_train_batch_size: 4
per_gpu_eval_batch_size: 4
gradient_accumulation_steps: 4
learning_rate: 6.25e-5
adam_epsilon: 1e-8
max_grad_norm: 1.0
num_train_epochs: 10
warmup_steps: 500
min_length: 20
max_length: 512
top_k: 0
top_p: 0.95

A.2 Summary Generator

We use a DistilBART instance (https://huggingface.co/sshleifer/distilbart-cnn-12-6) fine-tuned on the extreme summarization (XSum) task, and we fine-tune this model further on the Samsum dataset. The model takes around 12 hours to train on 2 V100 GPUs (single machine). The hyperparameters used for training the DistilBART model are as follows:

train_batch_size: 4
eval_batch_size: 4
num_train_epochs: 10
model_name_or_path: sshleifer/distilbart-xsum-12-6
learning_rate: 3e-5
val_check_interval: 0.1
max_source_length: 512
max_target_length: 80

A.3 Reinforcement Learning based conversation generation (RL-Conv-Gen)

To train the RL based conversation generation model, we adapted a publicly available Proximal Policy Optimization (PPO) implementation (https://github.com/lvwerra/trl). The model takes around 12 hours to train on 2 V100 GPUs (single machine). The following hyper-parameters were used to train the model:

steps: 10000
batch_size: 16
forward_batch_size: 4
learning_rate: 1.41e-5
init_kl_coef: 0.2
target: 6
horizon: 10000
gamma: 1
lam: 0.95
cliprange: 0.2
cliprange_value: 0.2
vf_coef: 0.1
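Since the policy training adapts the lvwerra/trl PPO implementation linked above, the training loop would look roughly like the sketch below. This mirrors the early trl API (GPT2HeadWithValueModel, respond_to_batch, PPOTrainer.step), which has changed in later trl releases, so treat the names as indicative rather than exact.

# Rough sketch of one PPO update with the early trl API (names indicative).
import torch
from transformers import GPT2Tokenizer
from trl.gpt2 import GPT2HeadWithValueModel, respond_to_batch
from trl.ppo import PPOTrainer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2HeadWithValueModel.from_pretrained("gpt2")      # policy (SL-Gen init in the paper)
ref_model = GPT2HeadWithValueModel.from_pretrained("gpt2")  # frozen KL reference

ppo_config = {"batch_size": 1, "forward_batch_size": 1, "lr": 1.41e-5,
              "init_kl_coef": 0.2, "target": 6, "gamma": 1, "lam": 0.95}
ppo_trainer = PPOTrainer(model, ref_model, **ppo_config)

query = tokenizer.encode("person0 will be late.", return_tensors="pt")
response = respond_to_batch(model, query)          # sample a conversation continuation
reward = torch.tensor([0.31])                      # stand-in for the ROUGE-2 F1 reward
stats = ppo_trainer.step(query, response, reward)  # one PPO update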

B Sample summaries with corresponding ground-truth

Figure 3 shows some samples of dialogs with their corresponding summaries: the ground-truth dialogs and the automatically generated ones.

Summary: Person0 closed some deals today. Person1 didn't manage to do it.

Ground Truth Dialog:
- It's been a long day.
- Same here, Person0, good or bad?
- Very busy. I closed some deals but I had a lot of stress also.
- Lucky you. I couldn't close any deal. It annoys me. Some customers are negotiating for days, weeks.
- Maybe they're just fishing. Want to compare offers.
- That's what they do.

SL-Gen:
- Hi, are you still alive?
- Hi, yes, still alive.
- I closed some deals today.
- Yeah, that's awesome.
- Thanks.

RL-Gen:
- Hey, how are you?
- Good, busy as usual.
- I closed some deals today.
- Too late, didn't manage to do it.
- Sorry.

CN-Gen:
- Did you see this? <file_other>
- Cool. What's up?
- I closed some deals today and I feel good about it.
- Great. I didn't manage to do it today. I'm sorry. I had no choice.
- I'm sorry.
- I didn't have the tools.

Summary: Person0 bought a table, six chairs, a vase and a pile of clothes at the second hand shop downtown. She paid 70 euros for everything.

Ground Truth Dialog:
- I went to the second hand shop downtown.
- Cool Person0 What did you get?
- Lots of stuff. A table, six chairs, a vase, a pile of clothes
- That's really a lot of stuff Person0. Send us pictures
- <photo_file>
- Beautiful table. Is it wood?
- Yes, oak.
- Awesome! These are the clothes? I love the black dress.
- It's absolutely beautiful. As soon as I saw it I knew it would be mine. I'm so happy with what I bought.

SL-Gen:
- I bought the second hand shop downtown.
- So what happened?
- I bought everything.
- What do you mean?
- I paid 70 euro for everything.
- How much?
- 6 chairs, six chairs, a vase and a pile of clothes.
- Sounds like a bargain.
- You should be in business.

RL-Gen:
- Do you remember the second hand in downtown?
- Yes, I do. What's inside?
- A table, six chairs, a vase and a pile of clothes.
- Is it expensive?
- 70 euros.

CN-Gen:
- Look what I've bought! A table, six chairs, a vase. I bought a table, six chairs, a vase and a pile of clothes.
- How much did you pay for everything?
- I paid 70 euro for everything. And I thought it would cost me a fortune!
- I don't even know how much.
- I am glad.
- Me too. 70 euro for everything, including a vase!!!

Summary: Person1 is not at home. Person0 wants Person1 to keep her pasta in the microwave.

Ground Truth Dialog:
- U both at home?
- No. I've just got ur msg. Why did u ask about it?
- No reason. Keep my pasta in the microwave.
- I haven't cooked anything.

SL-Gen:
- How are you doing?
- I'm not at home
- Why?
- I wanted to keep some pasta in the microwave.
- No need to.
- In the microwave?
- Yes.

RL-Gen:
- Are you at home?
- No, I'm still at work.
- Good, I should be home in about 15 minutes.
- Can you put my pasta in the microwave?
- Sure, I'll check it.
- Thanks!

CN-Gen:
- Are you at home?
- No, I'm busy at work.
- What do you want?
- I want to put some pasta in microwave.
- OK. I'll try.

Summary: Person0 needs Person1's help as he cannot get the application running.

Ground Truth Dialog:
- Person1, I need your help?
- What's up, mate?
- I can't get the application running.
- Have you switched the computer on?
- Very funny!!!!!!!!!!!!!
- OK. Sorry. I can see it's serious.
- Yeah, man. It is f**cking serious.
- I'll be with you right now.
- Thanks.

SL-Gen:
- Hey! I have a problem with the application.
- Hello, Person0
- Could you please help me to get the application running? I cannot.
- Why?
- I have a problem with my laptop. The application is running on the USB port. I cannot access the computer. The computer is running. If I try to access the computer, the application will start. I hope that you can help me.
- Okay.
- Thank you, Person1.

RL-Gen:
- Person1, you there?
- Yeah, what's up?
- Do you need my help?
- What's wrong?
- I can't get the application running.
- Did you try charging it first?
- Yes, I tried that, but it doesn't work.
- Try refreshing the database.
- Damn, thanks!

CN-Gen:
- I cannot get the application running.
- What's up?
- Can you get the application running?
- You sure you can't? Try and click on it.
- You're so funny, Person1.
- OK. I'm coming.
- Thanks.

Summary: Person0 and Person1 will meet the new person in an hour.

Ground Truth Dialog:
- Can we meet with the new person soon?
- Sure. In an hour okay?
- Perfect.

SL-Gen:
- Hi, babe.
- Hi.
- We're meeting in an hour?
- I'd prefer to be there in an hour.
- Great!

RL-Gen:
- Did you get that message?
- No, why?
- I might have lost it.
- Oh no, it's my mistake. I'm sorry.
- Ok, don't worry. We can meet in an hour?
- Ok!

CN-Gen:
- Wanna meet the new person?
- Sure, I'll be there in an hour.
- Perfect!

Figure 3: Samples of dialogs with their corresponding summaries: the ground-truth dialogs and the automatically generated ones
