ConvoSumm: Conversation Summarization Benchmark and Improved Abstractive Summarization with Argument Mining

Alexander R. Fabbri† Faiaz Rahman† Imad Rizvi† Borui Wang†
Haoran Li‡ Yashar Mehdad‡ Dragomir Radev†
†Yale University ‡Facebook AI
{alexander.fabbri, faiaz.rahman, imad.rizvi, borui.wang, dragomir.radev}@yale.edu
{aimeeli, mehdad}@fb.com

Abstract

While online conversations can cover a vast amount of information in many different formats, abstractive text summarization has primarily focused on modeling solely news articles. This research gap is due, in part, to the lack of standardized datasets for summarizing online discussions. To address this gap, we design annotation protocols motivated by an issues–viewpoints–assertions framework to crowdsource four new datasets on diverse online conversation forms of news comments, discussion forums, community question answering forums, and email threads. We benchmark state-of-the-art models on our datasets and analyze characteristics associated with the data. To create a comprehensive benchmark, we also evaluate these models on widely-used conversation summarization datasets to establish strong baselines in this domain. Furthermore, we incorporate argument mining through graph construction to directly model the issues, viewpoints, and assertions present in a conversation and to filter noisy input, showing comparable or improved results according to automatic and human evaluations.

1 Introduction

Automatic text summarization is the process of outputting the most salient parts of an input in a concise and readable form. Recent work in summarization has made significant progress due to the introduction of large-scale datasets such as the CNN-DailyMail dataset (Nallapati et al., 2016) and the New York Times dataset (Sandhaus, 2008). Furthermore, the use of large self-supervised pretrained models such as BART (Lewis et al., 2020) and Pegasus (Zhang et al., 2019) has achieved state-of-the-art performance across summarization tasks and strong performance in zero- and few-shot settings (Fabbri et al., 2020a). However, less work has focused on summarizing online conversations.

Headline: SuperBowl
Snippet: Whether you're a football fan or not, what do you like about Super Bowl Sunday?
Comment: ... In my opinion I think the Falcons will stomp the patriots. I think Tom Brady will choke the Super Bowl. ...
Comment: I am big Arizona Cardinals fan so when they didn't even make the playoffs i was upset. ...
Comment: I'm not a very big football fan at all. So when it comes to Superbowl Sunday, I'm in it for the commercials and the half time show. ...
Comment: I am not exactly a football fan, but I enjoy watching the Super Bowl....
...
Summary: Several commenters list their favorite things about the Super Bowl, including half-time shows, the funny commercials, the Puppy Bowl, eating food, and spending time with family. A couple of commenters admit to not being football fans but still enjoying the Super Bowl. Some commenters discuss whether they thought the Falcons or the Patriots were going to win, while others list teams they wish were in the game.

Table 1: Example summary of comments from a New York Times article discussing people's favorite parts of the Super Bowl. The summary is an analysis of the comments and quantifies the viewpoints present.

Unlike documents, articles, and scientific papers, which contain specific linguistic structures and conventions such as topic sentences and abstracts, conversational text scatters main points across multiple utterances and between numerous writers. As a result, text summarization in the conversational data domain offers a challenging research field in which to test newly-developed models (Chen and Yang, 2020).

Recently, Gliwa et al. (2019a) introduced a dataset for chat-dialogue conversation summarization consisting of 16k examples, the first large-scale dataset of its kind. Previous work in conversation summarization was limited by the data available and focused primarily on meeting summarization, such as the AMI (Kraaij et al., 2005) and ICSI (Janin et al., 2003) datasets.

The datasets used in recent conversation papers are often not uniform, ranging from visual dialogue data (Goo and Chen, 2018a) to customer-service dialogues (Yuan and Yu, 2019), and were not initially intended for summarization. The availability of benchmark datasets for comparing methods has limited work in other conversation summarization domains and thus likely inhibited progress (Kryscinski et al., 2019; Fabbri et al., 2020b).

We aim to address this research gap by crowdsourcing a suite of four datasets, which we call ConvoSumm, that can evaluate a model's performance on a broad spectrum of conversation data. In determining the domains of data to collect, we use the general definition of conversation as "any discourse produced by more than one person" (Ford, 1991). We identify several key categories of data for which standard human-created development and testing datasets do not exist, namely (1) news article comments, (2) discussion forums and debate, (3) community question answering, and (4) email threads. We design annotation protocols motivated by work in quantifying viewpoints present in news comment data (Barker and Gaizauskas, 2016a) to crowdsource 250 development and 250 test examples for each of the above domains. We provide an example of comments on a New York Times news article, along with our crowdsourced summary, in Table 1.

In addition to introducing manually-curated datasets for conversation summarization, we also aim to unify previous work in conversation summarization. Namely, we benchmark a state-of-the-art abstractive model on several conversation datasets: dialogue summarization from SAMSum (Gliwa et al., 2019b), heuristic-generated community question answering from CQASumm (Chowdhury and Chakraborty, 2018), meeting summarization data from AMI and ICSI, and smaller test sets in the news comments, discussion forum, and email domains. We believe that such benchmarking will facilitate a more straightforward comparison of conversation summarization models across domains.

To unify modeling across these conversational domains, we propose to use recent work in end-to-end argument mining (Lenz et al., 2020; Stab and Gurevych, 2014; Chakrabarty et al., 2019) to instantiate the theoretical graph framework that motivated our annotation protocol, proposed by Barker and Gaizauskas (2016a) for conversation summarization. This protocol is employed to both identify and use the "issues–viewpoints–assertions" argument structure (discussed in Related Work) for summarizing news comments. We construct this argument graph using entailment relations, linearize the graph, train a graph-to-text model (Ribeiro et al., 2020), and experiment with argument mining as a way to reduce noise in long-text input.

Our contributions are the following: (1) we crowdsource datasets for four domains of conversational data and analyze the characteristics of our proposed datasets; (2) we benchmark state-of-the-art models on these datasets as well as on previous widely-used conversation summarization datasets to provide a clear baseline for future work; and (3) we apply argument mining to better model the structure of our conversational data as well as to reduce noise in long-text input, showing comparable or improved results in both automatic and human evaluations.¹

¹For reproducibility of our findings, we will make our data and code publicly available at https://github.com/Yale-LILY/ConvoSumm.

2 Related Work

Modeling Conversation Summarization Early approaches to conversation summarization consisted of feature engineering (Shasha Xie et al., 2008), template selection methods (Oya et al., 2014), and statistical machine learning approaches (Galley, 2006; Wang and Cardie, 2013). More recent modeling approaches for dialogue summarization have attempted to take advantage of conversation structures found within the data through dialogue act classification (Goo and Chen, 2018b), discourse labeling (Ganesh and Dingliwal, 2019), topic segmentation (Liu et al., 2019c), and key-point analysis (Liu et al., 2019a). Chen and Yang (2020) utilize multiple conversational structures from different perspectives in their sequence-to-sequence model. However, such approaches focus exclusively on dialogue summarization, and it is not trivial to extend these methods to longer conversations with many more participants. We thus introduce a method to model the structure of the discourse over the many-party conversation.

Several existing works have focused on conceptualizing conversation structure for summarization and how to present this structure to end-users. Barker et al. (2016a) propose a conversation overview summary that aims to capture the key argumentative content of a reader comment conversation.

Misra et al. (2017) use summarization as a means of probing online debates to discover central propositions, which they cluster to identify argument facets. Barker and Gaizauskas (2016b) identify three key components of conversational dialogue: issues (that individuals discuss), viewpoints (that they hold about these issues), and assertions (that they make to support their viewpoints). We build on this framework and on advances in argument mining for end-to-end training for summarization.

Argument Mining Work in argument mining (Stab and Gurevych, 2014) has aimed to identify argumentative units and classify them into claims, premises, and major claims, i.e., claims describing the key concept in a text. More recently, Chakrabarty et al. (2019) propose to fine-tune BERT (Devlin et al., 2019) for identifying argumentative units and relationships between them within a text and across texts. Lenz et al. (2020) are the first to propose an end-to-end approach for constructing an argument graph (Stede et al., 2016), a structured representation of claims and premises in an argumentative text; the graph is built by connecting claim and premise argumentative discourse units. We build on this framework for modeling discourse in conversational data.

Few-Shot Summarization As the datasets we introduce are not on the scale of larger datasets, we focus on few-shot and domain-transfer summarization techniques. Wang et al. (2019) examine domain adaptation in extractive summarization, while Hua and Wang (2017) examine domain adaptation between opinion and news summarization. Within unsupervised abstractive summarization, several approaches have made use of variational autoencoders (Baziotis et al., 2019; Chu and Liu, 2019; Bražinskas et al., 2020) and pretrained language models (Zhou and Rush, 2019; Laban et al., 2020).

Recent work in abstractive (Zhang et al., 2019; Fabbri et al., 2020a) and extractive-compressive summarization (Desai et al., 2020) has shown the power of pretrained models for few-shot transfer. The quality of models trained on several hundred examples in these papers is comparable to that of models trained on the equivalent full datasets. Thus, we believe that introducing curated validation and testing datasets consisting of a few hundred examples is a valuable contribution within the current paradigm; this is confirmed by the poor performance of models transferred from other domains compared to models trained on our validation data.

3 ConvoSumm

In this section, we introduce our dataset selection, our annotation protocol, and the characteristics of our crowdsourced dataset.

Data Selection For the news comments subdomain, we use the NYT Comments dataset, which consists of 2 million comments made on 9,000 New York Times articles published between 2017 and 2018. It is publicly available and has been used in work on news-comment relevance modeling (Kolhatkar and Taboada, 2017); it also contains metadata that may be of use in summarization modeling. For the discussion forums and debate subdomain, we select Reddit data from CoarseDiscourse (Zhang et al., 2017), which contains annotations about the discourse structure of the threads. For the community question answering subdomain, we use StackExchange (Stack), which provides access to all forums and has been used in modeling for answer relevance and question deduplication (Hoogeveen et al., 2015). We chose StackExchange over the commonly-used Yahoo! Answers data due to licensing reasons. For the email threads subdomain, we use the publicly-available W3C corpus (Craswell et al., 2005). Previous work also made use of this dataset for email summarization (Ulrich et al., 2008) but provided only a small sample of 40 email threads, for which we provide transfer testing results.

We generally follow the guidance of Tomasoni and Huang (2010), drawn from their work on summarizing community question answering forums, in determining which subsets of data to select from the above datasets. We remove an example if (1) there were fewer than five posts (four in the case of email threads; "post" refers to any answer, comment, or email); (2) the longest post was over 400 words; (3) the sum of all post lengths was outside of [100, 1400] words (although we extended this maximum length for NYT comments); or (4) the average length of the posts was outside of the [50, 300] word interval (a sketch of this filter is shown below). For Stack data, we first filtered answers which received a negative community rating, defined as the number of user upvotes minus the number of user downvotes. While real-world settings may contain much longer threads, we later show that this setting is already challenging.
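The sketch below is our Python illustration of this filter; representing a thread as a list of post strings is an assumption made for clarity, and this is not the released preprocessing code.

    # Illustrative sketch of the length-based selection criteria above.
    # A thread is assumed to be a list of post strings; min_posts is 4 for
    # email threads and 5 elsewhere, and the bounds follow rules (1)-(4).
    def keep_example(posts, min_posts=5, max_post_words=400,
                     total_bounds=(100, 1400), avg_bounds=(50, 300)):
        lengths = [len(p.split()) for p in posts]
        if len(posts) < min_posts:                           # rule (1)
            return False
        if max(lengths) > max_post_words:                    # rule (2)
            return False
        total = sum(lengths)
        if not total_bounds[0] <= total <= total_bounds[1]:  # rule (3)
            return False
        avg = total / len(lengths)
        return avg_bounds[0] <= avg <= avg_bounds[1]         # rule (4)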

Dataset   % novel n-grams     Extractive Oracle   Summary Length   Input Length   # Docs/Example
NYT       36.11/79.72/94.52   36.26/10.21/31.23   79               1624           16.95
Reddit    43.84/84.98/95.65   35.74/10.45/30.74   65                641            7.88
Stack     35.12/77.91/93.56   37.30/10.70/31.93   73               1207            9.72
Email     42.09/83.27/93.98   40.98/15.50/35.22   74                917            4.95

Table 2: Statistics across dataset sources in ConvoSumm, showing novel uni/bi/tri-grams, ROUGE-1/2/L extractive oracle scores, the average summary and input lengths (number of tokens), and the number of documents per example, where each comment/post/answer/email is considered a document.

Annotation Protocol We designed annotation instructions for crowdsourced workers to write abstractive summaries for each of the four datasets, motivated by work in summarizing viewpoints present in online conversation (Barker and Gaizauskas, 2016a). We present the crowdsource workers with the data threads, along with any available metadata. For NYT, we presented the workers with the article headline, keywords, and, rather than providing the entire article as context, an extractive BERT-based summary (Miller, 2019) of the article. We use a BERT summary to give the annotators an idea of the topic of the article. We avoided having annotators read the entire article since the focus of their summaries was solely the content of the comments as per the annotation protocols, and reading the entire article could introduce information into the summaries that was not necessarily representative of the comments' main points. We found that these summaries were useful in initial in-house annotations and allowed us to better understand the context of the comments being summarized. For Reddit and Stack, question tags and information about the subforum were provided; the Stack data includes both answers and answer comments, as well as the prompt/title. Reddit data was filtered simply on word limits due to the unavailability of up/down votes in the CoarseDiscourse data. Whenever possible, we included username information and the scores of all comments, posts, and answers.

Although the instructions differed slightly with the specific nuances of each dataset, they had standard overall rules: (1) summaries should be an analysis of the given input rather than another response or utterance; (2) summaries should be abstractive, i.e., annotators were required to paraphrase and could not repeat more than five words in a row from the source; and (3) summary lengths should be within [40, 90] tokens. Following the issues–viewpoints–assertions framework presented in Barker and Gaizauskas (2016b), we also instructed annotators that summaries should cover all viewpoints in the input and should try to include specific details from assertions and anecdotes (unless this made the summary too lengthy). Summarizing based on similar viewpoints is analogous to clustering then summarizing, similar to the comment label grouping procedure before summarization in Barker et al. (2016b). To help with this, we recommended wording such as "Most commenters suggest that..." and "Some commenters think that..." to group responses with similar viewpoints.

However, the email dataset was unique among the selected datasets in that it contained more back-and-forth dialogue than clusters of viewpoints, and thus identifying the speakers was essential to creating summaries that retained meaning from the original email dialogue. Since the email threads contained fewer individual speakers than the other datasets, this sort of summarization remained feasible. Thus, for this dataset, annotators were instructed to specify the speakers when summarizing the conversation.

Quality-Controlled Crowdsourcing We crowdsourced our data using Amazon Mechanical Turk. We required that our workers be native English speakers and pass a qualifying exam for each domain to be summarized. We worked with a select group of about 15 workers who formed a community of high-quality annotators. Example summaries were provided to the workers. The workers submitted the qualifying exam, and then one of the authors of this paper provided feedback.

If a worker was not sure of the quality of the summaries written, at any point, they could enlist the input of one of the authors.

Additionally, after the workers wrote all summaries, we manually reviewed every summary and made corrections to grammar, wording, and overall structure. Summaries we could not fix ourselves, either because they were poorly written or because they did not follow the annotation protocols, were flagged to be re-written. They were then sent to our approved group of workers to be re-written, excluding any workers who had written a flagged summary. While data crowdsourced from non-experts may contain noise (Gillick and Liu, 2010), we believe that our setup of working closely with a small group of workers, providing feedback to individual workers, and manually reviewing all final summaries mitigates these issues.

Dataset Statistics We provide statistics in Table 2. The percentage of novel n-grams in our summaries is higher than that of the very abstractive XSum dataset (Narayan et al., 2018) (35.76/83.45/95.50 % novel uni/bi/tri-grams); a sketch of this statistic is shown below. This level of abstraction is likely due to the instructions to perform abstractive summarization and to the summaries being an analysis of the input, which results in the insertion of new words (e.g., "commenters" likely isn't seen in the input). The influence of this abstraction is further seen in an analysis of the Extractive Oracle, for which we show ROUGE-1/2/L (Lin, 2004). The extractive oracle performance on our data is above that on the very abstractive XSum dataset (Narayan et al., 2018) (29.79 ROUGE-1), but much lower than that on the CNN-DailyMail (CNNDM) dataset (Nallapati et al., 2016) (>50 ROUGE-1). The summary lengths are fairly consistent, while the input lengths are the longest for NYT and Stack data. We include the title and additional metadata, such as the headline and snippet in NYT data, in input length calculations.
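For reference, the novel n-gram percentages in Table 2 can be computed as in the following sketch; whitespace tokenization is an assumption here, as we do not reproduce the exact tokenizer.

    # Sketch of the novel n-gram statistic: the percentage of summary
    # n-grams that never appear in the source (whitespace tokenization).
    def novel_ngram_pct(source, summary, n):
        src_tokens, summ_tokens = source.split(), summary.split()
        src_ngrams = {tuple(src_tokens[i:i + n])
                      for i in range(len(src_tokens) - n + 1)}
        summ_ngrams = [tuple(summ_tokens[i:i + n])
                       for i in range(len(summ_tokens) - n + 1)]
        novel = sum(1 for g in summ_ngrams if g not in src_ngrams)
        return 100.0 * novel / max(len(summ_ngrams), 1)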
We analyze multi-document summarization-specific characteristics of our datasets, as proposed by Dey et al. (2020a). In particular, inter-document similarity measures the degree of overlap of semantic units in the candidate documents, with scores farther from zero signifying less overlap. The notion introduced for redundancy measures the overall distribution of semantic units; the farther the score is from zero, the more uniform semantic units are across the entire input, with the maximum reached when each unit is present only once. Layout bias measures the similarity of multi-sentential documents with the reference. For more precise definitions, we refer the reader to Dey et al. (2020a). We provide results for our data in Table 3.

Dataset   Inter-document Similarity   Redundancy   Layout Bias
NYT       -11.71                      -0.23        0.2/0.5/0.3
Reddit     -7.56                      -0.49        0.2/0.5/0.2
Stack      -9.59                      -0.27        0.2/0.3/0.4
Email      -1.76                      -0.18        0.3/0.4/0.3

Table 3: Multi-document summarization-specific dataset analysis on our proposed datasets with metrics introduced in Dey et al. (2020a): inter-document similarity (farther from zero means less similarity), redundancy (farther from zero means less overall redundancy of semantic units), and start/middle/end layout bias.

Email data exhibits the most inter-document similarity, which follows the intuition that an email thread consists of a focused discussion, typically on a single topic. For redundancy, we see that Reddit shows the most uniform distribution of semantic units, perhaps due to Reddit threads' less focused nature compared to the remaining datasets. We do not see a particularly strong layout bias across any parts of the input documents. Our datasets exhibit greater or comparable levels of novel n-grams compared to multi-document summarization datasets such as MultiNews (Fabbri et al., 2019) and CQASUMM (Chowdhury and Chakraborty, 2018). Our Stack subset has lower inter-document similarity, which presents challenges for models that rely strictly on redundancy in the input, and our datasets generally exhibit less layout bias when compared to the analysis done in Dey et al. (2020b).

Comparison to Existing Datasets Although previous work on conversation summarization, before the introduction of SAMSum (Gliwa et al., 2019b), largely featured unsupervised or few-shot methods, there exist several datasets with reference summaries. These include SENSEI (Barker et al., 2016b) for news comments, the Argumentative Dialogue Summary Corpus (ADS) (Misra et al., 2015) for discussion forums, and the BC3 dataset (Ulrich et al., 2009) for email data. However, many of the existing datasets are not wide in scope. For example, SENSEI only covers six topics, and the ADS Corpus covers one topic and only has 45 dialogues. Furthermore, they each pertain to one subdomain of conversation. Our dataset avoids these issues by covering four diverse subdomains of conversation and having approximately 500 annotated summaries for each subdomain. Additionally, since neural abstractive summarization baselines do not exist for these datasets, we benchmark our models on them to further their use as test sets. We similarly include the AMI and ICSI meeting datasets within our benchmark.

Within community question answering, the WikiHowQA dataset (Deng et al., 2020) consists of user response threads to non-factoid questions starting with "how to," including labels for the answer selection task and reference summaries.

Figure 1: Sample argument subgraph constructed from NYT news comments illustrating varying viewpoints. Claims "I honestly..." and "but I dont.." are entailed by premises, connected through Default Inference nodes, and opposing claims are connected through Issue nodes.

The CQASUMM dataset (Chowdhury and Chakraborty, 2018) sampled threads from Yahoo! Answers in which the best answer could be used as a reference summary. However, this heuristic is not guaranteed to cover the perspectives of all the user answers, so we believe our dataset is a more principled benchmark for community question answering.

We also note that several large-scale MDS datasets have been introduced in the news domain (Fabbri et al., 2019; Gu et al., 2020; Gholipour Ghalandari et al., 2020), for creating Wikipedia lead paragraphs (Liu et al., 2018), and for long-form question answering (Fan et al., 2019). However, these do not focus on the conversational domain.

4 Argument Graph Summarization

As our annotation protocol is motivated by the issues–viewpoints–assertions framework proposed in Barker and Gaizauskas (2016a), we propose to instantiate a modified version of that work's theoretical graph model.

Argument Graph Construction We build on the argument graph formulation of Lenz et al. (2020), a variant of the Argument Interchange Format (Chesnevar et al., 2006). Claims and premises are represented as information nodes (I-nodes), with the relations between them represented as scheme nodes (S-nodes). Let V = I ∪ S be the set of nodes, and E ⊂ V × V the set of edges describing support relationships among the nodes. We then define the argument graph G = (V, E).

Lenz et al. (2020) break the construction of the argument graph down into four steps: (1) argument extraction, or the identification of argumentative discourse units; (2) relationship type classification, or the classification of edges between nodes; (3) major claim detection; and (4) graph construction, or the construction of the final graph based on the identified nodes and edges. To adapt this formulation to our multi-document setting, we first perform argument extraction and relationship type classification for each individual input document, and finally graph construction to determine relationships among claims from all documents.
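For concreteness, the following is a minimal sketch of these structures in Python; the class and field names are our illustrative choices, not part of the Lenz et al. (2020) implementation.

    # Sketch of the AIF-style argument graph: claims/premises are I-nodes,
    # relations (e.g., Default Inference, Issue) are S-nodes, and edges
    # carry support relationships, so that G = (V, E) with V = I ∪ S.
    from dataclasses import dataclass, field

    @dataclass
    class INode:               # information node: a claim or premise sentence
        text: str
        kind: str              # "claim" or "premise"

    @dataclass
    class SNode:               # scheme node mediating a relation
        scheme: str            # e.g., "Default Inference" or "Issue"

    @dataclass
    class ArgumentGraph:
        nodes: list = field(default_factory=list)
        edges: list = field(default_factory=list)   # (parent, child) pairs

        def add_support(self, premise, claim, scheme="Default Inference"):
            s = SNode(scheme)
            self.nodes.append(s)
            self.edges.extend([(premise, s), (s, claim)])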
Argument Extraction For extracting arguments from a single document, we build on work in argument mining with pretrained models (Chakrabarty et al., 2019). As in Lenz et al. (2020), our argumentative units are sentences, from which we identify claims, which are assertions that something is true, and premises, which are propositions from which a conclusion is drawn. Additionally, we identify and remove non-argumentative units. We train a three-way classifier for the task of argument extraction, following Chakrabarty et al. (2019) and making use of data for argument mining from that paper and from Stab and Gurevych (2014). The output of this step can also simply be used, without further graph construction, as a less noisy version of the input, which we call -arg-filtered.
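As an illustration of this step, the sketch below scores sentences with a sequence-classification head; the checkpoint path is a placeholder for a model fine-tuned on the argument-mining data above, not a released artifact, and the label order is our assumption.

    # Sketch of the three-way argumentative-unit classifier
    # (claim / premise / non-argumentative) over sentences.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    LABELS = ["claim", "premise", "non-argumentative"]
    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "path/to/argument-extraction-model", num_labels=3)  # placeholder

    def classify_sentences(sentences):
        batch = tok(sentences, padding=True, truncation=True,
                    return_tensors="pt")
        with torch.no_grad():
            logits = model(**batch).logits
        return [LABELS[i] for i in logits.argmax(dim=-1).tolist()]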
Relationship Type Classification We follow the procedure in Lenz et al. (2020) and use entailment to determine the relationship between argumentative units within a document. However, rather than using the classifier provided, we make use of RoBERTa (Liu et al., 2019b) fine-tuned on the MNLI entailment dataset (Williams et al., 2018). Rather than using both support and contradiction edges between claims and premises, we make the simplification that all relationships can be captured with support edges, as we are dealing with a single document in this step.

Within a single text, a premise can be tied as following from one of the claims. We create an edge between a premise and the claim it most entails if the entailment score from RoBERTa is greater than 0.33, a threshold chosen based on manual analysis of the scores. If a premise is not labeled as supporting any claim, we heuristically create an edge between that premise and the closest claim preceding it in the text.
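The following sketch illustrates this attachment step. We assume the public roberta-large-mnli checkpoint, whose output labels are ordered contradiction/neutral/entailment; the fallback to the closest preceding claim is left outside the sketch.

    # Sketch of premise-to-claim attachment with an MNLI entailment model;
    # the 0.33 threshold follows the manual analysis described above.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
    nli = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

    def entailment_score(premise, claim):
        batch = tok(premise, claim, return_tensors="pt", truncation=True)
        with torch.no_grad():
            probs = nli(**batch).logits.softmax(dim=-1)[0]
        return probs[2].item()                    # P(entailment)

    def attach_premise(premise, claims):
        scores = [entailment_score(premise, c) for c in claims]
        best = max(range(len(claims)), key=scores.__getitem__)
        # None signals the fallback: link to the closest preceding claim.
        return best if scores[best] > 0.33 else None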
Since not all texts in the benchmark datasets may be argumentative, or may be too short to contain major claims, we use some heuristics in our graph creation. If none of the argumentative sentences are labeled as claims in argument extraction (i.e., all are labeled as premises), the text's first sentence is labeled as the claim. Furthermore, we do not identify a single claim as the major claim, since there may be multiple major points of discussion.

Graph Construction For the final graph, for each of the documents in an example, we run the above procedure and obtain a set of claims and associated premises. We then identify support edges between claims, which may be across documents. One claim may make a larger assertion, which is supported by other claims. We run our entailment model over all potential edges (in both directions) among claims and greedily add edges according to the entailment support score while no cycles are made (a sketch of this greedy step is given below). After this step, we are left with a set of claims which do not entail any other nodes or, stated otherwise, do not have parent nodes. Following the terminology of Barker and Gaizauskas (2016b), these nodes can be considered viewpoints.

We then identify issues, or topics on which the viewpoints differ. We run our entailment model again, in both directions, over these parent claim nodes and identify nodes that contradict each other with probability over 0.33, based on manual analysis of the resulting graphs. We greedily add edges to maintain a tree structure, joining these nodes to a special node, which we call the Issue node. All Issue nodes, as well as claims which are not connected to any Issue node, are connected to a dummy Conversation Node, which serves as the root of the argument graph. We show an example Issue subgraph for NYT data in Figure 1.
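The sketch below shows one way to realize the greedy, cycle-free edge addition; the exact tie-breaking and data structures in our pipeline may differ, so this should be read as an approximation of the procedure described above.

    # Sketch of greedy cycle-free edge addition over claim pairs:
    # candidate support edges (scored in both directions) are taken in
    # decreasing entailment order and kept only if the graph stays acyclic.
    def add_edges_greedily(num_claims, scored_pairs):
        """scored_pairs: list of (score, parent_idx, child_idx) tuples."""
        children = {i: set() for i in range(num_claims)}

        def reaches(src, dst):                    # DFS reachability check
            stack, seen = [src], set()
            while stack:
                node = stack.pop()
                if node == dst:
                    return True
                if node not in seen:
                    seen.add(node)
                    stack.extend(children[node])
            return False

        kept = []
        for score, parent, child in sorted(scored_pairs, reverse=True):
            if not reaches(child, parent):        # parent->child adds no cycle
                children[parent].add(child)
                kept.append((parent, child))
        return kept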
Argument Graphs to Summaries Recent work has shown the strength of text-based pretrained models on graph-to-text problems (Ribeiro et al., 2020). Following that work, we linearize the graph using a depth-first traversal starting from the Conversation Node. We found that inserting special tokens to signify edge types did not improve performance, likely due to the size of our data, and simply make use of an arrow (→) to signify the relationship between sentences. We train a sequence-to-sequence model on our linearized graph input, which we call -arg-graph.
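A minimal sketch of this linearization, with the graph represented as a simple adjacency mapping from node text to child texts (an assumption made for illustration):

    # Sketch of the depth-first linearization: starting from the dummy
    # Conversation Node, node texts are emitted in DFS order with an
    # arrow marking each traversed edge.
    def linearize(children, root):
        pieces = []

        def visit(node):
            pieces.append(node)
            for child in children.get(node, []):
                pieces.append("→")
                visit(child)

        visit(root)
        return " ".join(pieces)

    # e.g., linearize({"Conversation Node": ["Claim A", "Claim B"],
    #                  "Claim A": ["Premise A1"]}, "Conversation Node")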
5 Experimental Settings

We use the fairseq codebase (Ott et al., 2019) for our experiments. Our base abstractive text summarization model is BART-large (Lewis et al., 2020), a pretrained denoising autoencoder with 336M parameters that builds on the sequence-to-sequence transformer of Vaswani et al. (2017). We fine-tune BART using a polynomial decay learning rate scheduler with the Adam optimizer (Kingma and Ba, 2015). We used a learning rate of 3e-5, with 20 warmup updates and 200 total updates, following previous few-shot transfer work (Fabbri et al., 2020a). We could have equally fine-tuned other pretrained models such as Pegasus (Zhang et al., 2019) or T5 (Raffel et al., 2019), but Fabbri et al. (2020a) find that BART largely performs equally well in few-shot settings when compared to Pegasus.
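In plain PyTorch terms, this schedule corresponds roughly to the following sketch, assuming fairseq's polynomial_decay semantics (linear warmup followed by polynomial decay to zero); the stand-in module is only there to make the example self-contained.

    # Sketch of the few-shot fine-tuning schedule: lr 3e-5, 20 warmup
    # updates, 200 total updates, linear (power=1) decay afterwards.
    import torch

    BASE_LR, WARMUP, TOTAL = 3e-5, 20, 200

    def lr_at(step, end_lr=0.0, power=1.0):
        if step < WARMUP:
            return BASE_LR * step / WARMUP
        remaining = (TOTAL - step) / max(TOTAL - WARMUP, 1)
        return end_lr + (BASE_LR - end_lr) * max(remaining, 0.0) ** power

    params = torch.nn.Linear(8, 8).parameters()   # stand-in for BART params
    optimizer = torch.optim.Adam(params, lr=BASE_LR)
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lr_lambda=lambda step: lr_at(step) / BASE_LR)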
For the NYT and Stack datasets, which contain sequences over the typical 1024-token maximum encoder length with which BART is trained, we copied the encoder positional embeddings to allow sequences up to length 2048. To address the input length of meeting transcripts, which range from 6k to 12k tokens, we use the Longformer (Beltagy et al., 2020), which allows for sequences up to 16k tokens.
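One way to realize this copying is sketched below; the tensor names are illustrative and do not correspond to fairseq internals.

    # Sketch of extending learned positional embeddings from 1024 to 2048
    # positions by tiling the original table so new slots reuse old weights.
    import torch

    def extend_positions(pos_emb, new_max=2048):
        """pos_emb: [old_max, dim] learned positional embedding weights."""
        old_max, _ = pos_emb.shape
        reps = -(-new_max // old_max)                 # ceiling division
        return pos_emb.repeat(reps, 1)[:new_max].clone()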

We initialize the Longformer model with BART parameters trained on the CNN-DailyMail dataset, as the meeting summarization datasets contain fewer than 100 data points. We otherwise fine-tune models from vanilla BART, following intuition in few-shot summarization (Fabbri et al., 2020a) and based on initial experiments. In the tables which follow, "-arg" refers to any model trained with argument-mining-based input, and we specify below which of the -arg-graph or -arg-filtered settings was used for each dataset.

6 Results

We provide results for baseline unsupervised extractive models in Table 4: LexRank (Erkan and Radev, 2004), TextRank (Mihalcea and Tarau, 2004), and BERT-ext (Miller, 2019), which makes use of BERT (Devlin et al., 2019). The unsupervised extractive models perform well below the extractive oracle, suggesting the difficulty of content selection in this setting.

Dataset   LexRank            TextRank           BERT-ext
NYT       22.30/3.87/19.14   25.11/3.75/20.61   25.88/3.81/22.00
Reddit    22.71/4.52/19.38   24.38/4.54/19.84   24.51/4.18/20.95
Stack     26.30/5.62/22.27   25.43/4.40/20.58   26.84/4.63/22.85
Email     16.04/3.68/13.38   19.50/3.90/16.18   25.46/6.17/21.73

Table 4: ROUGE-1/2/L results for extractive LexRank (Erkan and Radev, 2004), TextRank (Mihalcea and Tarau, 2004), and BERT-based (Miller, 2019) models.
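As a concrete reference for the ROUGE-1/2/L F1 numbers reported in our tables, scores can be computed with the rouge-score package as sketched below; this stands in for our exact evaluation script, which is not reproduced here, and the example strings are purely illustrative.

    # Sketch of ROUGE-1/2/L scoring (pip install rouge-score).
    from rouge_score import rouge_scorer

    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                      use_stemmer=True)
    scores = scorer.score(
        "Several commenters list their favorite things about the Super Bowl.",
        "Commenters discuss what they like about the Super Bowl.")
    print({k: round(v.fmeasure * 100, 2) for k, v in scores.items()})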
For the abstractive models, we train BART on 200 examples from our validation set, using the remaining 50 for validation, and test on the final test set of 250 examples. We also tested zero-shot transfer from CNNDM and SAMSum, although these resulted in much lower performance of about 28 ROUGE-1. Few-shot model performance is shown in Table 5. The abstractive model performs at or above the Extractive Oracle, suggesting the need for better abstractive models.

Dataset   BART                BART-arg
NYT       35.91/9.22/31.28    36.60/9.83/32.61
Reddit    35.50/10.64/32.57   36.39/11.38/33.57
Stack     39.61/10.98/35.35   39.73/11.17/35.52
Email     41.46/13.76/37.70   40.32/12.97/36.90

Table 5: ROUGE-1/2/L results for vanilla BART as well as BART trained on argument-mining input. Both are trained on 200 points from ConvoSumm.

We also train on our argument-mining-based approaches and show results in Table 5. We see ROUGE improvements when applying BART-arg-graph for Reddit and Stack data. The -arg-filtered variation (which, as defined in Section 4, is the less noisy version of the input produced by the argument extraction step) outperformed the -arg-graph variation on both email and NYT data. For email data, however, this did not improve upon the BART baseline, likely due to the dataset's characteristics: email data is shorter and more linear, and does not benefit from modeling the argument structure or removing non-argumentative units. We provide full results for both variations in the Appendix.

Benchmarking Other Conversation Summarization Datasets We benchmark our models on widely-used meeting summarization datasets. Due to the input's linear nature and the size of the meeting transcripts, we found improved results using -arg-filtered to remove non-argumentative units rather than incorporating the graph structure. Results are shown in Table 6. The Longformer model performs as well as or better than previous state-of-the-art results on these datasets, despite not making use of more complex modeling structures, and we generally see improvement with argument mining.

Method                AMI                 ICSI
HMNet                 53.02/18.57/-       46.28/10.60/-
DDA-GCN               53.15/22.32/-       -
Longformer-BART       54.20/20.72/51.36   43.03/12.14/40.26
Longformer-BART-arg   54.47/20.83/51.74   44.17/11.69/41.33

Table 6: ROUGE-1/2/L results for DDA-GCN (Feng et al., 2020) and HMNet (Zhu et al., 2020) on the AMI and ICSI meeting summarization datasets, along with our Longformer and Longformer-arg models.

As noted above, there exist prior datasets for dialogue, community question answering, email, forum, and news comments summarization. We benchmark results on these datasets in Table 7. We outperform prior work on SAMSum (Gliwa et al., 2019b) and CQASUMM (Chowdhury and Chakraborty, 2018) with our BART and BART-arg-graph models, respectively. We did not find improvement on SAMSum with the BART-arg model due to the extremely short and focused nature of the dialogues, analogous to the email data performance. We also provide transfer results of BART and BART-arg-graph models from our email and news-comment data to BC3 (Ulrich et al., 2009), ADS (Misra et al., 2015), and SENSEI (Barker et al., 2016b), for which no prior neural abstractive summarization results existed.

Dataset    Our results         Previous SOTA
SAMSum     52.27/27.82/47.92   49.30/25.60/47.70
CQASUMM    32.79/6.68/28.83    31.00/5.00/15.20
BC3        39.59/13.98/21.20   -
ADS        37.18/11.42/21.27   -
SENSEI     34.57/7.08/16.80    -

Table 7: Benchmarking results on conversational datasets such as SAMSum (Gliwa et al., 2019b) and CQASUMM (Chowdhury and Chakraborty, 2018), and initial neural abstractive summarization results for email (BC3) (Ulrich et al., 2008), debate discussion forums (ADS) (Misra et al., 2015), and news comments (SENSEI) (Barker et al., 2016b).

Human Evaluations We collect human judgment annotations for two of the four quality dimensions studied in Kryscinski et al. (2019) and Fabbri et al. (2020b), namely consistency and relevance.
Consistency is defined as the factual alignment between the summary and the summarized source text, while relevance is defined as the summary's ability to select important content; only relevant information and viewpoints should be included. We did not include fluency, as an initial inspection of the data found fluency to be of very high quality, as has been shown to be the case for pretrained models in news summarization (Fabbri et al., 2020b). We did not include coherence, as this was generally not an issue of concern in the initial analysis.

We randomly select 25 examples from the Reddit corpus and ten examples from the AMI corpus, along with output from the BART and BART-arg-graph models. These data points were chosen to demonstrate what characteristics are realized in differences across ROUGE for the argument-graph and argument-noise-reduction approaches. Ten examples were chosen from AMI due to the size of the input and annotation constraints. The annotator sees the source input and randomly-ordered output from the models and then rates the summaries for relevance and consistency on a Likert scale from 1 to 5, with 5 being the best score. We averaged the scores of three native English-speaking annotators on each example and then across examples. Results are shown in Table 8. We find that the annotators prefer our argument-mining-based approaches in both dimensions, although the results are close. Furthermore, the scores for relevance and consistency are rather low, especially on the Reddit dataset and when compared to results on the CNN-DailyMail dataset from Fabbri et al. (2020b). These results demonstrate the difficulty of modeling such conversational data. Examples are included in the appendix.

Target Dataset   BART: Relevance   BART: Consistency   BART-arg: Relevance   BART-arg: Consistency
Reddit           3.39 (0.13)       3.40 (0.12)         3.47 (0.12)           3.41 (0.10)
AMI              4.07 (0.16)       3.67 (0.16)         4.13 (0.17)           3.70 (0.17)

Table 8: Mean relevance and factual consistency annotations for BART and BART-arg outputs on Reddit and AMI. Standard errors are reported in parentheses.

7 Conclusion

We propose ConvoSumm, a benchmark of four new, crowdsourced conversation datasets, together with state-of-the-art baselines on widely-used datasets, that promotes more unified progress in summarization beyond the news domain. Our benchmark consists of high-quality, human-written summaries that call for abstractive summaries and a deeper understanding of the input texts' structure. We provide results for baseline models and propose to model the text's argument structure, showing that such structure helps better quantify viewpoints in non-linear input in both automatic and human evaluations. Our analysis notes challenges in modeling relevance and consistency in abstractive conversation summarization when compared to news summarization.

8 Ethical Considerations

As we propose novel conversation summarization datasets and modeling components, this section is divided into the following two parts.

8.1 New Dataset

Intellectual Properties and Privacy Rights All data for our newly-introduced datasets are available online; please see the following for New York Times comment data², StackExchange data³, and W3C email data⁴. Reddit data is available via the Google BigQuery tool⁵.

²https://www.kaggle.com/aashita/nyt-comments
³https://archive.org/download/stackexchange
⁴https://tides.umiacs.umd.edu/webtrec/trecent/parsed_w3c_corpus.html
⁵https://console.cloud.google.com/bigquery

Compensation for Annotators We compensated the Turkers approximately $12–$15 per hour. We first annotated examples in-house to determine the required annotation speed. Typically, the summarization task took around 10 minutes, and we compensated the workers from $2.25 to $3.00 per task, depending on the domain and deadline requirements.

Steps Taken to Avoid Potential Problems We interacted closely with the Turkers to ensure that compensation was fair and that the instructions were clear. To maintain the quality of the dataset, we manually reviewed the crowdsourced summaries for language use. Initial investigation into the Reddit data showed certain inappropriate language usage, so we filtered these examples automatically.

8.2 NLP Application

Bias Biases may exist in the datasets, such as political bias in the news datasets and gender bias in potentially all of the datasets. Thus, models trained on these datasets may propagate these biases. We removed data with offensive language when possible.

Misuse Potential and Failure Mode When used as intended, applying the summarization models described in this paper can save people much time. However, the current models are still prone to producing hallucinated summaries, and in such cases they may contribute to misinformation on the internet. Further research is needed to ensure the faithfulness of abstractive summaries in order to address this issue, which is present among all current abstractive summarization models.

Environmental Cost The experiments described in the paper make use of V100 GPUs. We used up to 8 GPUs per experiment (depending on the experiment; sometimes a single GPU was used to run the maximum number of experiments in parallel). The experiments may take up to a couple of hours for the larger datasets. Several dozen experiments were run due to parameter search, and future work should experiment with distilled models for more lightweight training. We note that while our work required extensive experiments to draw sound conclusions, future work will be able to draw on these insights and need not run as many large-scale comparisons. Models in production may be trained once using the most promising settings.

References

Emma Barker and Robert Gaizauskas. 2016a. Summarizing multi-party argumentative conversations in reader comment on news. In Proceedings of the Third Workshop on Argument Mining (ArgMining2016), pages 12–20, Berlin, Germany. Association for Computational Linguistics.

Emma Barker and Robert Gaizauskas. 2016b. Summarizing multi-party argumentative conversations in reader comment on news. In Proceedings of the Third Workshop on Argument Mining (ArgMining2016), pages 12–20, Berlin, Germany. Association for Computational Linguistics.

Emma Barker, Monica Lestari Paramita, Ahmet Aker, Emina Kurtic, Mark Hepple, and Robert Gaizauskas. 2016a. The SENSEI annotated corpus: Human summaries of reader comment conversations in on-line news. In Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 42–52, Los Angeles. Association for Computational Linguistics.

Emma Barker, Monica Lestari Paramita, Ahmet Aker, Emina Kurtic, Mark Hepple, and Robert Gaizauskas. 2016b. The SENSEI annotated corpus: Human summaries of reader comment conversations in on-line news. In Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 42–52, Los Angeles. Association for Computational Linguistics.

Christos Baziotis, Ion Androutsopoulos, Ioannis Konstas, and Alexandros Potamianos. 2019. SEQ^3: Differentiable sequence-to-sequence-to-sequence autoencoder for unsupervised abstractive sentence compression. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 673–681, Minneapolis, Minnesota. Association for Computational Linguistics.

Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.

Arthur Bražinskas, Mirella Lapata, and Ivan Titov. 2020. Unsupervised opinion summarization as copycat-review generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5151–5169, Online. Association for Computational Linguistics.

Tuhin Chakrabarty, Christopher Hidey, Smaranda Muresan, Kathy McKeown, and Alyssa Hwang. 2019. AMPERSAND: Argument mining for PERSuAsive oNline discussions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2933–2943, Hong Kong, China. Association for Computational Linguistics.

Jiaao Chen and Diyi Yang. 2020. Multi-view sequence-to-sequence models with conversational structure for abstractive dialogue summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4106–4118, Online. Association for Computational Linguistics.

Carlos Chesnevar, Sanjay Modgil, Iyad Rahwan, Chris Reed, Guillermo Simari, Matthew South, Gerard Vreeswijk, Steven Willmott, et al. 2006. Towards an argument interchange format. The Knowledge Engineering Review, 21(4):293–316.

Tanya Chowdhury and Tanmoy Chakraborty. 2018. CQASUMM: Building references for community question answering summarization corpora.

Eric Chu and Peter J. Liu. 2019. MeanSum: A neural model for unsupervised multi-document abstractive summarization. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 1223–1232. PMLR.

                                                    6875
Nick Craswell, Arjen P de Vries, and Ian Soboroff.          and few-shot abstractive summarization with inter-
  2005. Overview of the trec 2005 enterprise track.         mediate fine-tuning and data augmentation. arXiv
  In TREC, volume 5, pages 199–205.                         preprint arXiv:2010.12836.

Yang Deng, Wai Lam, Yuexiang Xie, Daoyuan Chen,           Alexander R Fabbri, Wojciech Kryściński, Bryan
  Yaliang Li, Min Yang, and Ying Shen. 2020. Joint          McCann, Caiming Xiong, Richard Socher, and
  learning of answer selection and answer summary           Dragomir Radev. 2020b.       Summeval:     Re-
  generation in community question answering. In            evaluating summarization evaluation.     arXiv
  The Thirty-Fourth AAAI Conference on Artificial In-       preprint arXiv:2007.12626.
  telligence, AAAI 2020, The Thirty-Second Innova-
  tive Applications of Artificial Intelligence Confer-    Angela Fan, Yacine Jernite, Ethan Perez, David Grang-
  ence, IAAI 2020, The Tenth AAAI Symposium on Ed-          ier, Jason Weston, and Michael Auli. 2019. ELI5:
  ucational Advances in Artificial Intelligence, EAAI       Long form question answering. In Proceedings of
  2020, New York, NY, USA, February 7-12, 2020,             the 57th Annual Meeting of the Association for Com-
  pages 7651–7658. AAAI Press.                              putational Linguistics, pages 3558–3567, Florence,
                                                            Italy. Association for Computational Linguistics.
Shrey Desai, Jiacheng Xu, and Greg Durrett. 2020.         Xiachong Feng, Xiaocheng Feng, Bing Qin, Xin-
  Compressive summarization with plausibility and           wei Geng, and Ting Liu. 2020.           Dialogue
  salience modeling. In Proceedings of the 2020 Con-        discourse-aware graph convolutional networks for
  ference on Empirical Methods in Natural Language          abstractive meeting summarization. arXiv preprint
  Processing (EMNLP), pages 6259–6274, Online. As-          arXiv:2012.03502.
  sociation for Computational Linguistics.
                                                          Cecilia E Ford. 1991. Linguistics: The cambridge sur-
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and               vey: Volume 4. language: The socio-cultural context.
   Kristina Toutanova. 2019. BERT: Pre-training of          frederick h. newmeyer (ed.). Studies in Second Lan-
   deep bidirectional transformers for language under-      guage Acquisition, 13(3):412–413.
   standing. In Proceedings of the 2019 Conference
   of the North American Chapter of the Association       Michel Galley. 2006. A skip-chain conditional random
   for Computational Linguistics: Human Language            field for ranking meeting utterances by importance.
  Technologies, Volume 1 (Long and Short Papers),           In Proceedings of the 2006 Conference on Empiri-
   pages 4171–4186, Minneapolis, Minnesota. Associ-         cal Methods in Natural Language Processing, pages
   ation for Computational Linguistics.                     364–372, Sydney, Australia. Association for Com-
                                                            putational Linguistics.
Alvin Dey, Tanya Chowdhury, Yash Kumar, and Tan-
  moy Chakraborty. 2020a. Corpora evaluation and          Prakhar Ganesh and Saket Dingliwal. 2019. Abstrac-
  system bias detection in multi-document summariza-        tive summarization of spoken and written conversa-
  tion. In Findings of the Association for Computa-         tion. CoRR, abs/1902.01615.
  tional Linguistics: EMNLP 2020, pages 2830–2840,
  Online. Association for Computational Linguistics.      Demian Gholipour Ghalandari, Chris Hokamp,
                                                            Nghia The Pham, John Glover, and Georgiana Ifrim.
Alvin Dey, Tanya Chowdhury, Yash Kumar, and Tan-            2020. A large-scale multi-document summarization
  moy Chakraborty. 2020b. Corpora evaluation and            dataset from the Wikipedia current events portal.
  system bias detection in multi document summariza-        In Proceedings of the 58th Annual Meeting of the
  tion. In Proceedings of the 2020 Conference on            Association for Computational Linguistics, pages
  Empirical Methods in Natural Language Processing:         1302–1308, Online. Association for Computational
  Findings, pages 2830–2840, Online. Association for        Linguistics.
  Computational Linguistics.                              Dan Gillick and Yang Liu. 2010. Non-expert evalua-
                                                            tion of summarization systems is risky. In Proceed-
Günes Erkan and Dragomir R Radev. 2004. Lexrank:           ings of the NAACL HLT 2010 Workshop on Creating
  Graph-based lexical centrality as salience in text        Speech and Language Data with Amazon’s Mechan-
  summarization. Journal of artificial intelligence re-     ical Turk, pages 148–151, Los Angeles. Association
  search, 22:457–479.                                       for Computational Linguistics.
Alexander Fabbri, Irene Li, Tianwei She, Suyi Li, and     Bogdan Gliwa, Iwona Mochol, Maciej Biesek, and
  Dragomir Radev. 2019. Multi-news: A large-scale           Aleksander Wawer. 2019a. SAMSum corpus: A
  multi-document summarization dataset and abstrac-         human-annotated dialogue dataset for abstractive
  tive hierarchical model. In Proceedings of the 57th       summarization. In Proceedings of the 2nd Workshop
  Annual Meeting of the Association for Computa-            on New Frontiers in Summarization, pages 70–79,
  tional Linguistics, pages 1074–1084, Florence, Italy.     Hong Kong, China. Association for Computational
  Association for Computational Linguistics.                Linguistics.
Alexander R Fabbri, Simeng Han, Haoyuan Li, Haoran        Bogdan Gliwa, Iwona Mochol, Maciej Biesek, and
  Li, Marjan Ghazvininejad, Shafiq Joty, Dragomir           Aleksander Wawer. 2019b. SAMSum corpus: A
  Radev, and Yashar Mehdad. 2020a. Improving zero           human-annotated dialogue dataset for abstractive

                                                     6876
summarization. In Proceedings of the 2nd Workshop        9th International Joint Conference on Natural Lan-
  on New Frontiers in Summarization, pages 70–79,          guage Processing (EMNLP-IJCNLP), pages 540–
  Hong Kong, China. Association for Computational          551, Hong Kong, China. Association for Computa-
  Linguistics.                                             tional Linguistics.

Chih-Wen Goo and Yun-Nung Chen. 2018a. Ab-              Philippe Laban, Andrew Hsi, John Canny, and Marti A.
  stractive dialogue summarization with sentence-         Hearst. 2020. The summary loop: Learning to write
  gated modeling optimized by dialogue acts. In           abstractive summaries without examples. In Pro-
  2018 IEEE Spoken Language Technology Workshop           ceedings of the 58th Annual Meeting of the Asso-
  (SLT), pages 735–742. IEEE.                             ciation for Computational Linguistics, pages 5135–
                                                          5150, Online. Association for Computational Lin-
Chih-Wen Goo and Yun-Nung Chen. 2018b. Abstrac-           guistics.
  tive dialogue summarization with sentence-gated
  modeling optimized by dialogue acts.     CoRR,        Mirko Lenz, Premtim Sahitaj, Sean Kallenberg,
  abs/1809.05715.                                         Christopher Coors, Lorik Dumani, Ralf Schenkel,
                                                          and Ralph Bergmann. 2020. Towards an argu-
Xiaotao Gu, Yuning Mao, Jiawei Han, Jialu Liu, You        ment mining pipeline transforming texts to argument
  Wu, Cong Yu, Daniel Finnie, Hongkun Yu, Jiaqi           graphs. arXiv preprint arXiv:2006.04562.
  Zhai, and Nicholas Zukoski. 2020. Generating rep-
  resentative headlines for news stories. In WWW ’20:   Mike Lewis, Yinhan Liu, Naman Goyal, Mar-
  The Web Conference 2020, Taipei, Taiwan, April 20-      jan Ghazvininejad, Abdelrahman Mohamed, Omer
  24, 2020, pages 1773–1784. ACM / IW3C2.                 Levy, Veselin Stoyanov, and Luke Zettlemoyer.
                                                          2020. BART: Denoising sequence-to-sequence pre-
Doris Hoogeveen, Karin M Verspoor, and Timothy            training for natural language generation, translation,
  Baldwin. 2015. Cqadupstack: A benchmark data            and comprehension. In Proceedings of the 58th An-
  set for community question-answering research. In       nual Meeting of the Association for Computational
  Proceedings of the 20th Australasian document com-      Linguistics, pages 7871–7880, Online. Association
  puting symposium, pages 1–8.                            for Computational Linguistics.

Xinyu Hua and Lu Wang. 2017. A pilot study of do-       Chin-Yew Lin. 2004. ROUGE: A package for auto-
  main adaptation effect for neural abstractive sum-      matic evaluation of summaries. In Text Summariza-
  marization. In Proceedings of the Workshop on           tion Branches Out, pages 74–81, Barcelona, Spain.
  New Frontiers in Summarization, pages 100–106,          Association for Computational Linguistics.
  Copenhagen, Denmark. Association for Computa-         Chunyi Liu, Peng Wang, Jiang Xu, Zang Li, and
  tional Linguistics.                                     Jieping Ye. 2019a. Automatic dialogue summary
                                                          generation for customer service. In Proceedings of
Adam Janin, Don Baron, Jane Edwards, Dan Ellis,
                                                          the 25th ACM SIGKDD International Conference on
  David Gelbart, Nelson Morgan, Barbara Peskin,
                                                          Knowledge Discovery & Data Mining, KDD 2019,
  Thilo Pfau, Elizabeth Shriberg, Andreas Stolcke,
                                                          Anchorage, AK, USA, August 4-8, 2019, pages 1957–
  et al. 2003. The icsi meeting corpus. In 2003 IEEE
                                                          1965. ACM.
  International Conference on Acoustics, Speech, and
  Signal Processing, 2003. Proceedings.(ICASSP’03).,    Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben
  volume 1, pages I–I. IEEE.                              Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam
                                                          Shazeer. 2018. Generating wikipedia by summariz-
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A            ing long sequences. In 6th International Conference
  method for stochastic optimization. In 3rd Inter-       on Learning Representations, ICLR 2018, Vancou-
  national Conference on Learning Representations,        ver, BC, Canada, April 30 - May 3, 2018, Confer-
  ICLR 2015, San Diego, CA, USA, May 7-9, 2015,           ence Track Proceedings. OpenReview.net.
  Conference Track Proceedings.
                                                        Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man-
Varada Kolhatkar and Maite Taboada. 2017. Using           dar Joshi, Danqi Chen, Omer Levy, Mike Lewis,
  New York Times picks to identify constructive com-      Luke Zettlemoyer, and Veselin Stoyanov. 2019b.
  ments. In Proceedings of the 2017 EMNLP Work-           Roberta: A robustly optimized bert pretraining ap-
  shop: Natural Language Processing meets Journal-        proach. arXiv preprint arXiv:1907.11692.
  ism, pages 100–105, Copenhagen, Denmark. Asso-
  ciation for Computational Linguistics.                Zhengyuan Liu, Angela Ng, Sheldon Lee Shao Guang,
                                                          Ai Ti Aw, and Nancy F. Chen. 2019c. Topic-aware
Wessel Kraaij, Thomas Hain, Mike Lincoln, and Wil-        pointer-generator networks for summarizing spoken
  fried Post. 2005. The ami meeting corpus.               conversations. CoRR, abs/1910.01335.
Wojciech Kryscinski, Nitish Shirish Keskar, Bryan Mc-   Rada Mihalcea and Paul Tarau. 2004. TextRank:
 Cann, Caiming Xiong, and Richard Socher. 2019.           Bringing order into text. In Proceedings of the 2004
 Neural text summarization: A critical evaluation. In     Conference on Empirical Methods in Natural Lan-
 Proceedings of the 2019 Conference on Empirical          guage Processing, pages 404–411, Barcelona, Spain.
 Methods in Natural Language Processing and the           Association for Computational Linguistics.

                                                    6877
Derek Miller. 2019. Leveraging bert for extractive          Shasha Xie, Yang Liu, and Hui Lin. 2008. Evaluating
  text summarization on lectures. arXiv preprint              the effectiveness of features and sampling in extrac-
  arXiv:1906.04165.                                           tive meeting summarization. In 2008 IEEE Spoken
                                                              Language Technology Workshop, pages 157–160.
Amita Misra, Pranav Anand, Jean E. Fox Tree, and
 Marilyn Walker. 2015. Using summarization to dis-          Christian Stab and Iryna Gurevych. 2014. Identifying
 cover argument facets in online idealogical dialog.          argumentative discourse structures in persuasive es-
 In Proceedings of the 2015 Conference of the North           says. In Proceedings of the 2014 Conference on
 American Chapter of the Association for Computa-             Empirical Methods in Natural Language Processing
 tional Linguistics: Human Language Technologies,             (EMNLP), pages 46–56, Doha, Qatar. Association
 pages 430–440, Denver, Colorado. Association for             for Computational Linguistics.
 Computational Linguistics.
                                                            Manfred Stede, Stergos Afantenos, Andreas Peldszus,
Amita Misra, Pranav Anand, Jean E Fox Tree, and Mar-         Nicholas Asher, and Jérémy Perret. 2016. Parallel
 ilyn Walker. 2017. Using summarization to discover          discourse annotations on a corpus of short texts. In
 argument facets in online ideological dialog. arXiv         Proceedings of the Tenth International Conference
 preprint arXiv:1709.00662.                                  on Language Resources and Evaluation (LREC’16),
                                                             pages 1051–1058, Portorož, Slovenia. European
Ramesh Nallapati, Bowen Zhou, Cicero dos Santos,
                                                             Language Resources Association (ELRA).
  Çağlar Gulçehre, and Bing Xiang. 2016. Abstrac-
  tive text summarization using sequence-to-sequence        Mattia Tomasoni and Minlie Huang. 2010. Metadata-
  RNNs and beyond. In Proceedings of The 20th                aware measures for answer summarization in com-
  SIGNLL Conference on Computational Natural Lan-            munity question answering. In Proceedings of the
  guage Learning, pages 280–290, Berlin, Germany.            48th Annual Meeting of the Association for Compu-
  Association for Computational Linguistics.                 tational Linguistics, pages 760–769, Uppsala, Swe-
Shashi Narayan, Shay B. Cohen, and Mirella Lapata.           den. Association for Computational Linguistics.
  2018. Don’t give me the details, just the summary!        J. Ulrich, G. Murray, and G. Carenini. 2008. A publicly
  topic-aware convolutional neural networks for ex-            available annotated corpus for supervised email sum-
  treme summarization. In Proceedings of the 2018              marization. In AAAI08 EMAIL Workshop, Chicago,
  Conference on Empirical Methods in Natural Lan-              USA. AAAI.
  guage Processing, pages 1797–1807, Brussels, Bel-
  gium. Association for Computational Linguistics.          Jan Ulrich, Giuseppe Carenini, Gabriel Murray, and
Myle Ott, Sergey Edunov, Alexei Baevski, Angela                Raymond Ng. 2009. Regression-based summariza-
 Fan, Sam Gross, Nathan Ng, David Grangier, and                tion of email conversations. In Proceedings of the
 Michael Auli. 2019. fairseq: A fast, extensible              International AAAI Conference on Web and Social
 toolkit for sequence modeling. In Proceedings of             Media, volume 3.
 the 2019 Conference of the North American Chap-            Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
 ter of the Association for Computational Linguistics         Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz
 (Demonstrations), pages 48–53, Minneapolis, Min-             Kaiser, and Illia Polosukhin. 2017. Attention is all
 nesota. Association for Computational Linguistics.           you need. In Advances in Neural Information Pro-
Tatsuro Oya, Yashar Mehdad, Giuseppe Carenini, and            cessing Systems 30: Annual Conference on Neural
  Raymond Ng. 2014. A template-based abstractive              Information Processing Systems 2017, December 4-
  meeting summarization: Leveraging summary and               9, 2017, Long Beach, CA, USA, pages 5998–6008.
  source text relationships. In Proceedings of the 8th
                                                            Danqing Wang, Pengfei Liu, Ming Zhong, Jie Fu,
  International Natural Language Generation Confer-
                                                              Xipeng Qiu, and Xuanjing Huang. 2019. Exploring
  ence (INLG), pages 45–53, Philadelphia, Pennsylva-
                                                              domain shift in extractive text summarization.
  nia, U.S.A. Association for Computational Linguis-
  tics.                                                     Lu Wang and Claire Cardie. 2013.             Domain-
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine           independent abstract generation for focused meeting
  Lee, Sharan Narang, Michael Matena, Yanqi Zhou,             summarization. In Proceedings of the 51st Annual
  Wei Li, and Peter J Liu. 2019. Exploring the limits         Meeting of the Association for Computational Lin-
  of transfer learning with a unified text-to-text trans-     guistics (Volume 1: Long Papers), pages 1395–1405,
  former. arXiv preprint arXiv:1910.10683.                    Sofia, Bulgaria. Association for Computational Lin-
                                                              guistics.
Leonardo FR Ribeiro, Martin Schmitt, Hinrich Schütze,
  and Iryna Gurevych. 2020. Investigating pretrained        Adina Williams, Nikita Nangia, and Samuel Bowman.
  language models for graph-to-text generation. arXiv         2018. A broad-coverage challenge corpus for sen-
  preprint arXiv:2007.08426.                                  tence understanding through inference. In Proceed-
                                                              ings of the 2018 Conference of the North American
Evan Sandhaus. 2008. The new york times annotated             Chapter of the Association for Computational Lin-
  corpus. Linguistic Data Consortium, Philadelphia,           guistics: Human Language Technologies, Volume
  6(12):e26752.                                               1 (Long Papers), pages 1112–1122, New Orleans,

                                                       6878
You can also read