ConvoSumm: Conversation Summarization Benchmark and Improved Abstractive Summarization with Argument Mining

Alexander R. Fabbri† Faiaz Rahman† Imad Rizvi† Borui Wang†
Haoran Li‡ Yashar Mehdad‡ Dragomir Radev†
†Yale University ‡Facebook AI
{alexander.fabbri, faiaz.rahman, imad.rizvi, borui.wang, dragomir.radev}@yale.edu
{aimeeli, mehdad}@fb.com

Abstract

While online conversations can cover a vast amount of information in many different formats, abstractive text summarization has primarily focused on modeling solely news articles. This research gap is due, in part, to the lack of standardized datasets for summarizing online discussions. To address this gap, we design annotation protocols motivated by an issues–viewpoints–assertions framework to crowdsource four new datasets on diverse online conversation forms of news comments, discussion forums, community question answering forums, and email threads. We benchmark state-of-the-art models on our datasets and analyze characteristics associated with the data. To create a comprehensive benchmark, we also evaluate these models on widely-used conversation summarization datasets to establish strong baselines in this domain. Furthermore, we incorporate argument mining through graph construction to directly model the issues, viewpoints, and assertions present in a conversation and to filter noisy input, showing comparable or improved results according to automatic and human evaluations.

1 Introduction

Automatic text summarization is the process of outputting the most salient parts of an input in a concise and readable form. Recent work in summarization has made significant progress due to the introduction of large-scale datasets such as the CNN-DailyMail dataset (Nallapati et al., 2016) and the New York Times dataset (Sandhaus, 2008). Furthermore, the use of large self-supervised pretrained models such as BART (Lewis et al., 2020) and Pegasus (Zhang et al., 2019) has achieved state-of-the-art performance across summarization tasks and strong performance in zero- and few-shot settings (Fabbri et al., 2020a). However, less work has focused on summarizing online conversations.

Headline: SuperBowl
Snippet: Whether you're a football fan or not, what do you like about Super Bowl Sunday?
Comment: ... In my opinion I think the Falcons will stomp the patriots. I think Tom Brady will choke the Super Bowl. ...
Comment: I am big Arizona Cardinals fan so when they didn't even make the playoffs i was upset. ...
Comment: I'm not a very big football fan at all. So when it comes to Superbowl Sunday, I'm in it for the commercials and the half time show. ...
Comment: I am not exactly a football fan, but I enjoy watching the Super Bowl....
...
Summary: Several commenters list their favorite things about the Super Bowl, including half-time shows, the funny commercials, the Puppy Bowl, eating food, and spending time with family. A couple of commenters admit to not being football fans but still enjoying the Super Bowl. Some commenters discuss whether they thought the Falcons or the Patriots were going to win, while others list teams they wish were in the game.

Table 1: Example summary of comments from a New York Times article discussing people's favorite parts of the Super Bowl. The summary is an analysis of the comments and quantifies the viewpoints present.

Unlike documents, articles, and scientific papers, which contain specific linguistic structures and conventions such as topic sentences and abstracts, conversational text scatters main points across multiple utterances and between numerous writers. As a result, text summarization in the conversational data domain offers a challenging research field in which to test newly-developed models (Chen and Yang, 2020).

Recently, Gliwa et al. (2019a) introduced a dataset for chat-dialogue conversation summarization consisting of 16k examples, the first large-scale dataset of its kind. Previous work in conversation summarization was limited by the data available and focused primarily on meeting summarization, such as the AMI (Kraaij et al., 2005) and ICSI (Janin et al., 2003) datasets.

The datasets used in recent conversation papers are often not uniform, ranging from visual dialogue data (Goo and Chen, 2018a) to customer-service dialogues (Yuan and Yu, 2019), and were not initially intended for summarization. The availability of benchmark datasets for comparing methods has limited work in other conversation summarization domains and thus likely inhibited progress (Kryscinski et al., 2019; Fabbri et al., 2020b).

We aim to address this research gap by crowdsourcing a suite of four datasets, which we call ConvoSumm, that can evaluate a model's performance on a broad spectrum of conversation data. In determining the domains of data to collect, we use the general definition of conversation as "any discourse produced by more than one person" (Ford, 1991). We identify several key categories of data for which standard human-created development and testing datasets do not exist, namely (1) news article comments, (2) discussion forums and debate, (3) community question answering, and (4) email threads. We design annotation protocols motivated by work in quantifying viewpoints present in news comment data (Barker and Gaizauskas, 2016a) to crowdsource 250 development and 250 test examples for each of the above domains. We provide an example of comments on a New York Times news article, along with our crowdsourced summary, in Table 1.

In addition to introducing manually-curated datasets for conversation summarization, we also aim to unify previous work in conversation summarization. Namely, we benchmark a state-of-the-art abstractive model on several conversation datasets: dialogue summarization from SAMSum (Gliwa et al., 2019b), heuristic-generated community question answering from CQASumm (Chowdhury and Chakraborty, 2018), meeting summarization data from AMI and ICSI, and smaller test sets in the news comments, discussion forum, and email domains. We believe that such benchmarking will facilitate a more straightforward comparison of conversation summarization models across domains.

To unify modeling across these conversational domains, we propose to use recent work in end-to-end argument mining (Lenz et al., 2020; Stab and Gurevych, 2014; Chakrabarty et al., 2019) to instantiate the theoretical graph framework that motivated our annotation protocol, proposed by Barker and Gaizauskas (2016a) for conversation summarization. This protocol is employed to both identify and use the "issues–viewpoints–assertions" argument structure (discussed in Related Work) for summarizing news comments. We construct this argument graph using entailment relations, linearize the graph, train a graph-to-text model (Ribeiro et al., 2020), and experiment with argument mining as a way to reduce noise in long-text input.

Our contributions are the following: (1) we crowdsource datasets for four domains of conversational data and analyze the characteristics of our proposed datasets; (2) we benchmark state-of-the-art models on these datasets as well as on previous widely-used conversation summarization datasets to provide a clear baseline for future work; and (3) we apply argument mining to better model the structure of our conversational data as well as to reduce noise in long-text input, showing comparable or improved results in both automatic and human evaluations.¹

¹For reproducibility of our findings, we will make our data and code publicly available at https://github.com/Yale-LILY/ConvoSumm.

2 Related Work

Modeling Conversation Summarization Early approaches to conversation summarization consisted of feature engineering (Shasha Xie et al., 2008), template selection methods (Oya et al., 2014), and statistical machine learning approaches (Galley, 2006; Wang and Cardie, 2013). More recent modeling approaches for dialogue summarization have attempted to take advantage of conversation structures found within the data through dialogue act classification (Goo and Chen, 2018b), discourse labeling (Ganesh and Dingliwal, 2019), topic segmentation (Liu et al., 2019c), and key-point analysis (Liu et al., 2019a). Chen and Yang (2020) utilize multiple conversational structures from different perspectives in their sequence-to-sequence model. However, such approaches focus exclusively on dialogue summarization, and it is not trivial to extend these methods to longer conversations with many more participants. We thus introduce a method to model the structure of the discourse over the many-party conversation.

Several existing works have focused on conceptualizing conversation structure for summarization and how to present this structure to end-users. Barker et al. (2016a) propose a conversation overview summary that aims to capture the key argumentative content of a reader comment conversation.

Misra et al. (2017) use summarization as a means of probing online debates to discover central propositions, which they cluster to identify argument facets. Barker and Gaizauskas (2016b) identify three key components of conversational dialogue: issues (that individuals discuss), viewpoints (that they hold about these issues), and assertions (that they make to support their viewpoints). We build on this framework and on advances in argument mining for end-to-end training for summarization.

Argument Mining Work in argument mining (Stab and Gurevych, 2014) has aimed to identify argumentative units and classify them into claims, premises, and major claims, i.e., claims describing the key concept in a text. More recently, Chakrabarty et al. (2019) propose to fine-tune BERT (Devlin et al., 2019) for identifying argumentative units and relationships between them within a text and across texts. Lenz et al. (2020) are the first to propose an end-to-end approach for constructing an argument graph (Stede et al., 2016), a structured representation of claims and premises in an argumentative text; the graph is built by connecting claim and premise argumentative discourse units. We build on this framework for modeling discourse in conversational data.

Few-Shot Summarization As the datasets we introduce are not on the scale of larger datasets, we focus on few-shot and domain-transfer summarization techniques. Wang et al. (2019) examine domain adaptation in extractive summarization, while Hua and Wang (2017) examine domain adaptation between opinion and news summarization. Within unsupervised abstractive summarization, several approaches have made use of variational autoencoders (Baziotis et al., 2019; Chu and Liu, 2019; Bražinskas et al., 2020) and pretrained language models (Zhou and Rush, 2019; Laban et al., 2020).

Recent work in abstractive (Zhang et al., 2019; Fabbri et al., 2020a) and extractive-compressive summarization (Desai et al., 2020) has shown the power of pretrained models for few-shot transfer. The quality of models trained on several hundred examples in these papers is comparable to that of models trained on the equivalent full datasets. Thus, we believe that introducing curated validation and testing datasets consisting of a few hundred examples is a valuable contribution within the current paradigm; this is confirmed by the poor performance of models transferred from other domains compared to models trained on our validation data.

3 ConvoSumm

In this section, we introduce our dataset selection, our annotation protocol, and the characteristics of our crowdsourced dataset.

Data Selection For the news comments subdomain, we use the NYT Comments dataset, which consists of 2 million comments made on 9,000 New York Times articles published between 2017 and 2018. It is publicly available and has been used in work on news-comment relevance modeling (Kolhatkar and Taboada, 2017); it also contains metadata that may be of use in summarization modeling. For the discussion forums and debate subdomain, we select Reddit data from CoarseDiscourse (Zhang et al., 2017), which contains annotations about the discourse structure of the threads. For the community question answering subdomain, we use StackExchange (Stack), which provides access to all forums and has been used in modeling for answer relevance and question deduplication (Hoogeveen et al., 2015). We chose StackExchange over the commonly-used Yahoo! Answers data due to licensing reasons. For the email threads subdomain, we use the publicly-available W3C corpus (Craswell et al., 2005). Previous work also made use of this dataset for email summarization (Ulrich et al., 2008) but provided only a small sample of 40 email threads, for which we provide transfer testing results.

We generally follow the guidance of Tomasoni and Huang (2010), drawn from their work on summarizing community question answering forums, in determining which subsets of data to select from the above datasets. We remove an example if (1) there were fewer than five posts (four in the case of email threads; "post" refers to any answer, comment, or email); (2) the longest post was over 400 words; (3) the sum of all post lengths was outside of [100, 1400] words (although we extended this maximum length for NYT comments); or (4) the average length of the posts was outside of the [50, 300] word interval (a sketch of this filter is shown below). For Stack data, we first filtered answers which received a negative community rating, defined as the number of user upvotes minus the number of user downvotes. While real-world settings may contain much longer threads, we later show that this setting is already challenging.
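The sketch below is our Python illustration of this filter; representing a thread as a list of post strings is an assumption made for clarity, and this is not the released preprocessing code.

    # Illustrative sketch of the length-based selection criteria above.
    # A thread is assumed to be a list of post strings; min_posts is 4 for
    # email threads and 5 elsewhere, and the bounds follow rules (1)-(4).
    def keep_example(posts, min_posts=5, max_post_words=400,
                     total_bounds=(100, 1400), avg_bounds=(50, 300)):
        lengths = [len(p.split()) for p in posts]
        if len(posts) < min_posts:                           # rule (1)
            return False
        if max(lengths) > max_post_words:                    # rule (2)
            return False
        total = sum(lengths)
        if not total_bounds[0] <= total <= total_bounds[1]:  # rule (3)
            return False
        avg = total / len(lengths)
        return avg_bounds[0] <= avg <= avg_bounds[1]         # rule (4)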

Dataset   % novel n-grams     Extractive Oracle   Summary Length   Input Length   # Docs/Example
NYT       36.11/79.72/94.52   36.26/10.21/31.23   79               1624           16.95
Reddit    43.84/84.98/95.65   35.74/10.45/30.74   65                641            7.88
Stack     35.12/77.91/93.56   37.30/10.70/31.93   73               1207            9.72
Email     42.09/83.27/93.98   40.98/15.50/35.22   74                917            4.95

Table 2: Statistics across dataset sources in ConvoSumm, showing novel uni/bi/tri-grams, ROUGE-1/2/L extractive oracle scores, the average summary and input lengths (number of tokens), and the number of documents per example, where each comment/post/answer/email is considered a document.

Annotation Protocol We designed annotation instructions for crowdsourced workers to write abstractive summaries for each of the four datasets, motivated by work in summarizing viewpoints present in online conversation (Barker and Gaizauskas, 2016a). We present the crowdsource workers with the data threads, along with any available metadata. For NYT, we presented the workers with the article headline, keywords, and, rather than providing the entire article as context, an extractive BERT-based summary (Miller, 2019) of the article. We use a BERT summary to give the annotators an idea of the topic of the article. We avoided having annotators read the entire article since the focus of their summaries was solely the content of the comments as per the annotation protocols, and reading the entire article could introduce information into the summaries that was not necessarily representative of the comments' main points. We found that these summaries were useful in initial in-house annotations and allowed us to better understand the context of the comments being summarized. For Reddit and Stack, question tags and information about the subforum were provided; the Stack data includes both answers and answer comments, as well as the prompt/title. Reddit data was filtered simply on word limits due to the unavailability of up/down votes in the CoarseDiscourse data. Whenever possible, we included username information and the scores of all comments, posts, and answers.

Although the instructions differed slightly with the specific nuances of each dataset, they had standard overall rules: (1) summaries should be an analysis of the given input rather than another response or utterance; (2) summaries should be abstractive, i.e., annotators were required to paraphrase and could not repeat more than five words in a row from the source; and (3) summary lengths should be within [40, 90] tokens. Following the issues–viewpoints–assertions framework presented in Barker and Gaizauskas (2016b), we also instructed annotators that summaries should cover all viewpoints in the input and should try to include specific details from assertions and anecdotes (unless this made the summary too lengthy). Summarizing based on similar viewpoints is analogous to clustering then summarizing, similar to the comment label grouping procedure before summarization in Barker et al. (2016b). To help with this, we recommended wording such as "Most commenters suggest that..." and "Some commenters think that..." to group responses with similar viewpoints.

However, the email dataset was unique among the selected datasets in that it contained more back-and-forth dialogue than clusters of viewpoints, and thus identifying the speakers was essential to creating summaries that retained meaning from the original email dialogue. Since the email threads contained fewer individual speakers than the other datasets, this sort of summarization remained feasible. Thus, for this dataset, annotators were instructed to specify the speakers when summarizing the conversation.

Quality-Controlled Crowdsourcing We crowdsourced our data using Amazon Mechanical Turk. We required that our workers be native English speakers and pass a qualifying exam for each domain to be summarized. We worked with a select group of about 15 workers who formed a community of high-quality annotators. Example summaries were provided to the workers. The workers submitted the qualifying exam, and then one of the authors of this paper provided feedback.

If a worker was not sure of the quality of the summaries written, at any point, they could enlist the input of one of the authors.

Additionally, after the workers wrote all summaries, we manually reviewed every summary and made corrections to grammar, wording, and overall structure. Summaries we could not fix ourselves, either because they were poorly written or because they did not follow the annotation protocols, were flagged to be re-written. They were then sent to our approved group of workers to be re-written, excluding any workers who had written a flagged summary. While data crowdsourced from non-experts may contain noise (Gillick and Liu, 2010), we believe that our setup of working closely with a small group of workers, providing feedback to individual workers, and manually reviewing all final summaries mitigates these issues.

Dataset Statistics We provide statistics in Table 2. The percentage of novel n-grams in our summaries is higher than that of the very abstractive XSum dataset (Narayan et al., 2018) (35.76/83.45/95.50 % novel uni/bi/tri-grams); a sketch of this statistic is shown below. This level of abstraction is likely due to the instructions to perform abstractive summarization and to the summaries being an analysis of the input, which results in the insertion of new words (e.g., "commenters" likely isn't seen in the input). The influence of this abstraction is further seen in an analysis of the Extractive Oracle, for which we show ROUGE-1/2/L (Lin, 2004). The extractive oracle performance on our data is above that on the very abstractive XSum dataset (Narayan et al., 2018) (29.79 ROUGE-1), but much lower than that on the CNN-DailyMail (CNNDM) dataset (Nallapati et al., 2016) (>50 ROUGE-1). The summary lengths are fairly consistent, while the input lengths are the longest for NYT and Stack data. We include the title and additional metadata, such as the headline and snippet in NYT data, in input length calculations.
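For reference, the novel n-gram percentages in Table 2 can be computed as in the following sketch; whitespace tokenization is an assumption here, as we do not reproduce the exact tokenizer.

    # Sketch of the novel n-gram statistic: the percentage of summary
    # n-grams that never appear in the source (whitespace tokenization).
    def novel_ngram_pct(source, summary, n):
        src_tokens, summ_tokens = source.split(), summary.split()
        src_ngrams = {tuple(src_tokens[i:i + n])
                      for i in range(len(src_tokens) - n + 1)}
        summ_ngrams = [tuple(summ_tokens[i:i + n])
                       for i in range(len(summ_tokens) - n + 1)]
        novel = sum(1 for g in summ_ngrams if g not in src_ngrams)
        return 100.0 * novel / max(len(summ_ngrams), 1)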
We analyze multi-document summarization-specific characteristics of our datasets, as proposed by Dey et al. (2020a). In particular, inter-document similarity measures the degree of overlap of semantic units in the candidate documents, with scores farther from zero signifying less overlap. The notion introduced for redundancy measures the overall distribution of semantic units; the farther the score is from zero, the more uniform semantic units are across the entire input, with the maximum reached when each unit is present only once. Layout bias measures the similarity of multi-sentential documents with the reference. For more precise definitions, we refer the reader to Dey et al. (2020a). We provide results for our data in Table 3.

Dataset   Inter-document Similarity   Redundancy   Layout Bias
NYT       -11.71                      -0.23        0.2/0.5/0.3
Reddit     -7.56                      -0.49        0.2/0.5/0.2
Stack      -9.59                      -0.27        0.2/0.3/0.4
Email      -1.76                      -0.18        0.3/0.4/0.3

Table 3: Multi-document summarization-specific dataset analysis on our proposed datasets with metrics introduced in Dey et al. (2020a): inter-document similarity (farther from zero means less similarity), redundancy (farther from zero means less overall redundancy of semantic units), and start/middle/end layout bias.

Email data exhibits the most inter-document similarity, which follows the intuition that an email thread consists of a focused discussion, typically on a single topic. For redundancy, we see that Reddit shows the most uniform distribution of semantic units, perhaps due to Reddit threads' less focused nature compared to the remaining datasets. We do not see a particularly strong layout bias across any parts of the input documents. Our datasets exhibit greater or comparable levels of novel n-grams compared to multi-document summarization datasets such as MultiNews (Fabbri et al., 2019) and CQASUMM (Chowdhury and Chakraborty, 2018). Our Stack subset has lower inter-document similarity, which presents challenges for models that rely strictly on redundancy in the input, and our datasets generally exhibit less layout bias when compared to the analysis done in Dey et al. (2020b).

Comparison to Existing Datasets Although previous work on conversation summarization, before the introduction of SAMSum (Gliwa et al., 2019b), largely featured unsupervised or few-shot methods, there exist several datasets with reference summaries. These include SENSEI (Barker et al., 2016b) for news comments, the Argumentative Dialogue Summary Corpus (ADS) (Misra et al., 2015) for discussion forums, and the BC3 dataset (Ulrich et al., 2009) for email data. However, many of the existing datasets are not wide in scope. For example, SENSEI only covers six topics, and the ADS Corpus covers one topic and only has 45 dialogues. Furthermore, they each pertain to one subdomain of conversation. Our dataset avoids these issues by covering four diverse subdomains of conversation and having approximately 500 annotated summaries for each subdomain. Additionally, since neural abstractive summarization baselines do not exist for these datasets, we benchmark our models on them to further their use as test sets. We similarly include the AMI and ICSI meeting datasets within our benchmark.

Within community question answering, the WikiHowQA dataset (Deng et al., 2020) consists of user response threads to non-factoid questions starting with "how to," including labels for the answer selection task and reference summaries.

Figure 1: Sample argument subgraph constructed from NYT news comments illustrating varying viewpoints. Claims "I honestly..." and "but I dont.." are entailed by premises, connected through Default Inference nodes, and opposing claims are connected through Issue nodes.

The CQASUMM dataset (Chowdhury and Chakraborty, 2018) sampled threads from Yahoo! Answers in which the best answer could be used as a reference summary. However, this heuristic is not guaranteed to cover the perspectives of all the user answers, so we believe our dataset is a more principled benchmark for community question answering.

We also note that several large-scale MDS datasets have been introduced in the news domain (Fabbri et al., 2019; Gu et al., 2020; Gholipour Ghalandari et al., 2020), for creating Wikipedia lead paragraphs (Liu et al., 2018), and for long-form question answering (Fan et al., 2019). However, these do not focus on the conversational domain.

4 Argument Graph Summarization

As our annotation protocol is motivated by the issues–viewpoints–assertions framework proposed in Barker and Gaizauskas (2016a), we propose to instantiate a modified version of that work's theoretical graph model.

Argument Graph Construction We build on the argument graph formulation of Lenz et al. (2020), a variant of the Argument Interchange Format (Chesnevar et al., 2006). Claims and premises are represented as information nodes (I-nodes), with the relations between them represented as scheme nodes (S-nodes). Let V = I ∪ S be the set of nodes, and E ⊂ V × V the set of edges describing support relationships among the nodes. We then define the argument graph G = (V, E).

Lenz et al. (2020) break the construction of the argument graph down into four steps: (1) argument extraction, or the identification of argumentative discourse units; (2) relationship type classification, or the classification of edges between nodes; (3) major claim detection; and (4) graph construction, or the construction of the final graph based on the identified nodes and edges. To adapt this formulation to our multi-document setting, we first perform argument extraction and relationship type classification for each individual input document, and finally graph construction to determine relationships among claims from all documents.
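For concreteness, the following is a minimal sketch of these structures in Python; the class and field names are our illustrative choices, not part of the Lenz et al. (2020) implementation.

    # Sketch of the AIF-style argument graph: claims/premises are I-nodes,
    # relations (e.g., Default Inference, Issue) are S-nodes, and edges
    # carry support relationships, so that G = (V, E) with V = I ∪ S.
    from dataclasses import dataclass, field

    @dataclass
    class INode:               # information node: a claim or premise sentence
        text: str
        kind: str              # "claim" or "premise"

    @dataclass
    class SNode:               # scheme node mediating a relation
        scheme: str            # e.g., "Default Inference" or "Issue"

    @dataclass
    class ArgumentGraph:
        nodes: list = field(default_factory=list)
        edges: list = field(default_factory=list)   # (parent, child) pairs

        def add_support(self, premise, claim, scheme="Default Inference"):
            s = SNode(scheme)
            self.nodes.append(s)
            self.edges.extend([(premise, s), (s, claim)])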
Argument Extraction For extracting arguments from a single document, we build on work in argument mining with pretrained models (Chakrabarty et al., 2019). As in Lenz et al. (2020), our argumentative units are sentences, from which we identify claims, which are assertions that something is true, and premises, which are propositions from which a conclusion is drawn. Additionally, we identify and remove non-argumentative units. We train a three-way classifier for the task of argument extraction, following Chakrabarty et al. (2019) and making use of data for argument mining from that paper and from Stab and Gurevych (2014). The output of this step can also simply be used, without further graph construction, as a less noisy version of the input, which we call -arg-filtered.
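As an illustration of this step, the sketch below scores sentences with a sequence-classification head; the checkpoint path is a placeholder for a model fine-tuned on the argument-mining data above, not a released artifact, and the label order is our assumption.

    # Sketch of the three-way argumentative-unit classifier
    # (claim / premise / non-argumentative) over sentences.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    LABELS = ["claim", "premise", "non-argumentative"]
    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "path/to/argument-extraction-model", num_labels=3)  # placeholder

    def classify_sentences(sentences):
        batch = tok(sentences, padding=True, truncation=True,
                    return_tensors="pt")
        with torch.no_grad():
            logits = model(**batch).logits
        return [LABELS[i] for i in logits.argmax(dim=-1).tolist()]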
Relationship Type Classification We follow the procedure in Lenz et al. (2020) and use entailment to determine the relationship between argumentative units within a document. However, rather than using the classifier provided, we make use of RoBERTa (Liu et al., 2019b) fine-tuned on the MNLI entailment dataset (Williams et al., 2018). Rather than using both support and contradiction edges between claims and premises, we make the simplification that all relationships can be captured with support edges, as we are dealing with a single document in this step.

Within a single text, a premise can be tied as following from one of the claims. We create an edge between a premise and the claim it most entails if the entailment score from RoBERTa is greater than 0.33, a threshold chosen based on manual analysis of the scores. If a premise is not labeled as supporting any claim, we heuristically create an edge between that premise and the closest claim preceding it in the text.
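The following sketch illustrates this attachment step. We assume the public roberta-large-mnli checkpoint, whose output labels are ordered contradiction/neutral/entailment; the fallback to the closest preceding claim is left outside the sketch.

    # Sketch of premise-to-claim attachment with an MNLI entailment model;
    # the 0.33 threshold follows the manual analysis described above.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
    nli = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

    def entailment_score(premise, claim):
        batch = tok(premise, claim, return_tensors="pt", truncation=True)
        with torch.no_grad():
            probs = nli(**batch).logits.softmax(dim=-1)[0]
        return probs[2].item()                    # P(entailment)

    def attach_premise(premise, claims):
        scores = [entailment_score(premise, c) for c in claims]
        best = max(range(len(claims)), key=scores.__getitem__)
        # None signals the fallback: link to the closest preceding claim.
        return best if scores[best] > 0.33 else None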
Since not all texts in the benchmark datasets may be argumentative, or may be too short to contain major claims, we use some heuristics in our graph creation. If none of the argumentative sentences are labeled as claims in argument extraction (i.e., all are labeled as premises), the text's first sentence is labeled as the claim. Furthermore, we do not identify a single claim as the major claim, since there may be multiple major points of discussion.

Graph Construction For the final graph, for each of the documents in an example, we run the above procedure and obtain a set of claims and associated premises. We then identify support edges between claims, which may be across documents. One claim may make a larger assertion, which is supported by other claims. We run our entailment model over all potential edges (in both directions) among claims and greedily add edges according to the entailment support score while no cycles are made (a sketch of this greedy step is given below). After this step, we are left with a set of claims which do not entail any other nodes or, stated otherwise, do not have parent nodes. Following the terminology of Barker and Gaizauskas (2016b), these nodes can be considered viewpoints.

We then identify issues, or topics on which the viewpoints differ. We run our entailment model again, in both directions, over these parent claim nodes and identify nodes that contradict each other with probability over 0.33, based on manual analysis of the resulting graphs. We greedily add edges to maintain a tree structure, joining these nodes to a special node, which we call the Issue node. All Issue nodes, as well as claims which are not connected to any Issue node, are connected to a dummy Conversation Node, which serves as the root of the argument graph. We show an example Issue subgraph for NYT data in Figure 1.
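The sketch below shows one way to realize the greedy, cycle-free edge addition; the exact tie-breaking and data structures in our pipeline may differ, so this should be read as an approximation of the procedure described above.

    # Sketch of greedy cycle-free edge addition over claim pairs:
    # candidate support edges (scored in both directions) are taken in
    # decreasing entailment order and kept only if the graph stays acyclic.
    def add_edges_greedily(num_claims, scored_pairs):
        """scored_pairs: list of (score, parent_idx, child_idx) tuples."""
        children = {i: set() for i in range(num_claims)}

        def reaches(src, dst):                    # DFS reachability check
            stack, seen = [src], set()
            while stack:
                node = stack.pop()
                if node == dst:
                    return True
                if node not in seen:
                    seen.add(node)
                    stack.extend(children[node])
            return False

        kept = []
        for score, parent, child in sorted(scored_pairs, reverse=True):
            if not reaches(child, parent):        # parent->child adds no cycle
                children[parent].add(child)
                kept.append((parent, child))
        return kept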
Argument Graphs to Summaries Recent work has shown the strength of text-based pretrained models on graph-to-text problems (Ribeiro et al., 2020). Following that work, we linearize the graph using a depth-first traversal starting from the Conversation Node. We found that inserting special tokens to signify edge types did not improve performance, likely due to the size of our data, and simply make use of an arrow (→) to signify the relationship between sentences. We train a sequence-to-sequence model on our linearized graph input, which we call -arg-graph.
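A minimal sketch of this linearization, with the graph represented as a simple adjacency mapping from node text to child texts (an assumption made for illustration):

    # Sketch of the depth-first linearization: starting from the dummy
    # Conversation Node, node texts are emitted in DFS order with an
    # arrow marking each traversed edge.
    def linearize(children, root):
        pieces = []

        def visit(node):
            pieces.append(node)
            for child in children.get(node, []):
                pieces.append("→")
                visit(child)

        visit(root)
        return " ".join(pieces)

    # e.g., linearize({"Conversation Node": ["Claim A", "Claim B"],
    #                  "Claim A": ["Premise A1"]}, "Conversation Node")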
5 Experimental Settings

We use the fairseq codebase (Ott et al., 2019) for our experiments. Our base abstractive text summarization model is BART-large (Lewis et al., 2020), a pretrained denoising autoencoder with 336M parameters that builds on the sequence-to-sequence transformer of Vaswani et al. (2017). We fine-tune BART using a polynomial decay learning rate scheduler with the Adam optimizer (Kingma and Ba, 2015). We used a learning rate of 3e-5, with 20 warmup updates and 200 total updates, following previous few-shot transfer work (Fabbri et al., 2020a). We could have equally fine-tuned other pretrained models such as Pegasus (Zhang et al., 2019) or T5 (Raffel et al., 2019), but Fabbri et al. (2020a) find that BART largely performs equally well in few-shot settings when compared to Pegasus.
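In plain PyTorch terms, this schedule corresponds roughly to the following sketch, assuming fairseq's polynomial_decay semantics (linear warmup followed by polynomial decay to zero); the stand-in module is only there to make the example self-contained.

    # Sketch of the few-shot fine-tuning schedule: lr 3e-5, 20 warmup
    # updates, 200 total updates, linear (power=1) decay afterwards.
    import torch

    BASE_LR, WARMUP, TOTAL = 3e-5, 20, 200

    def lr_at(step, end_lr=0.0, power=1.0):
        if step < WARMUP:
            return BASE_LR * step / WARMUP
        remaining = (TOTAL - step) / max(TOTAL - WARMUP, 1)
        return end_lr + (BASE_LR - end_lr) * max(remaining, 0.0) ** power

    params = torch.nn.Linear(8, 8).parameters()   # stand-in for BART params
    optimizer = torch.optim.Adam(params, lr=BASE_LR)
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lr_lambda=lambda step: lr_at(step) / BASE_LR)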
For the NYT and Stack datasets, which contain sequences over the typical 1024-token maximum encoder length with which BART is trained, we copied the encoder positional embeddings to allow sequences up to length 2048. To address the input length of meeting transcripts, which range from 6k to 12k tokens, we use the Longformer (Beltagy et al., 2020), which allows for sequences up to 16k tokens.
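One way to realize this copying is sketched below; the tensor names are illustrative and do not correspond to fairseq internals.

    # Sketch of extending learned positional embeddings from 1024 to 2048
    # positions by tiling the original table so new slots reuse old weights.
    import torch

    def extend_positions(pos_emb, new_max=2048):
        """pos_emb: [old_max, dim] learned positional embedding weights."""
        old_max, _ = pos_emb.shape
        reps = -(-new_max // old_max)                 # ceiling division
        return pos_emb.repeat(reps, 1)[:new_max].clone()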

We initialize the Longformer model with BART parameters trained on the CNN-DailyMail dataset, as the meeting summarization datasets contain fewer than 100 data points. We otherwise fine-tune models from vanilla BART, following intuition in few-shot summarization (Fabbri et al., 2020a) and based on initial experiments. In the tables which follow, "-arg" refers to any model trained with argument-mining-based input, and we specify below which of the -arg-graph or -arg-filtered settings was used for each dataset.

6 Results

We provide results for baseline unsupervised extractive models in Table 4: LexRank (Erkan and Radev, 2004), TextRank (Mihalcea and Tarau, 2004), and BERT-ext (Miller, 2019), which makes use of BERT (Devlin et al., 2019). The unsupervised extractive models perform well below the extractive oracle, suggesting the difficulty of content selection in this setting.

Dataset   LexRank            TextRank           BERT-ext
NYT       22.30/3.87/19.14   25.11/3.75/20.61   25.88/3.81/22.00
Reddit    22.71/4.52/19.38   24.38/4.54/19.84   24.51/4.18/20.95
Stack     26.30/5.62/22.27   25.43/4.40/20.58   26.84/4.63/22.85
Email     16.04/3.68/13.38   19.50/3.90/16.18   25.46/6.17/21.73

Table 4: ROUGE-1/2/L results for extractive LexRank (Erkan and Radev, 2004), TextRank (Mihalcea and Tarau, 2004), and BERT-based (Miller, 2019) models.
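As a concrete reference for the ROUGE-1/2/L F1 numbers reported in our tables, scores can be computed with the rouge-score package as sketched below; this stands in for our exact evaluation script, which is not reproduced here, and the example strings are purely illustrative.

    # Sketch of ROUGE-1/2/L scoring (pip install rouge-score).
    from rouge_score import rouge_scorer

    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                      use_stemmer=True)
    scores = scorer.score(
        "Several commenters list their favorite things about the Super Bowl.",
        "Commenters discuss what they like about the Super Bowl.")
    print({k: round(v.fmeasure * 100, 2) for k, v in scores.items()})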
For the abstractive models, we train BART on 200 examples from our validation set, using the remaining 50 for validation, and test on the final test set of 250 examples. We also tested zero-shot transfer from CNNDM and SAMSum, although these resulted in much lower performance of about 28 ROUGE-1. Few-shot model performance is shown in Table 5. The abstractive model performs at or above the Extractive Oracle, suggesting the need for better abstractive models.

Dataset   BART                BART-arg
NYT       35.91/9.22/31.28    36.60/9.83/32.61
Reddit    35.50/10.64/32.57   36.39/11.38/33.57
Stack     39.61/10.98/35.35   39.73/11.17/35.52
Email     41.46/13.76/37.70   40.32/12.97/36.90

Table 5: ROUGE-1/2/L results for vanilla BART as well as BART trained on argument-mining input. Both are trained on 200 points from ConvoSumm.

We also train on our argument-mining-based approaches and show results in Table 5. We see ROUGE improvements when applying BART-arg-graph for Reddit and Stack data. The -arg-filtered variation (which, as defined in Section 4, is the less noisy version of the input produced by the argument extraction step) outperformed the -arg-graph variation on both email and NYT data. For email data, however, this did not improve upon the BART baseline, likely due to the dataset's characteristics: email data is shorter and more linear, and does not benefit from modeling the argument structure or removing non-argumentative units. We provide full results for both variations in the Appendix.

Benchmarking Other Conversation Summarization Datasets We benchmark our models on widely-used meeting summarization datasets. Due to the input's linear nature and the size of the meeting transcripts, we found improved results using -arg-filtered to remove non-argumentative units rather than incorporating the graph structure. Results are shown in Table 6. The Longformer model performs as well as or better than previous state-of-the-art results on these datasets, despite not making use of more complex modeling structures, and we generally see improvement with argument mining.

Method                AMI                 ICSI
HMNet                 53.02/18.57/-       46.28/10.60/-
DDA-GCN               53.15/22.32/-       -
Longformer-BART       54.20/20.72/51.36   43.03/12.14/40.26
Longformer-BART-arg   54.47/20.83/51.74   44.17/11.69/41.33

Table 6: ROUGE-1/2/L results for DDA-GCN (Feng et al., 2020) and HMNet (Zhu et al., 2020) on the AMI and ICSI meeting summarization datasets, along with our Longformer and Longformer-arg models.

As noted above, there exist prior datasets for dialogue, community question answering, email, forum, and news comments summarization. We benchmark results on these datasets in Table 7. We outperform prior work on SAMSum (Gliwa et al., 2019b) and CQASUMM (Chowdhury and Chakraborty, 2018) with our BART and BART-arg-graph models, respectively. We did not find improvement on SAMSum with the BART-arg model due to the extremely short and focused nature of the dialogues, analogous to the email data performance. We also provide transfer results of BART and BART-arg-graph models from our email and news-comment data to BC3 (Ulrich et al., 2009), ADS (Misra et al., 2015), and SENSEI (Barker et al., 2016b), for which no prior neural abstractive summarization results existed.

Dataset    Our results         Previous SOTA
SAMSum     52.27/27.82/47.92   49.30/25.60/47.70
CQASUMM    32.79/6.68/28.83    31.00/5.00/15.20
BC3        39.59/13.98/21.20   -
ADS        37.18/11.42/21.27   -
SENSEI     34.57/7.08/16.80    -

Table 7: Benchmarking results on conversational datasets such as SAMSum (Gliwa et al., 2019b) and CQASUMM (Chowdhury and Chakraborty, 2018), and initial neural abstractive summarization results for email (BC3) (Ulrich et al., 2008), debate discussion forums (ADS) (Misra et al., 2015), and news comments (SENSEI) (Barker et al., 2016b).

Human Evaluations We collect human judgment annotations for two of the four quality dimensions studied in Kryscinski et al. (2019) and Fabbri et al. (2020b), namely consistency and relevance.
Consistency is defined as the factual alignment between the summary and the summarized source text, while relevance is defined as the summary's ability to select important content; only relevant information and viewpoints should be included. We did not include fluency, as an initial inspection of the data found fluency to be of very high quality, as has been shown to be the case for pretrained models in news summarization (Fabbri et al., 2020b). We did not include coherence, as this was generally not an issue of concern in the initial analysis.

We randomly select 25 examples from the Reddit corpus and ten examples from the AMI corpus, along with output from the BART and BART-arg-graph models. These data points were chosen to demonstrate what characteristics are realized in differences across ROUGE for the argument-graph and argument-noise-reduction approaches. Ten examples were chosen from AMI due to the size of the input and annotation constraints. The annotator sees the source input and randomly-ordered output from the models and then rates the summaries for relevance and consistency on a Likert scale from 1 to 5, with 5 being the best score. We averaged the scores of three native English-speaking annotators on each example and then across examples. Results are shown in Table 8. We find that the annotators prefer our argument-mining-based approaches in both dimensions, although the results are close. Furthermore, the scores for relevance and consistency are rather low, especially on the Reddit dataset and when compared to results on the CNN-DailyMail dataset from Fabbri et al. (2020b). These results demonstrate the difficulty of modeling such conversational data. Examples are included in the appendix.

Target Dataset   BART: Relevance   BART: Consistency   BART-arg: Relevance   BART-arg: Consistency
Reddit           3.39 (0.13)       3.40 (0.12)         3.47 (0.12)           3.41 (0.10)
AMI              4.07 (0.16)       3.67 (0.16)         4.13 (0.17)           3.70 (0.17)

Table 8: Mean relevance and factual consistency annotations for BART and BART-arg outputs on Reddit and AMI. Standard errors are reported in parentheses.

7 Conclusion

We propose ConvoSumm, a benchmark of four new, crowdsourced conversation datasets, together with state-of-the-art baselines on widely-used datasets, that promotes more unified progress in summarization beyond the news domain. Our benchmark consists of high-quality, human-written summaries that call for abstractive summaries and a deeper understanding of the input texts' structure. We provide results for baseline models and propose to model the text's argument structure, showing that such structure helps better quantify viewpoints in non-linear input in both automatic and human evaluations. Our analysis notes challenges in modeling relevance and consistency in abstractive conversation summarization when compared to news summarization.

8 Ethical Considerations

As we propose novel conversation summarization datasets and modeling components, this section is divided into the following two parts.

8.1 New Dataset

Intellectual Properties and Privacy Rights All data for our newly-introduced datasets are available online; please see the following for New York Times comment data², StackExchange data³, and W3C email data⁴. Reddit data is available via the Google BigQuery tool⁵.

²https://www.kaggle.com/aashita/nyt-comments
³https://archive.org/download/stackexchange
⁴https://tides.umiacs.umd.edu/webtrec/trecent/parsed_w3c_corpus.html
⁵https://console.cloud.google.com/bigquery

Compensation for Annotators We compensated the Turkers approximately $12–$15 per hour. We first annotated examples in-house to determine the required annotation speed. Typically, the summarization task took around 10 minutes, and we compensated the workers from $2.25 to $3.00 per task, depending on the domain and deadline requirements.

Steps Taken to Avoid Potential Problems We interacted closely with the Turkers to ensure that compensation was fair and that the instructions were clear. To maintain the quality of the dataset, we manually reviewed the crowdsourced summaries for language use. Initial investigation into the Reddit data showed certain inappropriate language usage, so we filtered these examples automatically.

8.2 NLP Application

Bias Biases may exist in the datasets, such as political bias in the news datasets and gender bias in potentially all of the datasets. Thus, models trained on these datasets may propagate these biases. We removed data with offensive language when possible.

Misuse Potential and Failure Mode When used as intended, applying the summarization models described in this paper can save people much time. However, the current models are still prone to producing hallucinated summaries, and in such cases they may contribute to misinformation on the internet. Further research is needed to ensure the faithfulness of abstractive summaries in order to address this issue, which is present among all current abstractive summarization models.

Environmental Cost The experiments described in the paper make use of V100 GPUs. We used up to 8 GPUs per experiment (depending on the experiment; sometimes a single GPU was used to run the maximum number of experiments in parallel). The experiments may take up to a couple of hours for the larger datasets. Several dozen experiments were run due to parameter search, and future work should experiment with distilled models for more lightweight training. We note that while our work required extensive experiments to draw sound conclusions, future work will be able to draw on these insights and need not run as many large-scale comparisons. Models in production may be trained once using the most promising settings.

References

Emma Barker and Robert Gaizauskas. 2016a. Summarizing multi-party argumentative conversations in reader comment on news. In Proceedings of the Third Workshop on Argument Mining (ArgMining2016), pages 12–20, Berlin, Germany. Association for Computational Linguistics.

Emma Barker and Robert Gaizauskas. 2016b. Summarizing multi-party argumentative conversations in reader comment on news. In Proceedings of the Third Workshop on Argument Mining (ArgMining2016), pages 12–20, Berlin, Germany. Association for Computational Linguistics.

Emma Barker, Monica Lestari Paramita, Ahmet Aker, Emina Kurtic, Mark Hepple, and Robert Gaizauskas. 2016a. The SENSEI annotated corpus: Human summaries of reader comment conversations in on-line news. In Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 42–52, Los Angeles. Association for Computational Linguistics.

Emma Barker, Monica Lestari Paramita, Ahmet Aker, Emina Kurtic, Mark Hepple, and Robert Gaizauskas. 2016b. The SENSEI annotated corpus: Human summaries of reader comment conversations in on-line news. In Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 42–52, Los Angeles. Association for Computational Linguistics.

Christos Baziotis, Ion Androutsopoulos, Ioannis Konstas, and Alexandros Potamianos. 2019. SEQ^3: Differentiable sequence-to-sequence-to-sequence autoencoder for unsupervised abstractive sentence compression. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 673–681, Minneapolis, Minnesota. Association for Computational Linguistics.

Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.

Arthur Bražinskas, Mirella Lapata, and Ivan Titov. 2020. Unsupervised opinion summarization as copycat-review generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5151–5169, Online. Association for Computational Linguistics.

Tuhin Chakrabarty, Christopher Hidey, Smaranda Muresan, Kathy McKeown, and Alyssa Hwang. 2019. AMPERSAND: Argument mining for PERSuAsive oNline discussions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2933–2943, Hong Kong, China. Association for Computational Linguistics.

Jiaao Chen and Diyi Yang. 2020. Multi-view sequence-to-sequence models with conversational structure for abstractive dialogue summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4106–4118, Online. Association for Computational Linguistics.

Carlos Chesnevar, Sanjay Modgil, Iyad Rahwan, Chris Reed, Guillermo Simari, Matthew South, Gerard Vreeswijk, Steven Willmott, et al. 2006. Towards an argument interchange format. The Knowledge Engineering Review, 21(4):293–316.

Tanya Chowdhury and Tanmoy Chakraborty. 2018. CQASUMM: Building references for community question answering summarization corpora.

Eric Chu and Peter J. Liu. 2019. MeanSum: A neural model for unsupervised multi-document abstractive summarization. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 1223–1232. PMLR.

                                                    6875
Nick Craswell, Arjen P de Vries, and Ian Soboroff.          and few-shot abstractive summarization with inter-
  2005. Overview of the trec 2005 enterprise track.         mediate fine-tuning and data augmentation. arXiv
  In TREC, volume 5, pages 199–205.                         preprint arXiv:2010.12836.

Yang Deng, Wai Lam, Yuexiang Xie, Daoyuan Chen,           Alexander R Fabbri, Wojciech Kryściński, Bryan
  Yaliang Li, Min Yang, and Ying Shen. 2020. Joint          McCann, Caiming Xiong, Richard Socher, and
  learning of answer selection and answer summary           Dragomir Radev. 2020b.       Summeval:     Re-
  generation in community question answering. In            evaluating summarization evaluation.     arXiv
  The Thirty-Fourth AAAI Conference on Artificial In-       preprint arXiv:2007.12626.
  telligence, AAAI 2020, The Thirty-Second Innova-
  tive Applications of Artificial Intelligence Confer-    Angela Fan, Yacine Jernite, Ethan Perez, David Grang-
  ence, IAAI 2020, The Tenth AAAI Symposium on Ed-          ier, Jason Weston, and Michael Auli. 2019. ELI5:
  ucational Advances in Artificial Intelligence, EAAI       Long form question answering. In Proceedings of
  2020, New York, NY, USA, February 7-12, 2020,             the 57th Annual Meeting of the Association for Com-
  pages 7651–7658. AAAI Press.                              putational Linguistics, pages 3558–3567, Florence,
                                                            Italy. Association for Computational Linguistics.
Shrey Desai, Jiacheng Xu, and Greg Durrett. 2020.         Xiachong Feng, Xiaocheng Feng, Bing Qin, Xin-
  Compressive summarization with plausibility and           wei Geng, and Ting Liu. 2020.           Dialogue
  salience modeling. In Proceedings of the 2020 Con-        discourse-aware graph convolutional networks for
  ference on Empirical Methods in Natural Language          abstractive meeting summarization. arXiv preprint
  Processing (EMNLP), pages 6259–6274, Online. As-          arXiv:2012.03502.
  sociation for Computational Linguistics.
                                                          Cecilia E Ford. 1991. Linguistics: The cambridge sur-
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and               vey: Volume 4. language: The socio-cultural context.
   Kristina Toutanova. 2019. BERT: Pre-training of          frederick h. newmeyer (ed.). Studies in Second Lan-
   deep bidirectional transformers for language under-      guage Acquisition, 13(3):412–413.
   standing. In Proceedings of the 2019 Conference
   of the North American Chapter of the Association       Michel Galley. 2006. A skip-chain conditional random
   for Computational Linguistics: Human Language            field for ranking meeting utterances by importance.
  Technologies, Volume 1 (Long and Short Papers),           In Proceedings of the 2006 Conference on Empiri-
   pages 4171–4186, Minneapolis, Minnesota. Associ-         cal Methods in Natural Language Processing, pages
   ation for Computational Linguistics.                     364–372, Sydney, Australia. Association for Com-
                                                            putational Linguistics.
Alvin Dey, Tanya Chowdhury, Yash Kumar, and Tan-
  moy Chakraborty. 2020a. Corpora evaluation and          Prakhar Ganesh and Saket Dingliwal. 2019. Abstrac-
  system bias detection in multi-document summariza-        tive summarization of spoken and written conversa-
  tion. In Findings of the Association for Computa-         tion. CoRR, abs/1902.01615.
  tional Linguistics: EMNLP 2020, pages 2830–2840,
  Online. Association for Computational Linguistics.      Demian Gholipour Ghalandari, Chris Hokamp,
                                                            Nghia The Pham, John Glover, and Georgiana Ifrim.
Alvin Dey, Tanya Chowdhury, Yash Kumar, and Tan-            2020. A large-scale multi-document summarization
  moy Chakraborty. 2020b. Corpora evaluation and            dataset from the Wikipedia current events portal.
  system bias detection in multi document summariza-        In Proceedings of the 58th Annual Meeting of the
  tion. In Proceedings of the 2020 Conference on            Association for Computational Linguistics, pages
  Empirical Methods in Natural Language Processing:         1302–1308, Online. Association for Computational
  Findings, pages 2830–2840, Online. Association for        Linguistics.
  Computational Linguistics.                              Dan Gillick and Yang Liu. 2010. Non-expert evalua-
                                                            tion of summarization systems is risky. In Proceed-
Günes Erkan and Dragomir R Radev. 2004. Lexrank:           ings of the NAACL HLT 2010 Workshop on Creating
  Graph-based lexical centrality as salience in text        Speech and Language Data with Amazon’s Mechan-
  summarization. Journal of artificial intelligence re-     ical Turk, pages 148–151, Los Angeles. Association
  search, 22:457–479.                                       for Computational Linguistics.
Alexander Fabbri, Irene Li, Tianwei She, Suyi Li, and     Bogdan Gliwa, Iwona Mochol, Maciej Biesek, and
  Dragomir Radev. 2019. Multi-news: A large-scale           Aleksander Wawer. 2019a. SAMSum corpus: A
  multi-document summarization dataset and abstrac-         human-annotated dialogue dataset for abstractive
  tive hierarchical model. In Proceedings of the 57th       summarization. In Proceedings of the 2nd Workshop
  Annual Meeting of the Association for Computa-            on New Frontiers in Summarization, pages 70–79,
  tional Linguistics, pages 1074–1084, Florence, Italy.     Hong Kong, China. Association for Computational
  Association for Computational Linguistics.                Linguistics.
Alexander R Fabbri, Simeng Han, Haoyuan Li, Haoran        Bogdan Gliwa, Iwona Mochol, Maciej Biesek, and
  Li, Marjan Ghazvininejad, Shafiq Joty, Dragomir           Aleksander Wawer. 2019b. SAMSum corpus: A
  Radev, and Yashar Mehdad. 2020a. Improving zero           human-annotated dialogue dataset for abstractive

                                                     6876
summarization. In Proceedings of the 2nd Workshop        9th International Joint Conference on Natural Lan-
  on New Frontiers in Summarization, pages 70–79,          guage Processing (EMNLP-IJCNLP), pages 540–
  Hong Kong, China. Association for Computational          551, Hong Kong, China. Association for Computa-
  Linguistics.                                             tional Linguistics.

Chih-Wen Goo and Yun-Nung Chen. 2018a. Ab-              Philippe Laban, Andrew Hsi, John Canny, and Marti A.
  stractive dialogue summarization with sentence-         Hearst. 2020. The summary loop: Learning to write
  gated modeling optimized by dialogue acts. In           abstractive summaries without examples. In Pro-
  2018 IEEE Spoken Language Technology Workshop           ceedings of the 58th Annual Meeting of the Asso-
  (SLT), pages 735–742. IEEE.                             ciation for Computational Linguistics, pages 5135–
                                                          5150, Online. Association for Computational Lin-
Chih-Wen Goo and Yun-Nung Chen. 2018b. Abstrac-           guistics.
  tive dialogue summarization with sentence-gated
  modeling optimized by dialogue acts.     CoRR,        Mirko Lenz, Premtim Sahitaj, Sean Kallenberg,
  abs/1809.05715.                                         Christopher Coors, Lorik Dumani, Ralf Schenkel,
                                                          and Ralph Bergmann. 2020. Towards an argu-
Xiaotao Gu, Yuning Mao, Jiawei Han, Jialu Liu, You        ment mining pipeline transforming texts to argument
  Wu, Cong Yu, Daniel Finnie, Hongkun Yu, Jiaqi           graphs. arXiv preprint arXiv:2006.04562.
  Zhai, and Nicholas Zukoski. 2020. Generating rep-
  resentative headlines for news stories. In WWW ’20:   Mike Lewis, Yinhan Liu, Naman Goyal, Mar-
  The Web Conference 2020, Taipei, Taiwan, April 20-      jan Ghazvininejad, Abdelrahman Mohamed, Omer
  24, 2020, pages 1773–1784. ACM / IW3C2.                 Levy, Veselin Stoyanov, and Luke Zettlemoyer.
                                                          2020. BART: Denoising sequence-to-sequence pre-
Doris Hoogeveen, Karin M Verspoor, and Timothy            training for natural language generation, translation,
  Baldwin. 2015. Cqadupstack: A benchmark data            and comprehension. In Proceedings of the 58th An-
  set for community question-answering research. In       nual Meeting of the Association for Computational
  Proceedings of the 20th Australasian document com-      Linguistics, pages 7871–7880, Online. Association
  puting symposium, pages 1–8.                            for Computational Linguistics.

Xinyu Hua and Lu Wang. 2017. A pilot study of do-       Chin-Yew Lin. 2004. ROUGE: A package for auto-
  main adaptation effect for neural abstractive sum-      matic evaluation of summaries. In Text Summariza-
  marization. In Proceedings of the Workshop on           tion Branches Out, pages 74–81, Barcelona, Spain.
  New Frontiers in Summarization, pages 100–106,          Association for Computational Linguistics.
  Copenhagen, Denmark. Association for Computa-         Chunyi Liu, Peng Wang, Jiang Xu, Zang Li, and
  tional Linguistics.                                     Jieping Ye. 2019a. Automatic dialogue summary
                                                          generation for customer service. In Proceedings of
Adam Janin, Don Baron, Jane Edwards, Dan Ellis,
                                                          the 25th ACM SIGKDD International Conference on
  David Gelbart, Nelson Morgan, Barbara Peskin,
                                                          Knowledge Discovery & Data Mining, KDD 2019,
  Thilo Pfau, Elizabeth Shriberg, Andreas Stolcke,
                                                          Anchorage, AK, USA, August 4-8, 2019, pages 1957–
  et al. 2003. The icsi meeting corpus. In 2003 IEEE
                                                          1965. ACM.
  International Conference on Acoustics, Speech, and
  Signal Processing, 2003. Proceedings.(ICASSP’03).,    Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben
  volume 1, pages I–I. IEEE.                              Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam
                                                          Shazeer. 2018. Generating wikipedia by summariz-
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A            ing long sequences. In 6th International Conference
  method for stochastic optimization. In 3rd Inter-       on Learning Representations, ICLR 2018, Vancou-
  national Conference on Learning Representations,        ver, BC, Canada, April 30 - May 3, 2018, Confer-
  ICLR 2015, San Diego, CA, USA, May 7-9, 2015,           ence Track Proceedings. OpenReview.net.
  Conference Track Proceedings.
                                                        Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man-
Varada Kolhatkar and Maite Taboada. 2017. Using           dar Joshi, Danqi Chen, Omer Levy, Mike Lewis,
  New York Times picks to identify constructive com-      Luke Zettlemoyer, and Veselin Stoyanov. 2019b.
  ments. In Proceedings of the 2017 EMNLP Work-           Roberta: A robustly optimized bert pretraining ap-
  shop: Natural Language Processing meets Journal-        proach. arXiv preprint arXiv:1907.11692.
  ism, pages 100–105, Copenhagen, Denmark. Asso-
  ciation for Computational Linguistics.                Zhengyuan Liu, Angela Ng, Sheldon Lee Shao Guang,
                                                          Ai Ti Aw, and Nancy F. Chen. 2019c. Topic-aware
Wessel Kraaij, Thomas Hain, Mike Lincoln, and Wil-        pointer-generator networks for summarizing spoken
  fried Post. 2005. The ami meeting corpus.               conversations. CoRR, abs/1910.01335.
Wojciech Kryscinski, Nitish Shirish Keskar, Bryan Mc-   Rada Mihalcea and Paul Tarau. 2004. TextRank:
 Cann, Caiming Xiong, and Richard Socher. 2019.           Bringing order into text. In Proceedings of the 2004
 Neural text summarization: A critical evaluation. In     Conference on Empirical Methods in Natural Lan-
 Proceedings of the 2019 Conference on Empirical          guage Processing, pages 404–411, Barcelona, Spain.
 Methods in Natural Language Processing and the           Association for Computational Linguistics.

                                                    6877
Derek Miller. 2019. Leveraging bert for extractive          Shasha Xie, Yang Liu, and Hui Lin. 2008. Evaluating
  text summarization on lectures. arXiv preprint              the effectiveness of features and sampling in extrac-
  arXiv:1906.04165.                                           tive meeting summarization. In 2008 IEEE Spoken
                                                              Language Technology Workshop, pages 157–160.
Amita Misra, Pranav Anand, Jean E. Fox Tree, and
 Marilyn Walker. 2015. Using summarization to dis-          Christian Stab and Iryna Gurevych. 2014. Identifying
 cover argument facets in online idealogical dialog.          argumentative discourse structures in persuasive es-
 In Proceedings of the 2015 Conference of the North           says. In Proceedings of the 2014 Conference on
 American Chapter of the Association for Computa-             Empirical Methods in Natural Language Processing
 tional Linguistics: Human Language Technologies,             (EMNLP), pages 46–56, Doha, Qatar. Association
 pages 430–440, Denver, Colorado. Association for             for Computational Linguistics.
 Computational Linguistics.
                                                            Manfred Stede, Stergos Afantenos, Andreas Peldszus,
Amita Misra, Pranav Anand, Jean E Fox Tree, and Mar-         Nicholas Asher, and Jérémy Perret. 2016. Parallel
 ilyn Walker. 2017. Using summarization to discover          discourse annotations on a corpus of short texts. In
 argument facets in online ideological dialog. arXiv         Proceedings of the Tenth International Conference
 preprint arXiv:1709.00662.                                  on Language Resources and Evaluation (LREC’16),
                                                             pages 1051–1058, Portorož, Slovenia. European
Ramesh Nallapati, Bowen Zhou, Cicero dos Santos,
                                                             Language Resources Association (ELRA).
  Çağlar Gulçehre, and Bing Xiang. 2016. Abstrac-
  tive text summarization using sequence-to-sequence        Mattia Tomasoni and Minlie Huang. 2010. Metadata-
  RNNs and beyond. In Proceedings of The 20th                aware measures for answer summarization in com-
  SIGNLL Conference on Computational Natural Lan-            munity question answering. In Proceedings of the
  guage Learning, pages 280–290, Berlin, Germany.            48th Annual Meeting of the Association for Compu-
  Association for Computational Linguistics.                 tational Linguistics, pages 760–769, Uppsala, Swe-
Shashi Narayan, Shay B. Cohen, and Mirella Lapata.           den. Association for Computational Linguistics.
  2018. Don’t give me the details, just the summary!        J. Ulrich, G. Murray, and G. Carenini. 2008. A publicly
  topic-aware convolutional neural networks for ex-            available annotated corpus for supervised email sum-
  treme summarization. In Proceedings of the 2018              marization. In AAAI08 EMAIL Workshop, Chicago,
  Conference on Empirical Methods in Natural Lan-              USA. AAAI.
  guage Processing, pages 1797–1807, Brussels, Bel-
  gium. Association for Computational Linguistics.          Jan Ulrich, Giuseppe Carenini, Gabriel Murray, and
Myle Ott, Sergey Edunov, Alexei Baevski, Angela                Raymond Ng. 2009. Regression-based summariza-
 Fan, Sam Gross, Nathan Ng, David Grangier, and                tion of email conversations. In Proceedings of the
 Michael Auli. 2019. fairseq: A fast, extensible              International AAAI Conference on Web and Social
 toolkit for sequence modeling. In Proceedings of             Media, volume 3.
 the 2019 Conference of the North American Chap-            Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
 ter of the Association for Computational Linguistics         Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz
 (Demonstrations), pages 48–53, Minneapolis, Min-             Kaiser, and Illia Polosukhin. 2017. Attention is all
 nesota. Association for Computational Linguistics.           you need. In Advances in Neural Information Pro-
Tatsuro Oya, Yashar Mehdad, Giuseppe Carenini, and            cessing Systems 30: Annual Conference on Neural
  Raymond Ng. 2014. A template-based abstractive              Information Processing Systems 2017, December 4-
  meeting summarization: Leveraging summary and               9, 2017, Long Beach, CA, USA, pages 5998–6008.
  source text relationships. In Proceedings of the 8th
                                                            Danqing Wang, Pengfei Liu, Ming Zhong, Jie Fu,
  International Natural Language Generation Confer-
                                                              Xipeng Qiu, and Xuanjing Huang. 2019. Exploring
  ence (INLG), pages 45–53, Philadelphia, Pennsylva-
                                                              domain shift in extractive text summarization.
  nia, U.S.A. Association for Computational Linguis-
  tics.                                                     Lu Wang and Claire Cardie. 2013.             Domain-
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine           independent abstract generation for focused meeting
  Lee, Sharan Narang, Michael Matena, Yanqi Zhou,             summarization. In Proceedings of the 51st Annual
  Wei Li, and Peter J Liu. 2019. Exploring the limits         Meeting of the Association for Computational Lin-
  of transfer learning with a unified text-to-text trans-     guistics (Volume 1: Long Papers), pages 1395–1405,
  former. arXiv preprint arXiv:1910.10683.                    Sofia, Bulgaria. Association for Computational Lin-
                                                              guistics.
Leonardo FR Ribeiro, Martin Schmitt, Hinrich Schütze,
  and Iryna Gurevych. 2020. Investigating pretrained        Adina Williams, Nikita Nangia, and Samuel Bowman.
  language models for graph-to-text generation. arXiv         2018. A broad-coverage challenge corpus for sen-
  preprint arXiv:2007.08426.                                  tence understanding through inference. In Proceed-
                                                              ings of the 2018 Conference of the North American
Evan Sandhaus. 2008. The new york times annotated             Chapter of the Association for Computational Lin-
  corpus. Linguistic Data Consortium, Philadelphia,           guistics: Human Language Technologies, Volume
  6(12):e26752.                                               1 (Long Papers), pages 1112–1122, New Orleans,

                                                       6878
You can also read