CLIFF: Contrastive Learning for Improving Faithfulness and Factuality in Abstractive Summarization

Shuyang Cao and Lu Wang
Computer Science and Engineering, University of Michigan, Ann Arbor, MI
{caoshuy, wangluxy}@umich.edu

arXiv:2109.09209v1 [cs.CL] 19 Sep 2021

Abstract

We study generating abstractive summaries that are faithful and factually consistent with the given articles. A novel contrastive learning formulation is presented, which leverages both reference summaries, as positive training data, and automatically generated erroneous summaries, as negative training data, to train summarization systems that are better at distinguishing between them. We further design four types of strategies for creating negative samples, to resemble errors made commonly by two state-of-the-art models, BART and PEGASUS, found in our new human annotations of summary errors. Experiments on XSum and CNN/Daily Mail show that our contrastive learning framework is robust across datasets and models. It consistently produces more factual summaries than strong comparisons with post error correction, entailment-based reranking, and unlikelihood training, according to QA-based factuality evaluation. Human judges echo the observation and find that our model summaries correct more errors.

XSum Article: The Fermanagh MLA Phil Flanagan tweeted after Tom Elliott appeared on a BBC radio programme in May 2014. ... “I wonder if he will reveal how many people he harassed and shot as a member of the UDR.” ...
Contrastive learning (our method): A Sinn Féin MLA has been ordered to apologise and pay compensation to a former member of the Ulster Defence Regiment (UDR).
Cross-entropy: A Sinn Féin MLA has agreed to pay compensation to a former Ulster Unionist Party (UDR) MP after he tweeted that he had harassed and shot people as a member of the party.
Entailment reranking: A Sinn Féin MLA has agreed to pay compensation to a former Ulster Unionist Party (UDR) councillor for a tweet he sent about him.
Unlikelihood: An MLA has been ordered to apologise and pay compensation to a former loyalist MP for a remark he made about him while serving in the Ministry of Defence.

Figure 1: Sample article and system summaries by different methods. Our contrastive learning model trained on low confidence system outputs correctly generates the full name. Comparisons using cross-entropy loss, beam reranking by entailment scores (Kryscinski et al., 2020), and unlikelihood objective (Welleck et al., 2020) over negative samples all produce unfaithful content.

1   Introduction

Large pre-trained Transformers have yielded remarkable performance on abstractive summarization (Liu and Lapata, 2019; Lewis et al., 2020; Zhang et al., 2020a) with impeccable fluency, yet their summaries often contain factually inconsistent content (Maynez et al., 2020; Zhang et al., 2020b; Goyal and Durrett, 2020), even for state-of-the-art models. Three types of remedies have been proposed: running a separately learned error correction component (Dong et al., 2020), removing noisy training samples (Nan et al., 2021; Goyal and Durrett, 2021), and modifying the Transformer architecture (Huang et al., 2020; Zhu et al., 2021). Yet they either rely on heuristically created data for error handling, falling short of generalization, or require learning a large number of new parameters, and summary informativeness is often sacrificed.

Our goal is to train abstractive summarization systems that generate both faithful and informative summaries in an end-to-end fashion. We observe that, while the commonly used maximum likelihood training optimizes over references, there is no guarantee for the model to distinguish references from incorrect generations (Holtzman et al., 2020; Welleck et al., 2020). Therefore, potential solutions reside in designing new learning objectives that can effectively inform preferences of factual summaries over incorrect ones.
Concretely, we hypothesize that including factually inconsistent summaries (i.e., negative samples) for training, in addition to references (i.e., positive samples), lets models become better at differentiating these two types of summaries. Although using negative samples has been effective for text representation learning, e.g., word2vec (Mikolov et al., 2013) and BERT (Devlin et al., 2019), there exist two major challenges for it to succeed in concrete language tasks. First, a suitable training objective is critical to avoid performance degradation (Saunshi et al., 2019). Second, it is nontrivial to construct "natural" samples that mimic the diverse errors made by state-of-the-art systems, which vary in words and syntax (Goyal and Durrett, 2021).

To address both challenges, we first propose a new framework, CLIFF, that uses contrastive learning for improving faithfulness and factuality of the generated summaries.¹ Contrastive learning (CL) has obtained impressive results on many visual processing tasks, such as image classification (Khosla et al., 2020; Chen et al., 2020) and synthesis (Park et al., 2020; Zhang et al., 2021b). Intuitively, CL improves representation learning by compacting positive samples while contrasting them with negative samples. Here, we design a task-specific CL formulation that teaches a summarizer to expand the margin between factually consistent summaries and their incorrect peers.

Moreover, we design four types of strategies with different variants to construct negative samples, by editing reference summaries via rewriting entity-/relation-anchored text, and by using system generated summaries that may contain unfaithful errors. Importantly, these strategies are inspired by our new annotation study on errors made by state-of-the-art summarizers, i.e., models fine-tuned from BART (Lewis et al., 2020) and PEGASUS (Zhang et al., 2020a), on two benchmarks: XSum (Narayan et al., 2018) and CNN/DailyMail (CNN/DM) (Hermann et al., 2015).

We fine-tune pre-trained large models with our contrastive learning objective on XSum and CNN/DM. Results based on QuestEval (Scialom et al., 2021), a QA-based factuality metric with high correlation with human judgments, show that our models trained with different types of negative samples uniformly outperform strong comparisons, including using a summarizer with post error correction and reranking beams based on entailment scores to the source. Moreover, compared with the unlikelihood training method that penalizes the same negative samples (Welleck et al., 2020), our summaries also obtain consistently better QuestEval scores. Human evaluation further confirms that our models consistently reduce both extrinsic and intrinsic errors over the baseline across datasets.

¹ Our code and annotated data are available at https://shuyangcao.github.io/projects/cliff_summ.

2   Related Work

Factuality Improvement and Evaluation. Neural abstractive summaries often contain unfaithful content with regard to the source (Falke et al., 2019). To improve summary factuality, three major types of approaches have been proposed. First, a separate correction model is learned to fix errors made by the summarizers (Zhao et al., 2020; Chen et al., 2021), including replacing entities absent from the source (Dong et al., 2020) or revising all possible errors (Cao et al., 2020). The second type targets modifying the sequence-to-sequence architecture to incorporate relation triplets (Cao et al., 2018), knowledge graphs (Zhu et al., 2021), and topics (Aralikatte et al., 2021) to inform the summarizers of article facts. Yet additional engineering efforts and model retraining are often needed. Finally, discarding noisy samples from model training has also been investigated (Nan et al., 2021; Goyal and Durrett, 2021); however, it often leads to degraded summary informativeness. In comparison, our contrastive learning framework allows the model to be trained end-to-end and does not require model modification, thus providing a general solution for learning summarization systems.

Alongside improving factuality, we have also witnessed growing interest in automated factuality evaluation, since popular word-matching-based metrics, e.g., ROUGE, correlate poorly with human-rated factual consistency levels (Gabriel et al., 2021; Fabbri et al., 2021). Entailment-based scorers are designed at the summary level (Kryscinski et al., 2020) and the finer-grained dependency relation level (Goyal and Durrett, 2020). QA models are employed to measure content consistency by reading the articles to answer questions generated from the summaries (Wang et al., 2020; Durmus et al., 2020), or by considering the summaries for addressing questions derived from the source (Scialom et al., 2019). Though not focusing on evaluation, our work highlights that models can produce a significant amount of world knowledge, which should be evaluated differently instead of as extrinsic hallucination (Maynez et al., 2020). We also show that world knowledge can possibly be distinguished from errors via model behavior understanding.
Training with negative samples has been investigated in several classic NLP tasks, such as grammatical error detection (Foster and Andersen, 2009) and dialogue systems (Li et al., 2019). Notably, negative sampling plays a key role in word representation learning (Mikolov et al., 2013) and in training large masked language models, such as BERT and ALBERT, to induce better contextual representations (Devlin et al., 2019; Lan et al., 2020). For text generation tasks, unlikelihood training is proposed to penalize the generation of negative tokens (e.g., repeated words) and sentences (e.g., contradictory responses in a dialogue system) (Welleck et al., 2020; Li et al., 2020; He and Glass, 2020). We use contrastive learning that drives enhanced representation learning to better distinguish between factual and incorrect summaries, which encourages more faithful summary generation.

Contrastive Learning (CL) for NLP. CL has been a popular method for representation learning, especially for vision understanding (Hjelm et al., 2019; Chen et al., 2020). Only recently has CL been used for training language models with self-supervision (Fang et al., 2020), learning sentence representations (Gao et al., 2021), and improving document clustering (Zhang et al., 2021a). With a supervised setup, Gunel et al. (2021) adopt the contrastive objective to fine-tune pre-trained models on benchmark language understanding datasets. Using a similar idea, Liu and Liu (2021) enlarge the distances among summaries of different quality as measured by ROUGE scores.

3   CLIFF: Contrastive Learning Framework for Summarization

We design a contrastive learning (CL)-based training objective that drives the summarization model to learn a preference of faithful summaries over summaries with factual errors. It is then used for fine-tuning BART (Lewis et al., 2020) and PEGASUS (Zhang et al., 2020a) for training summarization models. Formally, let an article x have a set of reference summaries P (henceforth positive samples) and another set of erroneous summaries N (negative samples). The contrastive learning objective is (Khosla et al., 2020; Gunel et al., 2021):

$$l_{cl}^{x} = -\frac{1}{\binom{|P|}{2}} \sum_{\substack{y_i, y_j \in P \\ y_j \neq y_i}} \log \frac{\exp(\mathrm{sim}(h_i, h_j)/\tau)}{\sum_{\substack{y_k \in P \cup N \\ y_k \neq y_i}} \exp(\mathrm{sim}(h_i, h_k)/\tau)} \qquad (1)$$

where h_i, h_j, and h_k are representations for summaries y_i, y_j, and y_k. sim(·, ·) calculates the cosine similarity between summary representations. τ is a temperature and is set to 1.0.

Importantly, summaries in P and N are included in the same batch during training, so that the model acquires better representations to differentiate correct summaries from those with errors by comparing the two types of samples, thus maximizing the probabilities of the positive samples and minimizing the likelihoods of the corresponding negative samples. The CL objective on the full training set, denoted as $L_{CL}$, is the sum of losses $l_{cl}^{x}$ over all samples.

To effectively employ CL in summarization, we need to address two challenges: (1) how to automatically construct both positive and negative samples, which are critical for CL efficacy (Chen et al., 2020), and (2) how to represent the summaries (i.e., h*). Below we describe positive sample generation and options for h*, leaving the strategies for negative samples to § 5.

Positive Sample Construction (P). Summarization datasets often contain a single reference for each article. To create multiple positive samples, in our pilot study, we experiment with paraphrasing with synonym substitution (Ren et al., 2019), randomly replacing words based on the prediction of masked language models (Kobayashi, 2018), and back-translation (Mallinson et al., 2017). We find back-translation to be best at preserving meaning and offering language variation, and thus use NLPAug² to translate each reference to German and back to English. Together with the reference, the best translation is kept and added to P, if no new named entity is introduced.

² https://github.com/makcedward/nlpaug
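As an illustration of this step, the sketch below back-translates a reference through German with Hugging Face MarianMT checkpoints and applies the new-entity check with spaCy. The paper itself relies on NLPAug for translation, so the model names and helper functions here are assumptions for the sketch, not the released pipeline.

```python
# Sketch: build a back-translated positive sample (EN -> DE -> EN) and keep it only if
# it introduces no new named entity. MarianMT checkpoints stand in for NLPAug here.
import spacy
from transformers import MarianMTModel, MarianTokenizer

nlp = spacy.load("en_core_web_sm")
EN_DE, DE_EN = "Helsinki-NLP/opus-mt-en-de", "Helsinki-NLP/opus-mt-de-en"
tok_fwd, mt_fwd = MarianTokenizer.from_pretrained(EN_DE), MarianMTModel.from_pretrained(EN_DE)
tok_bwd, mt_bwd = MarianTokenizer.from_pretrained(DE_EN), MarianMTModel.from_pretrained(DE_EN)

def translate(text, tok, model):
    batch = tok([text], return_tensors="pt", truncation=True)
    out = model.generate(**batch, num_beams=4, max_length=128)
    return tok.decode(out[0], skip_special_tokens=True)

def positive_samples(reference):
    paraphrase = translate(translate(reference, tok_fwd, mt_fwd), tok_bwd, mt_bwd)
    new_entities = {e.text for e in nlp(paraphrase).ents} - {e.text for e in nlp(reference).ents}
    # Discard the paraphrase if back-translation introduced a new named entity.
    return [reference, paraphrase] if not new_entities else [reference]
```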
Summary Representation (h*). We use the outputs of the decoder's last layer, and investigate three options that average over all tokens, named entity tokens, or the last token of the decoded summary. Entities and other parsing results are obtained by spaCy (Honnibal et al., 2020). We further consider adding a multi-layer perceptron (MLP) with one hidden layer to calculate the final h*.

The final training objective combines the typical cross-entropy loss $L_{CE}$ and our contrastive learning objective: $L = L_{CE} + \lambda L_{CL}$, where λ is a scalar and set to 1.0 for all experiments.
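Putting § 3 together, here is a minimal PyTorch sketch of the training loss: summary representations are mean-pooled decoder states passed through an MLP, an in-batch contrastive term follows the form of Eq. (1), and the result is added to cross-entropy with λ = 1. Tensor layouts and helper names are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of the CLIFF loss: L = L_CE + lambda * L_CL, with summary
# representations taken as mean-pooled decoder states passed through an MLP.
# Shapes: decoder_states [num_summaries, seq_len, hidden]; mask [num_summaries, seq_len];
# is_positive [num_summaries] marks members of P; all summaries share one article x.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SummaryProjector(nn.Module):
    """Average decoder outputs over tokens, then apply a one-hidden-layer MLP."""
    def __init__(self, hidden_size):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.ReLU(),
                                 nn.Linear(hidden_size, hidden_size))

    def forward(self, decoder_states, mask):
        mask = mask.float()
        pooled = (decoder_states * mask.unsqueeze(-1)).sum(1) / mask.sum(1, keepdim=True)
        return self.mlp(pooled)

def contrastive_loss(h, is_positive, tau=1.0):
    """In-batch loss following the form of Eq. (1); here we average over ordered positive pairs."""
    h = F.normalize(h, dim=-1)
    sim = (h @ h.t()) / tau                           # cosine similarities between all summaries
    sim.fill_diagonal_(float("-inf"))                 # enforce y_k != y_i in the denominator
    log_prob = sim - torch.logsumexp(sim, dim=-1, keepdim=True)
    pos = is_positive.float()
    pair_mask = pos.unsqueeze(0) * pos.unsqueeze(1)   # (y_i, y_j) both in P
    pair_mask.fill_diagonal_(0.0)                     # enforce y_j != y_i in the numerator
    terms = torch.where(pair_mask.bool(), log_prob, torch.zeros_like(log_prob))
    return -terms.sum() / pair_mask.sum().clamp(min=1.0)

def cliff_loss(ce_loss, decoder_states, mask, is_positive, projector, lam=1.0):
    h = projector(decoder_states, mask)
    return ce_loss + lam * contrastive_loss(h, is_positive)
```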
Figure 2: Percentage of samples with intrinsic and extrinsic error spans for models fine-tuned from BART and PEGASUS on XSum and CNN/DM.

Figure 3: Probability distributions of generating the first tokens of proper nouns (PROPN) and numbers (NUM), grouped by extrinsic errors, intrinsic errors, world knowledge, and other correct tokens.

4   Summary Error Annotation and Model Behavior Analysis

We first describe annotating unfaithfulness errors made by state-of-the-art systems, i.e., models fine-tuned from BART and PEGASUS on XSum and CNN/DM. We then probe into the model generation behavior that is indicative of errors, which guides the design of negative sample construction strategies.

600 (150 × 2 × 2) summaries are annotated to demonstrate how often the models "hallucinate", i.e., generate content not grounded by the source. To characterize errors, we annotate text spans in summaries with (i) intrinsic errors, caused by misconstructing phrases or clauses from the source; and (ii) extrinsic errors, which include words not in the source that are either unverifiable or cannot be verified by Wikipedia. Content not covered by the article but that can be validated by Wikipedia is annotated as world knowledge, and the models' behavior pattern when generating it differs from when they generate errors.

Two fluent English speakers with extensive experience in summary evaluation and error labeling are hired. For each sample, they are shown the article and two system summaries, and instructed to annotate text spans with the aforementioned errors and world knowledge. After labeling every 50 samples, the annotators discuss and resolve any disagreement. The Fleiss's Kappas on XSum and CNN/DM are 0.35 and 0.45.

Error statistics are displayed in Fig. 2. Extrinsic errors dominate both datasets, especially on XSum: 58.7% of summaries by BART (and 44.0% by PEGASUS) contain at least one extrinsic error. Noticeably, PEGASUS is a newer model pre-trained with a larger amount of data, and thus contains fewer errors than BART and other older models studied for error annotations by Maynez et al. (2020). This observation also highlights the usage of our annotations for future development and evaluation of summarization models.

Low confidence generation is indicative of extrinsic errors. Inspired by recent work that studies model prediction confidence (Liu et al., 2021), we examine generation probabilities for tokens of different part-of-speech (POS) tags. Fig. 3 shows salient results on the generation probabilities of the first token of a proper noun or a number (with additional analysis provided in Appendix A). As observed, model confidence tends to be lower for the first tokens of proper nouns and numbers if they are part of spans with extrinsic errors. Also note that world knowledge, which cannot be inferred from the source either, often has a higher generation probability than extrinsic errors. Take this snippet generated by a fine-tuned BART as an example: "Manchester United captain Wayne Rooney's testimonial game against Manchester City...". "Manchester City" is an extrinsic error and "Wayne" is produced as world knowledge. The model assigns a low probability of 0.10 to the first token of "Manchester City" and a high probability of 0.92 to the token "Wayne". This implies that model confidence can be a useful indicator for negative sample collection.

5   Negative Sample Construction

Here we describe four strategies for constructing negative samples that modify the references (§ 5.1-5.3) or use system generated summaries (§ 5.4).

5.1   Entity Swap

Entity swap imitates intrinsic errors, as over 55% of intrinsic errors in our annotations are found to contain named entities. We construct negative samples by swapping named entities in the references with other randomly selected entities of the same entity type in the source (SWAPENT).
One sample is constructed for each entity in the reference. Though this idea has been studied by Kryscinski et al. (2020), they allow entities of different types to be used, e.g., a PERSON can be replaced by a LOCATION. Examples are displayed in Table 1.

REFERENCE: A “rare” short-eared owl found emaciated in Flintshire is now recuperating well, the RSPCA have said.

SWAPENT: Flintshire → Bettisfield
⇒ A “rare” short-eared owl found emaciated in Bettisfield is now recuperating well, the RSPCA have said.

MASKENT: A “rare” short-eared owl found emaciated in [MASK] is now recuperating well, the RSPCA have said.
⇒ A “rare” short-eared owl found emaciated in a field in South Yorkshire is now recuperating well, the RSPCA have said.

MASKREL: A “rare” short-eared owl found [MASK] in [MASK] is now recuperating well, the RSPCA have said.
⇒ A “rare” short-eared owl found dead in London is now recuperating well, the RSPCA have said.

REGENENT: A “rare” short-eared owl found emaciated in
⇒ A “rare” short-eared owl found emaciated in Nottinghamshire is now at a wildlife centre to recover.

REGENREL: A “rare” short-eared owl found
⇒ A “rare” short-eared owl found in the grounds of a former coal mine is being cared for by the RSPCA in Somerset.

SYSLOWCON: An injured golden owl found in a former coal mine in Lancashire is being cared for by the RSPCA.

Table 1: Negative sample construction strategies (§ 5). For summaries edited from the reference, their differences are bolded. Introduced errors are in red. Text before ⇒ is the prefix for regeneration.

SWAPENT has the advantage of not depending on any trained model. Yet it only introduces intrinsic errors and lacks coverage of extrinsic errors, which is addressed by the following generation-based methods.
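A minimal sketch of SWAPENT with spaCy is shown below; the function name and sampling details are our own illustration of the described procedure.

```python
# Sketch: swap each reference entity with a random source entity of the same type.
import random
import spacy

nlp = spacy.load("en_core_web_sm")

def swap_ent_negatives(source, reference, seed=0):
    rng = random.Random(seed)
    src_by_type = {}
    for ent in nlp(source).ents:
        src_by_type.setdefault(ent.label_, set()).add(ent.text)
    negatives = []
    for ent in nlp(reference).ents:
        candidates = [e for e in src_by_type.get(ent.label_, ()) if e != ent.text]
        if not candidates:
            continue  # no same-type source entity to swap in
        replacement = rng.choice(candidates)
        # One negative sample per reference entity.
        negatives.append(reference.replace(ent.text, replacement, 1))
    return negatives
```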
5.2   Mask-and-fill with BART

To simulate extrinsic errors, we leverage large unconditional language models' capability of converting a sequence with masked tokens into a fluent and appropriate sequence. Specifically, we replace each named entity in a reference with a [MASK] token and encode it with BART (without any fine-tuning). BART then fills this partially masked summary with newly generated entities (MASKENT). BART is chosen since it can fill [MASK] with a varying number of tokens. For each entity in the reference, we sample three summaries and only retain the ones containing at least one entity that is absent from both the source and the reference.
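The sketch below approximates MASKENT with an off-the-shelf pre-trained BART from Hugging Face: entities are replaced with the tokenizer's mask token and the partially masked summary is re-generated by sampling. Decoding settings and the filtering helper are assumptions for illustration.

```python
# Sketch: MASKENT -- mask reference entities and let pre-trained BART re-fill them.
import spacy
from transformers import BartForConditionalGeneration, BartTokenizer

nlp = spacy.load("en_core_web_sm")
tok = BartTokenizer.from_pretrained("facebook/bart-large")
bart = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

def mask_ent_negatives(source, reference, num_samples=3):
    negatives = []
    known_ents = {e.text for e in nlp(source).ents} | {e.text for e in nlp(reference).ents}
    for ent in nlp(reference).ents:
        masked = reference.replace(ent.text, tok.mask_token, 1)   # BART's mask token is "<mask>"
        inputs = tok(masked, return_tensors="pt", truncation=True)
        outputs = bart.generate(**inputs, do_sample=True, top_p=0.9,
                                num_return_sequences=num_samples, max_length=64)
        for seq in outputs:
            filled = tok.decode(seq, skip_special_tokens=True)
            new_ents = {e.text for e in nlp(filled).ents} - known_ents
            if new_ents:  # keep only samples that introduce at least one new entity
                negatives.append(filled)
    return negatives
```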
Up to now, the two introduced strategies both focus on incorrect named entities. To cover more diverse extrinsic and intrinsic errors (Goyal and Durrett, 2020), we extend MASKENT to cover relations (MASKREL). We first obtain dependency relations using Stanza (Qi et al., 2020), with each relation denoted as ⟨gov, relation, dep⟩. To incorporate more context, we consider noun phrase spans enclosing the token of gov or dep if it is a content word and the noun phrase contains a named entity. Similar to MASKENT, three negative samples are generated by BART based on the input with both gov and dep spans masked in the reference. Only the samples that introduce a new dependency relation contained in neither the source nor the reference are kept. Specifically, a dependency relation is considered matched if the same form or a synonym of its gov and dep is found in the source or the reference with the same relation.

Both MASKENT and MASKREL can create more extrinsic errors compared to the other strategies introduced in this section, since negative samples are generated without being grounded on the source articles. However, their constructed negative samples may contain drifted topics that can be easily detected by a summarization model, resulting in less efficient training signals.

5.3   Source-conditioned Regeneration

To ground negative sample generation in the article, we further design a regeneration strategy based on conditional generation. For each named entity in the reference, we treat the text before it as a prompt. A summarizer, e.g., fine-tuned from BART or PEGASUS, first reads in the source using the encoder, then receives the prompt as the first part of the decoder output, and finally decodes the rest of the content based on nucleus sampling (Holtzman et al., 2020) with a cumulative probability threshold of 0.7. The prompt and the regenerated text comprise the final negative sample. This method is denoted as REGENENT.

We also extend entities to relations with expanded governor and dependent spans (REGENREL). Here, we consider a prompt as the text before the gov or dep span, whichever occurs first. For both REGENENT and REGENREL, three negative samples are generated for each prompt, and a sample is kept if it introduces any new entity (for REGENENT) or dependency relation (for REGENREL) with regard to the source and the reference.
Negative samples generated by both methods are more relevant to the article than those from the mask-and-fill strategy, yet they may still miss certain types of errors and differ from real model outputs, since they are modified from the reference summaries.
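A possible implementation of REGENENT with a Hugging Face seq2seq summarizer is sketched below: the encoder reads the source, the decoder is seeded with the reference prefix, and the continuation is sampled with nucleus p = 0.7. The checkpoint name and generation arguments are assumptions and may need adjustment across library versions.

```python
# Sketch: REGENENT -- regenerate the reference from the text preceding an entity,
# conditioning on the source article and sampling with nucleus p = 0.7.
import spacy
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

nlp = spacy.load("en_core_web_sm")
tok = AutoTokenizer.from_pretrained("facebook/bart-large-xsum")   # assumed checkpoint
summ = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-xsum")

def regen_ent_negatives(source, reference, num_samples=3):
    negatives = []
    enc = tok(source, return_tensors="pt", truncation=True)
    for ent in nlp(reference).ents:
        prompt = reference[: ent.start_char]
        prompt_ids = tok(prompt, return_tensors="pt", add_special_tokens=False).input_ids
        # Seed the decoder with its start token followed by the prompt tokens.
        start = torch.tensor([[summ.config.decoder_start_token_id]])
        decoder_input_ids = torch.cat([start, prompt_ids], dim=-1)
        outputs = summ.generate(**enc, decoder_input_ids=decoder_input_ids,
                                do_sample=True, top_p=0.7,
                                num_return_sequences=num_samples, max_length=64)
        negatives.extend(tok.decode(o, skip_special_tokens=True) for o in outputs)
    return negatives  # downstream filtering keeps samples that add a new entity
```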
5.4   System Generation

Motivated by the model confidence analysis in § 4, we explore using system generated summaries as negative samples. We first run fine-tuned BART or PEGASUS on the same training set to decode summaries. For each summary, we check the model confidence on the first token of each proper noun and number span. If the probability is below a threshold, we keep it as a negative sample (SYSLOWCON). The threshold is tuned by maximizing F1 based on our error annotations.

We consider all beams at the last decoding step as candidates. We use beam sizes of 6 and 4 for XSum and CNN/DM. Statistics of negative samples constructed by different strategies are in Appendix B.
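One way to realize the SYSLOWCON check is sketched below: each candidate summary is re-scored with teacher forcing to obtain per-token probabilities, and a summary is kept as a negative sample if the first token of any proper-noun or number span falls below a threshold. The checkpoint, the 0.3 threshold, and the offset-matching logic are assumptions; the paper tunes the threshold against its error annotations.

```python
# Sketch of the SYSLOWCON filter: keep a decoded summary as a negative sample when the
# first token of a proper-noun or number span is generated with low probability.
import spacy
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

nlp = spacy.load("en_core_web_sm")
tok = AutoTokenizer.from_pretrained("facebook/bart-large-xsum")       # assumed checkpoint
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-xsum").eval()

@torch.no_grad()
def summary_token_probs(source, summary):
    """Re-score the summary with teacher forcing and return per-token probabilities."""
    enc = tok(source, return_tensors="pt", truncation=True)
    labels = tok(summary, return_tensors="pt", truncation=True).input_ids
    logits = model(**enc, labels=labels).logits
    return logits.softmax(-1).gather(-1, labels.unsqueeze(-1)).squeeze(-1)[0]

def is_low_confidence(source, summary, threshold=0.3):
    probs = summary_token_probs(source, summary)
    offsets = tok(summary, return_offsets_mapping=True, truncation=True)["offset_mapping"]
    word_starts = [t.idx for t in nlp(summary) if t.pos_ in ("PROPN", "NUM")]
    for i, (start, end) in enumerate(offsets):
        covers_start = any(start <= s < end for s in word_starts)
        if covers_start and probs[i] < threshold:
            return True        # low-confidence entity/number start: use as negative sample
    return False
```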
6   Experiment Setup

Evaluation Metrics. QuestEval (Scialom et al., 2021) is used as the main metric to evaluate summaries' factual consistency. Given an article and a summary, QuestEval first generates natural language questions for entities and nouns from both. A QA model then consumes the article to answer questions derived from the summary, producing a score. Another score is obtained from a QA model addressing article-based questions after reading the summary. The final QuestEval score is the harmonic mean of the two. We use the version with learned weights for questions, which has shown high correlation with human-judged consistency and relevance.

We further use FactCC (Kryscinski et al., 2020), trained based on their negative sample construction method, to measure whether the summary can be entailed by the source. We also report ROUGE-L (Lin, 2004). Both FactCC and ROUGE-L reasonably correlate with summary factuality as judged by humans (Pagnoni et al., 2021).

Based on our error annotations, we report the correlations between each metric and the error rate (percentage of tokens being part of an error span) as well as the raw number of errors (Table 2). QuestEval correlates better on both aspects than the other metrics.

Metric       XSum                   CNN/DM
             % of Err   # of Err    % of Err   # of Err
QuestEval    -0.43∗     -0.25∗      -0.33∗     -0.29∗
FactCC       -0.02      -0.15∗      -0.13∗     -0.12∗
ROUGE-1      -0.16∗     -0.02       -0.03      -0.06
ROUGE-2      -0.11∗     -0.05       -0.02      -0.04
ROUGE-L      -0.13∗     -0.03       -0.06      -0.08

Table 2: Pearson correlation between metrics and error rates and numbers of errors. ∗: p-value < 0.05.
Comparisons. In addition to the models fine-tuned with cross-entropy loss (CRSENTROPY), we consider reranking beams based on the FactCC score (also our metric) at the last decoding step (ENTAILRANK). We also include three common methods of improving factuality: (1) CORRECTION fine-tunes BART to fix summary errors as a separate step (Cao et al., 2020). (2) SUBSETFT fine-tunes large models based on training samples without any dependency relation error (Goyal and Durrett, 2021), with the released checkpoint only available for XSum. (3) FASUM modifies the Transformer by incorporating knowledge graphs for factual consistency (Zhu et al., 2021), with model outputs only available on CNN/DM.

Moreover, we compare with unlikelihood training that penalizes the probabilities of all tokens in a negative sample (Li et al., 2020). Given a negative sample $y'$, the loss is defined as $-\sum_{t=1}^{|y'|} \log\left(1 - p(y'_t \mid y'_{1:t-1}, x)\right)$, where $p(y'_t \mid y'_{1:t-1}, x)$ is the output probability at the $t$-th step. We combine the unlikelihood training objective with the cross-entropy loss with equal weights for fine-tuning.
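For reference, the unlikelihood objective above can be sketched in a few lines of PyTorch; shapes and names are illustrative.

```python
# Sketch of the unlikelihood objective for one negative sample y':
#   -sum_t log(1 - p(y'_t | y'_{<t}, x)), added to the cross-entropy loss with weight 1.
import torch

def unlikelihood_loss(logits, negative_ids, pad_id, eps=1e-6):
    """logits: [seq_len, vocab_size] decoder outputs; negative_ids: [seq_len] tokens of y'."""
    probs = logits.softmax(-1).gather(-1, negative_ids.unsqueeze(-1)).squeeze(-1)
    per_token = -torch.log((1.0 - probs).clamp(min=eps))
    mask = (negative_ids != pad_id).float()          # ignore padding positions
    return (per_token * mask).sum()

def combined_objective(ce_loss, logits, negative_ids, pad_id):
    return ce_loss + unlikelihood_loss(logits, negative_ids, pad_id)
```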
Lastly, we compare our negative sample strategies with negative samples constructed for training the FactCC scorer, denoted as FCSAMPLE. For CL only, we compare with using other samples in the same batch as negative samples (BATCH), a common practice for CL-based representation learning (Gao et al., 2021; Zhang et al., 2021a).

7   Results

7.1   Automatic Evaluation

We report results by models fine-tuned from BART and PEGASUS with different objectives and negative samples on XSum and CNN/DM in Tables 3 and 4. CLIFF models use a summary representation of averaging over all tokens with MLP projection, with other variants discussed in § 7.3. Unless explicitly stated, comparison models are fine-tuned from the same large model used by CLIFF.

Model              XSum                       CNN/DM
                   QEval    FC      R-L       QEval    FC      R-L
Comparisons without Negative Samples
CRSENTROPY         33.09    23.92   37.14     50.94    49.07   40.82
ENTAILRANK         32.95    38.45   36.55     51.02    49.84   40.89
CORRECTION         33.12    24.14   37.11     50.93    49.06   40.82
SUBSETFT           32.25    21.83   30.35     -        -       -
FASUM              -        -       -         50.73    50.58   37.18
Comparisons with Unlikelihood Training
FCSAMPLE           32.93    24.46   33.96     50.60    35.09   41.22
SWAPENT            32.90    23.94   34.96     49.77    32.37   40.18
MASKENT            33.21    24.22   33.89     51.01    48.57   41.23
MASKREL            33.18    23.50   34.56     51.00    48.35   41.15
REGENENT           32.41    24.12   37.08     50.97    48.59   41.07
REGENREL           30.86    24.58   37.18     50.97    48.42   41.14
SYSLOWCON          32.01    26.30   32.04     50.82    48.66   40.81
Our Method: CLIFF
BATCH              33.18    24.88   36.76     50.99    52.18   40.98
FCSAMPLE           33.15    24.50   36.72     51.02    49.62   41.06
SWAPENT            33.30    25.67∗  35.60     51.05    50.96   40.89
MASKENT            33.32∗   25.73∗  36.02     50.98    49.04   41.06
MASKREL            33.35∗   25.69∗  35.86     51.03    49.89   41.04
REGENENT           33.15    24.64   36.32     51.04    49.91   41.11
REGENREL           33.22    25.39   36.21     50.96    49.48   41.11
SYSLOWCON          33.35∗   25.47∗  36.19     51.05    50.05   41.01

Table 3: Results of models fine-tuned from BART on XSum and CNN/DM. QEval: QuestEval; FC: FactCC; R-L: ROUGE-L. The best result per metric per dataset is bolded. For models of unlikelihood training and CLIFF that use the same negative samples, the better of the two is highlighted with green. ∗: our model is significantly better than CRSENTROPY (approximation randomization test, p < 0.005).

Model              XSum                       CNN/DM
                   QEval    FC      R-L       QEval    FC      R-L
Comparisons without Negative Samples
CRSENTROPY         32.50    25.48   39.07     50.21    44.44   40.39
ENTAILRANK         32.42    41.90   38.47     50.15    61.04   40.67
CORRECTION         32.55    25.15   39.02     49.48    32.96   39.79
Comparisons with Unlikelihood Training
FCSAMPLE           32.79    25.37   38.46     50.63    45.45   39.28
SWAPENT            32.88    24.76   37.91     50.43    43.02   38.96
MASKENT            33.04    26.30   37.51     51.11    52.19   39.34
MASKREL            33.08    24.38   38.05     51.14    52.93   39.31
REGENENT           32.89    24.46   38.47     51.11    52.90   39.23
REGENREL           32.91    24.80   38.46     51.07    53.68   39.45
SYSLOWCON          31.66    26.06   34.03     50.92    51.08   39.19
Our Method: CLIFF
BATCH              32.64    24.96   38.42     51.03∗   51.81∗  39.38
FCSAMPLE           32.96∗   25.28   38.58     51.00∗   51.80∗  39.37
SWAPENT            33.09∗   25.09   38.58     51.16∗   52.97∗  38.95
MASKENT            33.09∗   25.75   38.12     51.13∗   53.60∗  39.24
MASKREL            33.06∗   25.28   38.37     51.17∗   53.34∗  39.36
REGENENT           33.09∗   24.48   38.33     50.99∗   52.18∗  39.28
REGENREL           33.16∗   24.82   38.30     51.16∗   53.21∗  39.25
SYSLOWCON          33.21∗   25.18   38.18     50.85∗   53.73∗  39.30

Table 4: Results of models fine-tuned from PEGASUS on XSum and CNN/DM. We report results on 5,000 randomly selected samples on CNN/DM, due to the long running time of QuestEval. For models of unlikelihood training and CLIFF that use the same negative samples, the better of the two is highlighted with green. ∗: our model is significantly better than CRSENTROPY (approximation randomization test, p < 0.005).

First, comparing with other factuality improvement models (top of the tables), almost all CLIFF models trained with different negative samples uniformly produce higher QuestEval scores across datasets with both large models, with the improvements more pronounced on XSum. Importantly, ROUGE scores for CLIFF models are comparable or better than baselines trained with cross-entropy, e.g., on CNN/DM as in Table 3. A similar trend is observed with the FactCC metric, especially when using PEGASUS as the base model (Table 4). Note that ENTAILRANK tends to yield significantly higher FactCC scores, though it obtains lower QuestEval scores than the cross-entropy baseline. Human inspection finds that ENTAILRANK can pick up beams with peculiar words of high FactCC scores, without improving factuality. Moreover, other comparisons based on post CORRECTION and model engineering (FASUM) only offer incremental gains. The sample selection-based method, SUBSETFT, sacrifices ROUGE scores significantly. Overall, CLIFF demonstrates stronger generalizability.

Second, CLIFF is more effective and robust than unlikelihood training with the same negative samples. According to Table 3, using 7 negative sample construction strategies on two datasets, CLIFF obtains higher QuestEval scores than unlikelihood training in 12 out of the 14 comparisons. Using PEGASUS, CLIFF also outperforms in 11 setups as listed in Table 4. Similar trends are found on FactCC and ROUGE-L. Another noteworthy point is that CLIFF's improvements over the cross-entropy baseline are more consistent, whereas unlikelihood training occasionally hurts factuality or ROUGE scores significantly. We believe the key advantage of CLIFF resides in its measure of representation similarities between positive and negative samples in the same batch, allowing models to better differentiate between correct and erroneous summaries.
Finally, among all variants, CLIFF trained with low confidence summaries as negative samples obtains the best QuestEval scores on the more abstractive dataset. As seen in Table 3, using low confidence summaries also improves FactCC scores on both datasets, and enhances ROUGE-L on the more extractive dataset CNN/DM. This indicates that system generated summaries organically contribute more diverse errors made by existing models, which are particularly suitable for our CL framework. As we use summaries generated by the same model for CLIFF training, one future direction is to use outputs by different models. For our mask-and-fill and source-conditioned regeneration strategies, we find that relation-anchored construction often beats the entity-anchored counterparts. This calls for efforts that steer the entity-driven methods in a more relation-focused direction.

Combining Strategies. We further show results by fine-tuning BART models using samples based on combined negative sample construction strategies in Table 5. As can be seen, combining SYSLOWCON and other strategies yields better QuestEval scores than models trained with negative samples from any single strategy, except for MASKENT and REGENENT on XSum. This signifies the importance of covering diverse types of errors in negative samples.

Strategy           XSum                       CNN/DM
                   QEval    FC      R-L       QEval    FC      R-L
SYSLOWCON          33.35    25.47   36.19     51.05    50.05   41.01
+ SWAPENT          33.40    25.50   35.50     51.32    53.95   40.57
+ MASKENT          33.21    25.47   35.91     51.16    51.90   40.66
+ MASKREL          33.39    25.20   35.70     51.24    52.48   40.80
+ REGENENT         33.31    25.07   35.94     51.21    51.86   40.91
+ REGENREL         33.38    24.97   36.03     51.13    50.85   40.97

Table 5: Results of fine-tuned BART with combinations of negative sample construction strategies.

7.2   Human Evaluation

Pairwise Comparison with Cross-entropy. We recruit the two human annotators from our summary error study, as well as another experienced annotator, to evaluate summary informativeness and factual consistency. For each article, the judges are shown summaries generated by the CRSENTROPY model and four other systems. They then rate each system summary against the CRSENTROPY summary. All four summaries generated by different factuality-improved models are shown in random order without system names, ensuring a fair comparison among them.

We randomly pick 100 articles from each dataset used in our error analysis study in § 4, and evaluate summaries generated by ENTAILRANK, unlikelihood training (ULL) with negative samples constructed by MASKENT, and CLIFF models trained with BATCH and SYSLOWCON negative samples. All are fine-tuned from BART. Detailed evaluation guidelines are in Appendix D.

(a) XSum
Model              Inform.                    Factual.
                   Win↑   Tie    Lose↓        Win↑   Tie    Lose↓
ENTAILRANK         3.3    84.3   12.3         23.7   71.7   4.7
ULL. MASKENT       6.3    80.7   13.0         26.3   62.0   11.7
CL. BATCH          5.3    80.0   14.7         21.7   68.0   10.3
CL. SYSLOWCON      8.7    78.3   13.0         31.3   61.7   7.0

(b) CNN/DM
Model              Inform.                    Factual.
                   Win↑   Tie    Lose↓        Win↑   Tie    Lose↓
ENTAILRANK         2.3    86.3   11.3         4.7    94.7   0.7
ULL. MASKENT       18.0   71.0   11.0         17.3   79.7   3.0
CL. BATCH          17.3   74.7   8.0          20.7   77.0   2.3
CL. SYSLOWCON      15.7   75.7   8.7          20.0   77.7   2.3

Table 6: Percentages of summaries that are better than, tied with, or worse than CRSENTROPY, in informativeness (Inform.) and factual consistency (Factual.). The Krippendorff's αs are 0.33 and 0.62 for the two aspects on XSum, and 0.34 and 0.89 on CNN/DM. Our CL method using low confidence summaries is more frequently rated as better for informativeness and factuality on the more abstractive dataset XSum.

Table 6 shows that on the more abstractive XSum data, CL trained with low confidence samples is more frequently rated as more informative and more factual than CRSENTROPY summaries. This echoes our automatic evaluations with QuestEval in § 7.1. On CNN/DM, all models trained with negative samples produce summaries with better informativeness and faithfulness. In contrast, ENTAILRANK summaries are less distinguishable from outputs by CRSENTROPY on both datasets, as more ties are found. We show sample outputs in Fig. 1, with additional examples in Appendix E.
Intrinsic vs. Extrinsic Errors. Next, the annotators are asked to label text spans with intrinsic and extrinsic errors as done in § 4. Fig. 4 shows that CL is more effective at reducing extrinsic errors than unlikelihood training on both datasets. We also observe slight decreases of world knowledge in the summaries (figure attached in Appendix D).

Figure 4: Portions of summaries with errors. CL models consistently reduce both types of errors.

Error Correction Operations. Finally, with reference to the CRSENTROPY summaries, human judges are instructed to label whether each system summary corrects any error made by CRSENTROPY using deletion of the incorrect content, substitution with factual information, or both. As seen in Fig. 5, CL-based models restore factually consistent information, e.g., by replacing erroneous names and numbers with correct ones, more frequently than unlikelihood training or entailment reranking.

Figure 5: Summaries use different portions of error correction operations. Contrastive learning with SYSLOWCON (CL.SLC) and BATCH (CL.B) substitutes errors with correct content more often than unlikelihood training with MASKENT and ENTAILRANK.

7.3   Variants of Summary Representation

Sample representation is critical for CL to be effective. Here we investigate the summary representation variants discussed in § 3. There are two major considerations: (1) Should we consider all tokens in a summary or only representative ones (e.g., entities or the last token)? (2) Should an additional transformation, i.e., an MLP, be used?

Experiments on XSum using three negative sample construction strategies demonstrate that averaging the decoder outputs of all tokens and adding an MLP projection yield the best overall performance, as shown in Table 7. The implications are at least two-fold. First, even for entity- or relation-triggered sample modifications, using more global context helps with CL training. Second, additional transformation can help avoid model degeneration. For instance, more nonsensical and repetitive content is produced by variants without MLP.

                     SWAPENT            MASKREL            SYSLOWCON
Rep.      MLP        QEval    FC        QEval    FC        QEval    FC
BART
Last      X          33.15    25.10     33.20    25.29     33.10    24.85
Last                 +0.13    +0.02     -0.01    -0.32     -0.07    -0.10
Entity    X          33.35    25.41     33.34    25.44     33.32    24.46
Entity               -0.13    -0.07     -0.14    -0.05     -0.29    +0.72
All       X          33.30    25.67     33.35    25.69     33.35    25.47
All                  -0.23    -0.80     -0.04    -0.48     -0.26    -0.40
PEGASUS
Last      X          33.07    25.45     32.99    25.09     33.18    24.94
Last                 -0.07    -0.56     +0.01    -0.01     -0.02    -0.04
Entity    X          33.03    25.43     33.05    24.77     33.20    24.59
Entity               +0.01    -0.34     -0.05    +0.64     -0.30    +0.05
All       X          33.09    25.09     33.06    25.28     33.21    25.18
All                  -0.11    +0.25     +0.03    -0.19     -0.02    -0.80

Table 7: Comparing different formulations of summary representation in CL. An X in the MLP column indicates that an MLP projection is used; for models without MLP, we display score changes from their counterparts. Overall, using all tokens with MLP produces better summaries.

8   Conclusion

We present CLIFF, a contrastive learning-based framework to promote faithfulness and factuality of abstractive summaries. CLIFF uses both references and summaries that are factually inconsistent with the articles to train systems to be better at discriminating errors from factual and salient content. We further study strategies that automatically create erroneous summaries by editing references or leveraging system outputs, inspired by our new summary error analysis on state-of-the-art models. Both automatic evaluation and human ratings show that CLIFF achieves consistent improvements over competitive comparison methods, and is generalizable across datasets with systems fine-tuned from different large models.

Acknowledgements

This research is supported in part by Oracle for Research Cloud Credits, National Science Foundation through a CAREER award IIS-2046016, and the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via contract # FA8650-17-C-9116. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein. We thank three anonymous reviewers for their valuable suggestions.
Figure 6: Probability distributions of generating the first tokens of nouns and verbs, grouped by extrinsic errors, intrinsic errors, world knowledge, and other correct tokens. No verb is annotated as world knowledge.

Figure 7: Probability distributions of generating the non-first tokens of proper nouns, numbers, nouns, and verbs, grouped by extrinsic errors, intrinsic errors, world knowledge, and other correct tokens. Non-first tokens do not exist for numbers and verbs, as they only contain single tokens.

Dataset     Train      Validation   Test
XSum        204,045    11,332       11,334
CNN/DM      287,227    13,368       11,490

Table 8: Numbers of samples in train/validation/test splits of XSum and CNN/DM.
A   Additional Analysis for Summary Error Annotation

We hire two fluent English speakers to annotate summary errors on XSum and CNN/DailyMail (CNN/DM). They annotate a common batch of 100 summaries generated by summarizers fine-tuned from BART and PEGASUS, with 50 articles in each batch. The two annotators are shown 50 HTML pages in a batch, each of which contains an article and the two summaries generated by the two models. The detailed annotation guideline is given in Fig. 9.

For our analysis of token generation probabilities, we additionally show the distributions of the first token's probability for nouns and verbs in Fig. 6. We also report the distributions of the non-first token's probability for proper nouns, numbers, nouns, and verbs in Fig. 7. As can be seen, tokens within extrinsic and intrinsic errors have high generation probabilities when they are non-first tokens.

B   Statistics for Datasets and Training Samples

Summarization Datasets. We follow the official data splits for the two datasets, with the number of samples in each split listed in Table 8.

Positive Samples. We observe unfaithful paraphrases by back-translation for some reference summaries, mainly due to the introduction of new entities and the rewriting of quoted text. Thus, we discard samples generated by back-translation that contain new entities or inconsistent quoted text. Finally, we obtain 182,114 and 91,468 positive samples by back-translation on XSum and CNN/DM, respectively.

Negative Samples. For consistency, we use the summarizer fine-tuned from BART in the RegenEnt, RegenRel (§ 5.3), and SysLowCon (§ 5.4) strategies. We tune a threshold to select negative samples from model generations in our SysLowCon strategy. The threshold is set to 0.21, with F1 scores of 73.99 and 40.49 on XSum and CNN/DM annotated samples generated by BART.

The numbers of negative samples constructed by each strategy for training on XSum and CNN/DM are shown in Table 9. SysLowCon constructs the fewest negative samples in total, yet it achieves the best results as reported in our main paper (§ 7.1), indicating that its negative samples are more effective for training.
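To make the positive-sample filtering above concrete, the following sketch drops a back-translated paraphrase if it introduces entities that are absent from the reference or changes quoted text. The NER model, the quote-matching pattern, and the comparison rule are assumptions for illustration, not necessarily the exact filters used to produce the counts reported above.

import re
import spacy

nlp = spacy.load("en_core_web_sm")  # any English NER pipeline works for this sketch

def keep_backtranslation(reference: str, paraphrase: str) -> bool:
    """Keep a back-translated positive sample only if it adds no new entities
    and leaves quoted text consistent with the reference (heuristic check)."""
    ref_entities = {ent.text.lower() for ent in nlp(reference).ents}
    par_entities = {ent.text.lower() for ent in nlp(paraphrase).ents}
    introduces_new_entity = bool(par_entities - ref_entities)

    ref_quotes = set(re.findall(r'"([^"]+)"', reference))
    par_quotes = set(re.findall(r'"([^"]+)"', paraphrase))
    quotes_consistent = par_quotes.issubset(ref_quotes)

    return not introduces_new_entity and quotes_consistent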
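The SysLowCon threshold above can likewise be pictured with a short sketch: a system summary is flagged as a negative sample when it contains low-confidence tokens, and the cut-off is chosen to maximize F1 against the human error annotations. The any-token selection rule and the threshold grid are assumptions for illustration.

from typing import List, Sequence

def has_low_confidence_token(token_probs: Sequence[float], threshold: float) -> bool:
    # Assumed rule: flag a summary if any token's generation probability
    # falls below the threshold.
    return any(p < threshold for p in token_probs)

def tune_threshold(prob_lists: List[Sequence[float]], is_erroneous: List[bool]) -> float:
    """Pick the threshold whose flags best match the human error labels (by F1)."""
    best_threshold, best_f1 = 0.0, -1.0
    for t in (x / 100.0 for x in range(1, 100)):
        preds = [has_low_confidence_token(ps, t) for ps in prob_lists]
        tp = sum(p and y for p, y in zip(preds, is_erroneous))
        fp = sum(p and not y for p, y in zip(preds, is_erroneous))
        fn = sum((not p) and y for p, y in zip(preds, is_erroneous))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_threshold, best_f1 = t, f1
    return best_threshold

Summaries flagged under the selected threshold would then serve as negative samples alongside those produced by the editing-based strategies.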