Go Figure! A Meta Evaluation of Factuality in Summarization

 

Saadia Gabriel♠   Asli Celikyilmaz♣   Rahul Jha♣   Yejin Choi♠♦   Jianfeng Gao♣

♠ Paul G. Allen School of Computer Science & Engineering, University of Washington
♣ Microsoft Research
♦ Allen Institute for Artificial Intelligence

{skgabrie,yejin}@cs.washington.edu
{aslicel,rajh,jfgao}@microsoft.com

Abstract

Text generation models can generate factually inconsistent text containing distorted or fabricated facts about the source text. Recent work has focused on building evaluation models to verify the factual correctness of semantically constrained text generation tasks such as document summarization. While the field of factuality evaluation is growing fast, we don't have well-defined criteria for measuring the effectiveness, generalizability, reliability, or sensitivity of the factuality metrics. Focusing on these aspects, in this paper, we introduce a meta-evaluation framework for evaluating factual consistency metrics. We introduce five necessary, common-sense conditions for effective factuality metrics and experiment with nine recent factuality metrics using synthetic and human-labeled factuality data from short news, long news and dialogue summarization domains. Our framework enables assessing the effectiveness of any new factual consistency metric on a variety of dimensions over multiple summarization domains and can be easily extended with new meta-evaluation criteria. We also present our conclusions towards standardizing the factuality evaluation metrics.

Figure 1: Example of a ground-truth CNN/DailyMail summary and transformed summary where key spans of the ground-truth summary (highlighted in green) contain factual errors (highlighted in red). Even though the transformed summary is less factual, the commonly used ROUGE summarization metric assigns higher values to that summary over the ground-truth summary when we compare against the original article as a reference.

1   Introduction

The goal of text generation systems is to produce text that is fluent, coherent, relevant, as well as factually correct. Recent progress in neural approaches to building semantically constrained text generation systems has shown tremendous improvements in this direction (Liu and Lapata, 2019; Guo et al., 2018; Durmus et al., 2020; Wang et al., 2020). However, an important issue in text generation systems is that they can yield factually inconsistent text, caused by somewhat distorted or fabricated facts about the source text. Especially in document summarization tasks, models that abstract away salient aspects have been shown to generate text with up to 30% factual inconsistencies (Kryscinski et al., 2019; Falke et al., 2019a; Zhu et al., 2020).

Commonly used metrics for measuring the quality of generated text fail to capture structural aspects of language and poorly correlate with human judgements (Hashimoto et al., 2019; Clark et al., 2019). As shown by Figure 1, simple transformations like copying filler terms from the source text and introducing logical negations to transform a factually grounded summary into a factually inconsistent summary can lead to a higher ROUGE score for the less factual summary when we compare the candidate summaries against the original source document.

For these reasons, the last few years have seen an increase in research on factual consistency evaluation metrics. A number of metrics have been proposed for measuring factuality via
proxy objectives like question-answering (QA) and facet overlap (Scialom et al., 2019; Durmus et al., 2020; Mao et al., 2020), raising a number of new research questions, including: “Do these metrics capture multiple aspects and dimensions of factuality in summarization?” and “Do these metrics capture factuality across a broader spectrum of domains?” It is unclear which of these metrics are suitable for evaluating different types of text generation methods. We think that answering these questions is key to determining the usability and effectiveness of recent factuality metrics, especially when considering previously under-explored summarization domains like dialogue summarization.

In this work, we propose the first evaluation framework, GO-FIGURE,¹ a meta-evaluation framework for assessing the effectiveness of factuality metrics across multiple dimensions and domains: extreme news summarization, multi-sentence news summarization, and dialogue summarization.² While most of the prior work in factual consistency concentrated on one dataset and domain, our framework allows us to test the robustness and accuracy of proposed metrics for evaluating factual consistency across domains to determine if metrics truly generalize. We primarily focus on summarization rather than open-ended generation, since the source document provides a natural grounding for factuality.

Our contributions are as follows: (i) a set of diagnostics for measuring sensitivity of metrics to different levels of factual inconsistency (i.e. are there statistically significant differences between metric results for less factual generations vs. more factual generations?), as well as sensitivity of metrics to types of factual errors (i.e. which lexical and semantic changes do metrics better capture?), (ii) a synthetic evaluation dataset of context/summary pairs from three summarization domains for measuring effectiveness of new factuality metrics. The evaluation dataset contains different levels of injected factual errors to simulate errors made by generation models, and finally (iii) an evaluation dataset of summaries generated by transformer-based models (Raffel et al., 2019; Rothe et al., 2019) annotated with types of factual errors. This provides a test-bed capturing the real distribution of errors made by generation models.

2   Factuality Metric Meta Evaluation

Since reference summaries may be an incomplete representation of the salient facts³ in a source document, or unavailable, we consider factuality in terms of how well candidate summaries are factually grounded with respect to the source document rather than reference summaries. We also assume that source documents are factually valid, without the use of external sources or databases for fact verification. We define a summary as having factual inconsistency level i if there are i errors present (i.e. a summary with no errors will have a factual inconsistency level 0, a summary with 1 error will have an inconsistency level 1, etc.).⁴ In section 2.1, we propose a set of necessary conditions for defining an effective factuality metric, including theoretical constraints on metric values and commonsense conditions. In sections 3.2.1 and 3.2.2 we describe the construction of diagnostic datasets to test these conditions in both a simulated and a generated setting. In section 2.2 we elaborate on how the conditions defined by our framework can be practically applied to measure sensitivity of metrics to changes in factuality.

2.1   How do we define a “good” factuality metric?

We define a set of five conditions for a text generation metric M(D, Si) that can effectively measure factual consistency of a summary Si with respect to a source document D:

Boundedness (Condition I). We define this formally such that if Sf is a completely factual summary and Sr is a completely factually inconsistent summary,⁵ we expect that

    M(D, Sr) ≤ M(D, Si) ≤ M(D, Sf).        (1)

In other words, the metric values should be reasonably bounded and these bounds should relate to the factuality of candidate summaries.

¹ General Outline for Factuality In Generative UndeRstanding Evaluation.
² We will publicly release code for our evaluation framework and diagnostic datasets soon.
³ Following Kryscinski et al. (2019), we define facts as salient spans in a source document, i.e. spans of text highlighting key information pertaining to entities in the source documents, actions that were performed, statistical reporting, etc. These spans either support or refute claims in summaries.
⁴ For the simulated data, it is not always possible to introduce i errors (See section 3.2.1). In this case, i is the maximum number of errors.
⁵ We define a summary that is completely factually inconsistent as a summary that contains no facts that overlap with the facts in the source document.
 Stat                                 XSUM            CNNDM            SAMSUM
 Avg #Words (summ/source)             22.06/393.72    63.41/758.13     20.72/94.61
 Avg #Entities (summ/source)          2.82/40.87      6.63/51.75       3.76/10.6
 Avg #Pronoun words (summ/source)     0.43/13.16      2.09/30.55       0.94/7.45
 Avg #Verbs (summ/source)             2.49/46.75      7.65/84.39       2.99/10.75
 Avg #Adjectives (summ/source)        1.44/27.56      3.55/46.13       1.01/4.85

Table 1: Dataset statistics for summaries (summ) and source documents (source) in the evaluation sets. Corresponding values
for summaries are given on the left and values for source documents are given on the right.

Sensitivity (Condition II). We define factual inconsistency level as a measure that indicates the differences between metric results, e.g., between less factual generations vs. more factual generations. We calculate the sensitivity score for a given metric based on the magnitude of the slope of the best-fit line between the factual inconsistency level and average metric values (i.e. the estimated rate at which metric values change with the level of factual inconsistency):

    Sensitivity = | Σ_{i=0}^{L} (i − L̄)(M̄i − M̄) / Σ_{i=0}^{L} (i − L̄)² |        (2)

In Eq. (2), L is the maximum error level, L̄ is the average error level, M̄i is the average value for the metric at error level i, and M̄ is the average value for the metric across all error levels.

If a metric is sensitive to changes in factuality, it should hold that there is a statistically significant difference between M̄i and M̄i+1.
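As a concrete illustration, the sensitivity score of Eq. (2) is simply the absolute slope of a least-squares fit between error level and mean metric value. The sketch below assumes metric values have already been collected per inconsistency level; the function and variable names are illustrative and not taken from the released framework.

```python
import numpy as np

def sensitivity_score(metric_values_by_level):
    """|slope| of the best-fit line between error level i and mean metric value M̄i.

    metric_values_by_level: list indexed by error level i = 0..L, where entry i
    holds the metric values of all summaries with factual inconsistency level i.
    """
    levels = np.arange(len(metric_values_by_level))                        # i = 0, ..., L
    level_means = np.array([np.mean(v) for v in metric_values_by_level])   # M̄i
    grand_mean = level_means.mean()                                        # M̄
    centered = levels - levels.mean()                                      # i - L̄
    slope = np.sum(centered * (level_means - grand_mean)) / np.sum(centered ** 2)
    return abs(slope)

# Example: a metric whose average value drops as more errors are injected.
print(sensitivity_score([[0.80, 0.82], [0.70, 0.72], [0.61, 0.63], [0.50, 0.52]]))  # ~0.10
```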
Robustness (Condition III). The metric should be robust to types of factual errors, i.e. it should be able to capture both intrinsic entity errors and other types of factual errors like pronoun errors. See Table 3 for a list of some factual error types we consider.

Generality (Condition IV). The metric should be generalizable across domains, i.e. if it satisfies the previously defined conditions on a domain A, it is expected that it also satisfies these conditions on a domain B. We acknowledge there will likely be corner cases for which this is not true for any metric, as well as domains for which factuality itself is difficult to define (story generation, for example), so we only consider the domains for which factual consistency evaluation is most obviously applicable.

Commonsense (Condition V). If human annotators judge a summary Si to be more factually consistent than a summary Sj, then we expect that

    H(D, Si) ≥ H(D, Sj).        (3)

where H is the human judgement score on factuality. In other words, the metric should correlate with human judgements of factuality.

2.2   Testing Factuality Metric Validity

For the purposes of testing boundedness (Condition I), we define the Lower Bound for a metric M as M(D, Sr), where D is the source document and Sr is a randomly sampled summary from the corpus. We define the Upper Bound for the metric as M(D, Sf), where Sf is the reference ground-truth summary.

To test sensitivity (Condition II), we report the sensitivity score (Eq. 2) and measure whether the differences between metric results are different for various levels of factual inconsistency, and whether these differences are statistically significant. For this test, we measure the correlation (Pearson's r) between the factual inconsistency level of the summaries (i.e. the number of injected errors) and the average metric score. Then we measure statistical significance using the p-value from a two-tailed hypothesis test. We check whether metrics satisfy robustness and generality (Conditions III and IV) by separately running this analysis over multiple factual error types and domains/tasks. We measure commonsense by checking the correlation between factual consistency levels determined using manual annotation and metric values.
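These checks can be wired together for any metric exposed as a callable metric(document, summary). The following is a minimal sketch of the procedure under that assumption, using SciPy for the correlation test; it is an illustration rather than the released implementation.

```python
import random
from statistics import mean
from scipy.stats import pearsonr

def check_boundedness(metric, docs, ref_summaries, transformed_summaries):
    """Condition I: randomly paired summaries (lower bound) should score below the
    transformed summaries, which should score below the references (upper bound)."""
    lower, mid, upper = [], [], []
    for i, doc in enumerate(docs):
        others = [s for j, s in enumerate(ref_summaries) if j != i]
        lower.append(metric(doc, random.choice(others)))       # M(D, Sr)
        mid.append(metric(doc, transformed_summaries[i]))      # M(D, Si)
        upper.append(metric(doc, ref_summaries[i]))            # M(D, Sf)
    return mean(lower) <= mean(mid) <= mean(upper)

def check_sensitivity(metric, docs, summaries_by_level):
    """Condition II: Pearson correlation (with two-tailed p-value) between the
    number of injected errors and the metric value."""
    levels, scores = [], []
    for level, summaries in enumerate(summaries_by_level):
        for doc, summary in zip(docs, summaries):
            levels.append(level)
            scores.append(metric(doc, summary))
    r, p = pearsonr(levels, scores)
    return r, p  # a sensitive metric should give r < 0 with a small p-value
```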
3   Evaluation Datasets

We evaluate the factual consistency metrics on two categories of datasets: (i) available summarization datasets from varying domains, and (ii) diagnostic datasets that include both data simulated to evaluate different levels of factuality and model-generated data.
 Dataset      Train      Dev       Test      Domain
 XSUM         204,045    11,332    11,334    Short news
 CNNDM        287,227    13,368    11,490    Long news
 SAMSUM       14,732     818       819       Dialogues

Table 2: Summarization domains for evaluation.

3.1   Summarization Datasets

In this work we consider summarization domains to cover a broad range of topics, lengths of ground-truth summaries and levels of abstractiveness. In particular, we focus on accurately measuring factuality in the context of news and dialogue summarization, which is key for preventing the spread of misinformation in two different domains. For example, in dialogue summarization, it is important that a machine-generated summary of an exchange between a politician and a reporter at a press conference is factually consistent and doesn't hallucinate details about what was said. We considered the following three summarization domains (see Table 2 for dataset statistics):

Short News. To test the ability of metrics to measure factuality in the extreme news summarization domain, we use the XSUM dataset (Narayan et al., 2018), which contains over 200k BBC news articles paired with 1-sentence summaries.

Long News. We also test metrics on longer multi-sentence summaries from the CNN/DailyMail dataset (Nallapati et al., 2016), which tend to be more extractive than summaries in the XSUM dataset.

Dialogues. In contrast to news, dialogue summarization resources are relatively scarce. We use the recently released SAMSUM corpus (Gliwa et al., 2019) to test metrics on dialogue summarization. SAMSUM consists of English language conversations written by linguists in the style of chat messenger dialogues and aligned multi-sentence summaries.

Compared to the CNN/DailyMail dataset, XSUM is considered more abstractive based on the proportion of novel n-grams in gold summaries in comparison to the source documents (Narayan et al., 2018). Compared to structured news documents, the SAMSUM dialogue dataset is unstructured and contains chats between varying interlocutors; the text is first-person directed speech, while the summaries are written in a third-person point of view, which makes them highly abstractive in nature.

3.2   Diagnostic Datasets

To test the ability of proposed metrics to fulfill our predefined conditions, we set up two diagnostic datasets consisting of (i) transformed reference summaries with simulated factuality errors that allow us to induce and measure factual consistency in a controlled setting, and (ii) summaries generated by state-of-the-art transformer summarization models that allow us to measure the effectiveness of metrics in a real data setting.

3.2.1   Simulated Datasets

For each of the considered domains in section 3.1, we sample 500 source document / reference summary pairs. We then inject simulated factual errors into the reference summaries by randomly selecting entities, including pronoun words, or verbs and adjectives, to induce a desired level of factual inconsistency. We define the full list of errors we inject using transformations in Table 3.⁶

We notice that some transformations did not produce any change in the reference summary due to a lack of lexical features that can be changed (See Table 1 for the distribution of entity words, verbs and adjectives). For example, the XSUM reference summary “Warm, humorous, gutsy, sparky, soulful, determined and fun.” contains no entities or verbs that can be transformed. In addition, some transformations may have more of an effect on factuality than others (e.g. for the XSUM summary “You may know Bob best as the paramedic Finlay Newton in the BBC’s Casualty,” exchanging “Idris” for “Bob” would change a smaller ratio of the summary content words than exchanging “Idris” for “Finlay Newton”). Due to these reasons, we generate five different versions of each set of our diagnostic data when randomly selecting reference summary transformations and assessing the aggregated results (See Table 4 for the distribution of errors).

We control the factuality of transformed summaries by setting the maximum number of random transformations to 1, 2, or 3 injected errors, representing three different levels of factual inconsistency for sensitivity evaluations (Condition II).

⁶ See the Appendix for details. For verb negation, we focus on simple negations using “not” (e.g. “I agree” → “I do not agree”), rather than more complex negation (e.g. “I agree” → “I disagree” or “I agree” → “I beg to differ”).
Intrinsic entity error (int): An entity appearing in the source document is used incorrectly.
  Reference: “Irish Taoiseach (PM) Leo Varadkar has engaged in some “sock diplomacy” in his first meeting with Canadian Prime Minister Justin Trudeau in Dublin.”
  Transformed: “Canadian Taoiseach (PM) Leo Varadkar has engaged in some “sock diplomacy” in his first meeting with Irish Prime Minister Justin Trudeau in Dublin.”

Extrinsic entity error (ext): An entity appearing in the candidate summary does not appear in the source document.
  Reference: “Irish Taoiseach (PM) Leo Varadkar has engaged in some “sock diplomacy” in his first meeting with Canadian Prime Minister Justin Trudeau in Dublin.”
  Transformed: “French Taoiseach (PM) Leo Varadkar has engaged in some “sock diplomacy” in his first meeting with Canadian Prime Minister Justin Trudeau in Dublin.”

Pronoun error (pro): A pronoun in the candidate summary is used incorrectly (for example, her/she instead of him/he).
  Reference: “Irish Taoiseach (PM) Leo Varadkar has engaged in some “sock diplomacy” in his first meeting with Canadian Prime Minister Justin Trudeau in Dublin.”
  Transformed: “Irish Taoiseach (PM) Leo Varadkar has engaged in some “sock diplomacy” in her first meeting with Canadian Prime Minister Justin Trudeau in Dublin.”

Negation error (verb): There are verb negations in the candidate summary that contradict the source document.
  Reference: “Irish Taoiseach (PM) Leo Varadkar has engaged in some “sock diplomacy” in his first meeting with Canadian Prime Minister Justin Trudeau in Dublin.”
  Transformed: “Irish Taoiseach (PM) Leo Varadkar has not engaged in some “sock diplomacy” in his first meeting with Canadian Prime Minister Justin Trudeau in Dublin.”

Sentiment error (sent): An adjective or adverb appearing in the candidate summary contradicts the source document.
  Reference: “People who have been prescribed powerful anxiety or pain relief drugs are being warned about a new drug-driving law.”
  Transformed: “People who have been prescribed weak anxiety or pain relief drugs are being warned about a new drug-driving law.”

Table 3: Table of possible factual errors.
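A simplified sketch of how transformations like those in Table 3 could be implemented with spaCy is shown below. Only intrinsic entity swaps and simple “not”-style verb negation are illustrated, and the helpers are illustrative rather than the released code.

```python
import random
import spacy

nlp = spacy.load("en_core_web_sm")

def intrinsic_entity_swap(summary, source_document):
    """Replace an entity in the summary with a same-type entity from the source document."""
    summary_entities = list(nlp(summary).ents)
    if not summary_entities:
        return summary  # nothing to transform (no entities in the summary)
    target = random.choice(summary_entities)
    candidates = [e.text for e in nlp(source_document).ents
                  if e.label_ == target.label_ and e.text != target.text]
    if not candidates:
        return summary
    return summary.replace(target.text, random.choice(candidates), 1)

def negate_verb(summary):
    """Insert a simple "not"-style negation at the first auxiliary or past-tense verb."""
    for token in nlp(summary):
        if token.pos_ == "AUX":                            # "has engaged" -> "has not engaged"
            return summary.replace(token.text, token.text + " not", 1)
        if token.pos_ == "VERB" and token.tag_ == "VBD":   # "agreed" -> "did not agree"
            return summary.replace(token.text, "did not " + token.lemma_, 1)
    return summary
```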

                  Dataset                             Level 1 Avg.            Level 2 Avg.        Level 3 Avg.               Avg. % Transformed
                                                                                                                                (L1/L2/L3/All)
                  XSUM (Entity)                             0.59                   1.14                    1.61           58.84 / 76.44 / 86.28 / 73.85
                  XSUM (Non-Entity)                         0.48                   0.93                    1.28           48.32 / 74.00 / 85.40 / 69.24
                  CNNDM (Entity)                            0.75                   1.48                    2.17           74.92 / 85.68 / 94.48 / 85.03
                  CNNDM (Non-Entity)                        0.50                   1.05                    1.62           79.44 / 93.32 / 97.04 / 89.93
                  SAMSUM (Entity)                           0.59                   1.16                    1.70           58.96 / 77.32 / 87.56 / 74.61
                  SAMSUM (Non-Entity)                       0.49                   0.91                    1.28           48.52 / 72.80 / 84.12 / 68.48

Table 4: Analysis of simulated diagnostic dataset (we average across 5 different sets (runs) of randomized transformations
for the same 500 reference summaries). We provide results for the average number of induced factuality errors for factual
inconsistency level 1 (L1), level 2 (L2) and level 3 (L3), as well as the percentage (%) of summaries that were transformed for
each level and across all levels (All). We split the diagnostic dataset into two subsets based on whether simulated errors are
related to entities (Entity) or non-entity changes like verb negation (Non-Entity).

3.2.2   Model-Generated Datasets

To assess the performance of various metrics on actual generated text, we use a version of the T5 encoder-decoder summarization model (Raffel et al., 2019) that was pretrained on news summarization data, and generate summary text using either greedy decoding, beam search, or a sample-based decoding strategy like top-k (Fan et al., 2018) and Nucleus sampling (Holtzman et al., 2020). We conduct a fine-grained human evaluation of factuality over generated summaries to assess the effectiveness of our sensitivity analysis at highlighting metric strengths and weaknesses for generated summaries (See section 5.2).
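The generation step for this diagnostic set can be reproduced with the Hugging Face transformers library roughly as follows; the checkpoint name and decoding hyperparameters here are placeholders, since the paper uses a T5 variant pretrained on news summarization.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")  # placeholder checkpoint
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def summarize(document, strategy="beam"):
    inputs = tokenizer("summarize: " + document, return_tensors="pt",
                       truncation=True, max_length=512)
    if strategy == "greedy":
        output = model.generate(**inputs, max_length=60)
    elif strategy == "beam":
        output = model.generate(**inputs, max_length=60, num_beams=4, early_stopping=True)
    elif strategy == "top-k":
        output = model.generate(**inputs, max_length=60, do_sample=True, top_k=50)
    else:  # nucleus (top-p) sampling
        output = model.generate(**inputs, max_length=60, do_sample=True, top_p=0.9)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```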
4   Factuality Metrics for Evaluation

We mainly focus on meta-evaluating the most recently proposed factual consistency metrics, which use two types of proxy natural language understanding (NLU) objectives aimed at implicitly capturing factuality in generated text: question-answering (QA) and a masked token prediction cloze task. We also measure the factual-awareness of summarization metrics that are aimed primarily at improving coherency rather than factual consistency (e.g. BERTScore (Zhang et al., 2020) and BLEURT (Sellam et al., 2020)), and standard summarization evaluation metrics (e.g. ROUGE (Lin, 2004)). The following is the list of metrics we used for factual consistency evaluation:

QA-Based Quality Score. Given a source or reference document D and candidate summary Si, QA-based evaluation metrics assign a generation quality score to Si by measuring the ability of a QA system to accurately answer questions generated from D or Si. We use the SummaQA (Scialom et al., 2019) and FEQA (Durmus et al., 2020) metrics. For the SummaQA metric, questions are generated from the source document D and the candidate summary Si is used as input to the QA system. Alternatively, FEQA generates questions from Si and uses D to answer these questions.

The generation quality score is typically the aggregated F1 score measuring the similarity between ground-truth answers for questions generated from D and the answers predicted by the QA system. SummaQA also generally includes the aggregated model confidence probabilities for predictions.
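Schematically, the QA-based scores reduce to the loop below, shown in the FEQA direction (questions generated from the summary, answered from the document). The question-generation and QA components are passed in as callables because SummaQA and FEQA each ship their own models; this is an illustration rather than either metric's official implementation.

```python
from collections import Counter

def token_f1(prediction, reference):
    """SQuAD-style token-level F1 between a predicted and a gold answer."""
    pred, gold = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

def qa_quality_score(document, summary, generate_questions, answer):
    """Average answer-overlap F1 over questions generated from the summary.

    generate_questions(text) -> [(question, gold_answer), ...]   (assumed QG component)
    answer(question, context) -> str                             (assumed QA component)
    """
    qa_pairs = generate_questions(summary)
    if not qa_pairs:
        return 0.0
    scores = [token_f1(answer(question, document), gold) for question, gold in qa_pairs]
    return sum(scores) / len(scores)
```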
Masked LM Prediction (Cloze Task) Score. Given a source document D and candidate summary Si, cloze-based evaluation metrics assign a generation quality score to Si by measuring the ability of an NLU system to accurately predict masked tokens in the source document, given access to the information in Si. We use two variants of BLANC (Vasilyev et al., 2020), BLANC-Help and BLANC-Tune. BLANC-Help uses both D and Si as input to a pretrained masked token prediction model, while BLANC-Tune only uses D as input to a model that has been finetuned on the candidate summary. Both metrics are aimed at capturing fluency, informativeness and factual correctness of summaries.
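The intuition behind the cloze-style scores can be sketched as follows: mask words in the source document and compare how often a masked language model recovers them with and without the summary prepended as context. This is a loose, word-level simplification for illustration only, not the official BLANC implementation.

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
mlm = BertForMaskedLM.from_pretrained("bert-base-uncased")
mlm.eval()

def masked_recovery_rate(context, text, mask_every=4):
    """Fraction of masked words in `text` recovered by the MLM when `context` is prepended."""
    words = text.split()
    correct, total = 0, 0
    for i in range(0, len(words), mask_every):
        masked = words.copy()
        original = masked[i].lower().strip(".,!?\"'")
        masked[i] = tokenizer.mask_token
        joined = (context + " " if context else "") + " ".join(masked)
        inputs = tokenizer(joined, return_tensors="pt", truncation=True, max_length=512)
        mask_positions = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()
        if len(mask_positions) == 0:
            continue  # the masked word was truncated away
        with torch.no_grad():
            logits = mlm(**inputs).logits
        predicted_id = logits[0, mask_positions[0, 0]].argmax().item()
        predicted = tokenizer.convert_ids_to_tokens([predicted_id])[0]
        correct += int(predicted == original)
        total += 1
    return correct / max(total, 1)

def blanc_help_style(document, summary):
    # A positive gap means access to the summary helps reconstruct the document.
    return masked_recovery_rate(summary, document) - masked_recovery_rate("", document)
```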
Semantic Similarity. Semantic similarity metrics measure the overlap between contextual embeddings of a source or reference document D and candidate summary Si. We use BERTScore (Zhang et al., 2020), which has been shown to correlate better with human judgements of coherency than standard summarization metrics, and similarly to n-gram metrics on factual consistency of CNNDM summaries (Wang et al., 2020).

Lexical Overlap. Finally, we test ROUGE (Lin, 2004), which is the standard metric used for evaluating summarization. ROUGE measures the n-gram overlap between a source or reference document D and candidate summary Si. We evaluate results using ROUGE-1 and ROUGE-2, as well as ROUGE-L, which measures longest common subsequence overlap. We follow prior work that considered ROUGE in factual consistency evaluations (Wang et al., 2020), though it has also been previously noted that ROUGE can underweight good summarization examples (Novikova et al., 2017).
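As a usage note, both metrics can be computed against the source document (rather than a reference summary) with off-the-shelf packages; the strings below are toy examples.

```python
from bert_score import score
from rouge_score import rouge_scorer

source_document = ("Irish Taoiseach (PM) Leo Varadkar has engaged in some 'sock diplomacy' "
                   "in his first meeting with Canadian Prime Minister Justin Trudeau in Dublin.")
candidate_summary = "Leo Varadkar wore novelty socks when he met Justin Trudeau in Dublin."

# BERTScore: precision/recall/F1 of contextual-embedding matching,
# with the source document standing in for the reference text.
P, R, F1 = score([candidate_summary], [source_document], lang="en", verbose=False)
print("BERTScore F1:", F1.mean().item())

# ROUGE-1/2/L, again using the source document as the target.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
for name, result in scorer.score(source_document, candidate_summary).items():
    print(name, round(result.fmeasure, 4))
```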
5   Meta-Analysis of Factuality Metrics

5.1   Controlled Data Experiments

We conduct controlled experiments on the simulated datasets introduced in 3.2.1, mainly to measure the sensitivity of the factuality metrics to various simulated factuality errors. We provide the results of the sensitivity analysis over our simulated data on the XSUM domain in Table 5, on CNNDM in Table 6, and on SAMSUM in Table 7. All reported results are aggregated from metric values computed over five different sets of random transformations (See section 3.2.1 for details).

Differences between factuality and standard metrics are fragile when factuality is high. Our results collectively suggest that, while the metrics that focused specifically on factuality are more sensitive to changes in factuality compared to the standard lexical overlap or contextual semantic similarity metrics, all of these metrics except BLANC-Tune, ROUGE-1 and ROUGE-L satisfy the boundedness condition (Tables 5 and 6). Additionally, all metrics except SummaQA confidence scores (SummaQA-C) are sensitive for entity errors on the dialogue dataset (See Table 7). For CNNDM we find that all the metrics except BLANC-Tune and FEQA are sensitive to factual consistency to some degree when we consider entity errors (the most sensitive metric is SummaQA-F1 with a sensitivity score of 1.02), though the actual sensitivity effect size is very low for ROUGE and BERTScore (S < .01). For XSUM, SummaQA-F1 has the highest sensitivity score (0.34), but only FEQA, ROUGE-1, ROUGE-2 and BERTScore are negatively correlated with factual inconsistency with p ≤ 0.05. This indicates that when the factual inconsistency of summaries is relatively low, factuality metrics have high variance in terms of effectiveness at detecting differences in the levels of factual inconsistency. (Overall at least 26% of summaries are factually correct for XSUM and at least 15% are correct for CNNDM; see Table 4 for details.)

ROUGE is not always a valid factuality metric. Even when we remove the limitations on lexical overlap metrics posed by reference summaries (Novikova et al., 2017), we find that there is high variation across domains in the performance of source document-referenced ROUGE metrics at identifying factual inconsistency. While most other metrics fulfill our boundedness and sensitivity conditions, ROUGE-1 and ROUGE-L fail to be bounded (e.g. XSUM summaries with a factual inconsistency level of 3 have an average ROUGE-1 score of 10.92 while the upper bound is 10.61) or sensitive in the case of non-entity-based errors, with metric values actually increasing as factual inconsistency increases (ROUGE-1 has correlations of 0.98 and 0.96 on XSUM and CNNDM respectively, while ROUGE-L has correlations of 1 and 0.91). This implies that standard lexical overlap metrics are able to pick up on obvious lexical errors like those indicated by entity changes, but are inadequately sensitive to subtler changes like those captured by verb negation.

QA vs. Cloze. While masked token prediction (cloze task) metrics improve over ROUGE when it comes to detection of non-entity-based errors on
                CLOZE                         QA                                           STANDARD                                 CONTEXTUAL
                BLANC-Help   BLANC-Tune   SummaQA-C     SummaQA-F1    FEQA             R-1             R-2           R-L           BERTScore
 Upper Bound    5.99         1.73         9.64          4.48          27.87            10.61           2.56          9.32          83.76
 Level 1        5.73 / 5.98  1.74 / 1.71  9.44 / 9.44   3.80 / 4.31   23.20 / 26.94    10.49 / 10.76   2.54 / 2.56   9.22 / 9.42   83.53 / 83.56
 Level 2        5.46 / 5.99  1.59 / 1.78  9.27 / 9.35   3.40 / 4.22   20.05 / 26.55    10.40 / 10.86   2.51 / 2.54   9.16 / 9.49   83.36 / 83.38
 Level 3        5.30 / 5.97  1.58 / 1.76  9.16 / 9.23   3.13 / 4.14   15.81 / 26.06    10.33 / 10.92   2.49 / 2.52   9.10 / 9.55   83.21 / 83.26
 Lower Bound    1.67         0.25         8.69          1.40          7.07             6.72            0.01          0.06          80.97
 Sensitivity    0.22 / 0.01  0.08 / 0.02  0.14 / 0.11   0.34 / 0.09   0.04 / 0.10

Table 5: Metric values on the simulated XSUM diagnostic data across factual inconsistency levels (paired values: entity errors / non-entity errors).
[Figure 2 panels plot Distribution of Metric Values against Level of Factuality (# of Errors) for BLANC-Help (Pearson's R = 0.15, p = 0.15), BLANC-Tune (Pearson's R = 0.01, p = 0.89), FEQA (Pearson's R = -0.07, p = 0.60), SummaQA-C (Pearson's R = -0.22, p = 0.02), SummaQA-F1 (Pearson's R = -0.22, p = 0.03) and BERTScore (Pearson's R = -0.03, p = 0.76).]

Figure 2: Distribution of metric values evaluated in this work. The evaluations are on human-annotated generated XSUM dataset sample summaries across factuality levels for three different metric types: cloze task, question-answering and contextual metrics. The colored bounds indicate the variation across 100 samples. The colors indicate levels of factuality errors. The results shown are for all errors except “other” and “false quote.”

                                             60                                                Error Types                                                             attack, say doctors,”. Here, the actual implication
                                                                                               (0) - int
                                             50                                                (1) - pro
                                                                                                                                                                       in the ground truth summary is that blood tests can
                                             40
                                                                                               (2) - ext
                                                                                               (3) - verb
                                                                                                                                                                       lead to better diagnosis of heart attacks and less
                                             30
                                                                                               (4) - other                                                             unnecessary hospitalization rather than preventing
                                                                                               (5) - sent
                                             20
                                                                                                (6) - false quote                                                      heart attacks. We find that these types of false quote
                                             10
                                                                                                                                                                       errors appear more frequently in XSUM summaries
                                                                                                                                                                       than any other type of errors except the extrinsic
                                                      0         1         2            3       4       5                             6
                                                                                                                                                                       entity errors (ext).
Figure 3: Distribution of factual error types in T5 generated
XSUM summaries. The factual error types are described in                                                                                                                  Figure 2 shows the distribution of metric values
Table 3. ext: extrinsic entity error, int: intrinsic entity error,                                                                                                     across factuality levels for 100 sampled XSUM
pro:pronoun error, neg: negation error, sent:sentiment error,
false quote: hallucinated quotes                                                                                                                                       summaries (excluding “other” and “false quote”
                                                                                                                                                                       errors). Table 8 lists the correlation and p-values
                                                                                                                                                                       from Figure 2 including the ROUGE scores. We
(as described in Table 3) also appear in the anno-                                                                                                                     find that all of the metrics except Cloze task met-
tated generated summaries. We also discovered a                                                                                                                        rics and ROUGE-2 are negatively correlated, and
new category of error we define for human evalu-                                                                                                                       SummaQA metrics show a statistically significant
ation, “false quote”, which describes an instance                                                                                                                      correlation. This is in line with our findings on
of a hallucinated quote in the generated summary.                                                                                                                      the simulated data (form Table 5), where we found
For example, in a model generated XSUM sum-                                                                                                                            that SummaQA also performed best in terms of
mary is the claim “A blood test may help to save                                                                                                                       sensitivity score (SummaQA metrics were nega-
a man from having heart attacks, says a British                                                                                                                        tively correlated with factual inconsistency with
medical journal,” when the ground-truth summary                                                                                                                        a p-value of 0.07 compared to 0.05 for FEQA),
is “A blood test can more than halve the number of                                                                                                                     and cloze task metrics were not always bounded
or sensitive, and BLANC-Tune and ROUGE-2 had the least correlations on entity-based errors. However, in our analysis on generated data, we notice a difference in results for the FEQA, ROUGE-1 and ROUGE-L metrics. We find that these metrics are sensitive (with a score less than 0.05) on simulated XSUM data with entity-based errors, but are not significantly sensitive on generated data (though these metrics are negatively correlated). These findings indicate that the magnitude of the sensitivity score from the simulated data analysis, as well as its statistical significance, may be key to predicting results on generated data. When we consider all error types (Table 9), we find relatively little variation in most of the metric scores between factuality levels for the generated summaries, though there is some evidence that SummaQA metric values may be weakly correlated with factuality level.

     Metric         Correlation   p-value
     BLANC-Help        0.15        0.15
     BLANC-Tune        0.01        0.89
     SummaQA-C        -0.22        0.02*
     SummaQA-F1       -0.22        0.03*
     FEQA             -0.07        0.60
     R-1              -0.06        0.57
     R-2               0.13        0.20
     R-L              -0.03        0.74
     BERTScore        -0.03        0.76

Table 8: Correlation for annotated XSUM generated summaries (without “other” and “false quote” errors).

     Metric         Correlation   p-value
     BLANC-Help        0.10        0.33
     BLANC-Tune        0.01        0.94
     SummaQA-C        -0.17        0.09
     SummaQA-F1       -0.20        0.05*
     FEQA            < 0.01        0.99
     R-1               0.02        0.82
     R-2               0.14        0.15
     R-L               0.05        0.64
     BERTScore         0.06        0.53

Table 9: Correlation for annotated XSUM generated summaries (with all error types).

5.3   Discussion of Meta Evaluation

Our analyses show that, in contrast to prior work on factual consistency that mostly concentrated on one specific domain and dataset, our Go-Figure framework is effective at testing whether the performance of metrics generalizes across domains.
   We highlight the following key points from experiments run using our Go-Figure Meta Evaluation:

   The simulated data analysis highlights the same trends in best- and worst-performing factuality metrics as the human analysis. Metrics with low sensitivity scores on our simulated XSUM data perform poorly on human-annotated XSUM data, regardless of their correlation with factual consistency on simulated data. Conversely, high sensitivity scores on simulated data may be an indicator of better performance on human-annotated data. For the purposes of determining the most reliable factuality metric, this suggests that simulated data is sufficient, given that some metrics (e.g. SummaQA) have high sensitivity. For metrics like BERTScore and ROUGE, variance in performance between simulated and generated data is less predictable due to low sensitivity.

   Analysis on human-annotated data is still necessary when evaluating metrics. While BLANC-Help metric values decrease with factual inconsistency on simulated data, the metric is positively correlated with factual inconsistency on generated data. The differences between factuality metrics and lexical overlap metrics are also more clear-cut when considering generated summaries as opposed to transformed reference summaries. All ROUGE metrics may increase as factual consistency decreases when we consider the full set of error types (see Table 9), and while ROUGE-2 is the most promising lexical overlap metric in our simulated experiments, it is also positively correlated with factual inconsistency when we remove “other” and “false quote” errors. This emphasizes the importance of a human-annotated test set as part of the Go-Figure meta evaluation.

   The effectiveness of factuality metrics is most clear when factual consistency is low. While factuality metrics have higher sensitivity scores than standard lexical overlap or contextual metrics, our analyses show that ROUGE-2 and BERTScore metric values appear to be correctly correlated with factual consistency on reference summaries transformed with simulated factual inconsistencies. However, these metrics do not perform well on generated summaries, where there is more room for factual inconsistency.
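To ground the correlation analysis reported in Tables 8 and 9, the following is a minimal sketch of how automatic metric scores can be correlated with human factuality judgements. It assumes Pearson correlation via scipy.stats.pearsonr and a binary encoding of the annotations; the correlation variant, the encoding, and the toy inputs are illustrative assumptions rather than the authors' evaluation code.

# Minimal sketch (not the authors' code): correlate automatic metric scores
# with human factuality judgements, as in Tables 8 and 9.
# Assumptions: Pearson correlation and a binary encoding of "factual" labels.
from scipy.stats import pearsonr

def metric_factuality_correlation(metric_scores, human_labels):
    """metric_scores: one float per generated summary.
    human_labels: one bool per summary, True if annotators judged it factual."""
    encoded = [1.0 if label else 0.0 for label in human_labels]
    corr, p_value = pearsonr(metric_scores, encoded)
    return corr, p_value

# Hypothetical usage with toy values:
scores = [0.41, 0.38, 0.55, 0.47, 0.30]
labels = [True, False, True, True, False]
corr, p = metric_factuality_correlation(scores, labels)
print(f"correlation={corr:.2f}, p-value={p:.2f}")  # significance is typically flagged at p < 0.05

scipy.stats.spearmanr would be a drop-in replacement if only the ranking of the metric scores is considered meaningful.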
Limitations. Even though we define levels of factual inconsistency, our framework assumes that the correctness of a factual claim is binary, rather than scaled. However, it is possible that generated summaries are factually consistent but unfaithful in meaning because they carry different implications than ground-truth summaries. For example, the T5 summary “The UK should remain a member of the European Union?” and the matching ground-truth summary “Should the UK remain a member of the EU?” are both factually consistent and on-topic given the underlying news article, but the slight change in phrasing makes the T5-generated question appear to be a leading question rather than the more impartial framing of the original summary. This relates to the subjectivity of generated text, including generated misinformation (Zellers et al., 2019). Measuring shifts in faithfulness due to subjectivity is not explicitly captured by the current conditions of our framework.

6   Related Work

Factuality in Summarization. Recent efforts by NLP researchers have drawn attention to the issue of factual errors and hallucinations in the output of neural summarization models (Cao et al., 2018; Massarelli et al., 2019; Zhao et al., 2020). Kryscinski et al. (2019) used a simulated data collection procedure similar to ours for improving the factual consistency of models, though their simulated data is only used for training rather than evaluation, while Dusek et al. (2017) introduced a reference-less, model-based generation quality metric based on adversarial training with simulated examples. A number of works have highlighted the effectiveness of QA and cloze task objectives for evaluating or improving factuality in specific domains (Eyal et al., 2019; Huang et al., 2020). We aim to evaluate these metrics more broadly.

Evaluation Framework. Prior work concerning evaluation of automatic metrics for NLG systems has mainly focused on general evaluations of output quality or coherence and fluency (Callison-Burch et al., 2007; Graham, 2015; Fabbri et al., 2020), rather than factuality. Recent work has started to explore evaluating factuality and faithfulness in summarization (Falke et al., 2019b; Goodrich et al., 2019; Celikyilmaz et al., 2020). In particular, Maynez et al. (2020) compare the correlation of various summarization metrics with human judgements of factuality. We expand upon these prior analyses by also introducing a concretely defined framework for evaluating current and future factuality metrics. In contrast to earlier works, we also consider a broader range of domains (notably dialogue summarization).

7   Conclusion

We show that our meta-evaluation framework can be used to effectively evaluate the sensitivity and validity of factual consistency metrics with only reference summaries, rather than requiring computationally intensive testing across summarization model variants to identify metric strengths and shortcomings. The theoretically grounded nature of our metric conditions also allows for potential extensions to other use cases and text generation settings like data-to-text generation.
   In particular, our findings from applying the framework to summarization highlight that current metrics are capable of capturing obvious lexical errors (e.g. entity errors) in summaries, but struggle with errors related to more subtle aspects of semantics (e.g. negation and false quotes). Proposed future directions for improving the ability of metrics to capture a broader spectrum of factual inconsistencies include modifying QA metrics like SummaQA and FEQA to use more contextual question generation (QG) systems, such as commonsense QG (Shwartz et al., 2020), that allow for more nuanced fact-checking.

Acknowledgments

The authors thank Yichen Jiang and Shiyue Zhang for feedback on implementation, Hannah Rashkin and Tom McCoy for help with MSR GPU clusters, Rowan Zellers and Elizabeth Clark for pointers to related work, as well as other members of the UW NLP, MSR AI and MSR MSAI communities for helpful comments.

References

Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. 2007. (Meta-) evaluation of machine translation. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 136–158, Prague, Czech Republic. Association for Computational Linguistics.
Ziqiang Cao, Furu Wei, W. Li, and Sujian Li. 2018. Faithful to the original: Fact aware neural abstractive summarization. In AAAI.

Asli Celikyilmaz, Elizabeth Clark, and Jianfeng Gao. 2020. Evaluation of text generation: A survey. ArXiv, abs/2006.14799.

Elizabeth Clark, Asli Celikyilmaz, and Noah A. Smith. 2019. Sentence mover’s similarity: Automatic evaluation for multi-sentence texts. In ACL.

Esin Durmus, He He, and Mona Diab. 2020. FEQA: A question answering evaluation framework for faithfulness assessment in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5055–5070, Online. Association for Computational Linguistics.

Ondrej Dusek, Jekaterina Novikova, and V. Rieser. 2017. Referenceless quality estimation for natural language generation. ArXiv, abs/1708.01759.

Matan Eyal, Tal Baumel, and Michael Elhadad. 2019. Question answering as an automatic evaluation metric for news article summarization. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3938–3948, Minneapolis, Minnesota. Association for Computational Linguistics.

A. R. Fabbri, Wojciech Kryscinski, B. McCann, R. Socher, and D. Radev. 2020. Summeval: Re-evaluating summarization evaluation. ArXiv, abs/2007.12626.

Tobias Falke, Leonardo F. R. Ribeiro, Prasetya Ajie Utama, Ido Dagan, and Iryna Gurevych. 2019a. Ranking generated summaries by correctness: An interesting but challenging application for natural language inference. In ACL.

Tobias Falke, Leonardo F. R. Ribeiro, Prasetya Ajie Utama, Ido Dagan, and Iryna Gurevych. 2019b. Ranking generated summaries by correctness: An interesting but challenging application for natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2214–2220, Florence, Italy. Association for Computational Linguistics.

Angela Fan, M. Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. ArXiv, abs/1805.04833.

Bogdan Gliwa, Iwona Mochol, Maciej Biesek, and Aleksander Wawer. 2019. Samsum corpus: A human-annotated dialogue dataset for abstractive summarization. ArXiv, abs/1911.12237.

B. Goodrich, V. Rao, Mohammad Saleh, and Peter J. Liu. 2019. Assessing the factual accuracy of generated text. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.

Yvette Graham. 2015. Re-evaluating automatic summarization with BLEU and 192 shades of ROUGE. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 128–137, Lisbon, Portugal. Association for Computational Linguistics.

Han Guo, Ramakanth Pasunuru, and Mohit Bansal. 2018. Soft layer-specific multi-task summarization with entailment and question generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 687–697, Melbourne, Australia. Association for Computational Linguistics.

T. Hashimoto, Hugh Zhang, and Percy Liang. 2019. Unifying human and statistical evaluation for natural language generation. ArXiv, abs/1904.02792.

Ari Holtzman, Jan Buys, M. Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration. ArXiv, abs/1904.09751.

Luyang Huang, Lingfei Wu, and Lu Wang. 2020. Knowledge graph-augmented abstractive summarization with semantic-driven cloze reward. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5094–5107, Online. Association for Computational Linguistics.

Wojciech Kryscinski, B. McCann, Caiming Xiong, and R. Socher. 2019. Evaluating the factual consistency of abstractive text summarization. ArXiv, abs/1910.12840.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Yang Liu and Mirella Lapata. 2019. Hierarchical transformers for multi-document summarization. ACL.

Yuning Mao, Liyuan Liu, Qi Zhu, Xiang Ren, and Jiawei Han. 2020. Facet-aware evaluation for extractive summarization. In ACL.

Luca Massarelli, F. Petroni, Aleksandra Piktus, Myle Ott, Tim Rocktäschel, Vassilis Plachouras, F. Silvestri, and S. Riedel. 2019. How decoding strategies affect the verifiability of generated text. ArXiv, abs/1911.03587.

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan T. McDonald. 2020. On faithfulness and factuality in abstractive summarization. ArXiv, abs/2005.00661.

George A. Miller. 1995. Wordnet: A lexical database for english. Commun. ACM, 38(11):39–41.

Ramesh Nallapati, Bowen Zhou, C. D. Santos, Çaglar Gülçehre, and B. Xiang. 2016. Abstractive text summarization using sequence-to-sequence rnns and beyond. In CoNLL.

Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Don’t give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium.
Jekaterina Novikova, Ondrej Dusek, A. Curry, and Ver-
   ena Rieser. 2017. Why we need new evaluation met-
   rics for nlg. In EMNLP.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine
  Lee, Sharan Narang, Michael Matena, Yanqi Zhou,
  W. Li, and Peter J. Liu. 2019. Exploring the limits
  of transfer learning with a unified text-to-text trans-
  former. ArXiv, abs/1910.10683.

Sascha Rothe, Shashi Narayan, and A. Severyn. 2019.
  Leveraging pre-trained checkpoints for sequence
  generation tasks. Transactions of the Association for
  Computational Linguistics, 8:264–280.

Thomas Scialom, Sylvain Lamprier, Benjamin Pi-
  wowarski, and Jacopo Staiano. 2019. Answers
  unite! unsupervised metrics for reinforced summa-
  rization models. In EMNLP/IJCNLP.

Thibault Sellam, Dipanjan Das, and Ankur Parikh.
  2020. BLEURT: Learning robust metrics for text
  generation. In Proceedings of the 58th Annual Meet-
  ing of the Association for Computational Linguistics,
  pages 7881–7892, Online. Association for Computa-
  tional Linguistics.

Noam Shazeer and Mitchell Stern. 2018. Adafactor:
  Adaptive learning rates with sublinear memory cost.
  In ICML.

Vered Shwartz, Peter West, Ronan Le Bras, Chan-
  dra Bhagavatula, and Yejin Choi. 2020. Unsuper-
  vised commonsense question answering with self-
  talk. EMNLP.

Oleg V. Vasilyev, Vedant Dharnidharka, and J. Bo-
  hannon. 2020. Fill in the blanc: Human-free
  quality estimation of document summaries. ArXiv,
  abs/2002.09836.

Alex Wang, Kyunghyun Cho, and Mike Lewis. 2020.
  Asking and answering questions to evaluate the fac-
  tual consistency of summaries. In Proceedings of
  the 58th Annual Meeting of the Association for Com-
  putational Linguistics, pages 5008–5020, Online.
  Association for Computational Linguistics.

Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, F. Roesner, and Yejin Choi. 2019. Defending against neural fake news. In NeurIPS.

Tianyi Zhang, V. Kishore, Felix Wu, K. Weinberger, and Yoav Artzi. 2020. Bertscore: Evaluating text generation with bert. ArXiv, abs/1904.09675.

Z. Zhao, Shay B. Cohen, and B. Webber. 2020. Reducing quantity hallucinations in abstractive summarization. ArXiv, abs/2009.13312.

Chenguang Zhu, William Hinthorn, Ruochen Xu, Qingkai Zeng, Michael Zeng, Xuedong Huang, and Meng Jiang. 2020. Boosting factual correctness of abstractive summarization.
A   Appendices

A.1   Simulated Data Transformations

We inject errors into reference summaries by first using a part-of-speech tagging model and named entity recognition system (spacy)7 to extract entities, verbs, and adjectives from these summaries. For each named entity, we keep track of the label type (e.g. ORG, GPE, etc.).
   Intrinsic entity errors. To inject intrinsic entity
errors into a summary S, we construct a dictionary
of all unique entities appearing in the source doc-
ument for S only, organized by entity label type.
We then swap a random entity in the reference sum-
mary for a different entity of the same label type in
the constructed dictionary.
   Extrinsic entity errors. For extrinsic entity er-
rors, we use the same dictionary construction for all
unique entities appearing in all the corpus source
documents. To change a random adjective, we
use WordNet (Miller, 1995) to obtain the synsets
for that adjective and swap the adjective for its
antonym.
   Pronoun entity errors. Pronoun errors are in-
troduced with a preset list of commonly used pro-
nouns. We randomly extract a pronoun set (e.g.
she/her) from the text using the preset list and swap
it with another random pronoun set (e.g. he/him).
   Verb Negation. We use a rule-based system for
verb negation based on verb tense, and predict tense
based on the suffix and preceding words.
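
To make the transformations above concrete, the following is a minimal Python sketch of the intrinsic entity swap and the adjective antonym swap. It assumes the spacy en_core_web_sm model and NLTK's WordNet interface; the function names and sampling details are illustrative and do not reproduce the authors' exact implementation.

# Illustrative sketch of two of the error-injection transformations described in A.1.
# Assumes: spacy with en_core_web_sm installed, and the NLTK WordNet corpus downloaded
# via nltk.download("wordnet").
import random

import spacy
from nltk.corpus import wordnet as wn

nlp = spacy.load("en_core_web_sm")

def intrinsic_entity_swap(summary, source_document):
    """Swap one entity in the summary for a same-type entity from the source document."""
    doc_ents = {}
    for ent in nlp(source_document).ents:
        doc_ents.setdefault(ent.label_, set()).add(ent.text)
    summary_doc = nlp(summary)
    candidates = [e for e in summary_doc.ents
                  if len(doc_ents.get(e.label_, set()) - {e.text}) > 0]
    if not candidates:
        return summary  # no valid swap available
    target = random.choice(candidates)
    replacement = random.choice(sorted(doc_ents[target.label_] - {target.text}))
    return summary.replace(target.text, replacement, 1)

def antonym_swap(summary):
    """Swap one adjective in the summary for a WordNet antonym, if one exists."""
    adjectives = [t for t in nlp(summary) if t.pos_ == "ADJ"]
    random.shuffle(adjectives)
    for token in adjectives:
        for synset in wn.synsets(token.text, pos=wn.ADJ):
            for lemma in synset.lemmas():
                if lemma.antonyms():
                    antonym = lemma.antonyms()[0].name().replace("_", " ")
                    return summary.replace(token.text, antonym, 1)
    return summary  # no adjective with a WordNet antonym found

The extrinsic variant differs only in drawing replacement entities from a dictionary built over all source documents in the corpus, and the pronoun-swap and verb-negation errors follow the same substitute-in-place pattern using a preset pronoun list and a rule-based negator.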

A.2       T5 Training
We fine-tune the T5-base model trained on news
summaries for each domain using the AdaFactor
optimizer (Shazeer and Stern, 2018) with a learning
rate of 0.001 and a batch size of 8.
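
As a rough illustration of this setup, the sketch below fine-tunes T5-base with AdaFactor at a learning rate of 0.001 and a batch size of 8. The use of the Hugging Face transformers implementation, the toy data, and the tokenization lengths are assumptions for illustration; only the optimizer, learning rate, and batch size come from the description above.

# Minimal sketch (assumed Hugging Face transformers setup, not the authors' code)
# of fine-tuning T5-base for summarization with AdaFactor, lr=0.001, batch size 8.
import torch
from torch.utils.data import DataLoader
from transformers import Adafactor, T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")
# A fixed learning rate requires relative_step=False in this Adafactor implementation.
optimizer = Adafactor(model.parameters(), lr=1e-3,
                      relative_step=False, scale_parameter=False, warmup_init=False)

def collate(batch):
    # batch: list of (document, summary) string pairs; max lengths are illustrative.
    docs, sums = zip(*batch)
    inputs = tokenizer(["summarize: " + d for d in docs], max_length=512,
                       truncation=True, padding=True, return_tensors="pt")
    targets = tokenizer(list(sums), max_length=128, truncation=True,
                        padding=True, return_tensors="pt")
    labels = targets.input_ids.masked_fill(
        targets.input_ids == tokenizer.pad_token_id, -100)
    return inputs.input_ids, inputs.attention_mask, labels

# Toy in-domain data standing in for one of the summarization domains.
train_pairs = [("The council met on Monday to discuss the annual budget.",
                "Council discusses budget.")] * 8
loader = DataLoader(train_pairs, batch_size=8, shuffle=True, collate_fn=collate)

model.train()
for input_ids, attention_mask, labels in loader:
    loss = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()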
A.2.1 Human Annotation Layout
For human annotation of factual consistency in
summaries, we show annotators the source docu-
ment, reference summary and a candidate summary
that should be assessed for factuality. We then ask
a factuality question with three choices:

    • Yes (i.e. the summary is factual)

    • No (i.e. the summary contains factual incon-
      sistencies)

    • Not Sure (i.e. the summary is too incoherent
      to judge)
   If a summary is judged to be factually incorrect, annotators are allowed to select the number and type of errors they observe using a predefined list of factual errors. A screenshot of the error types and examples shown in the annotation task is given in Figure 4. Summaries were manually evaluated and labeled for factual inconsistency by a graduate student.

7 https://spacy.io/
Figure 4: Examples of factual errors given in the annotation task.
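
For reference, one possible representation of a single record from the annotation layout in A.2.1 is sketched below; the field names and example error-type strings are hypothetical, and the actual error taxonomy is the one shown to annotators in Figure 4.

# Illustrative sketch of one annotation record from the layout described in A.2.1.
# Field names and error-type strings are assumptions, not a released schema.
from dataclasses import dataclass, field
from typing import List, Literal

@dataclass
class FactualityAnnotation:
    source_document: str
    reference_summary: str
    candidate_summary: str
    judgement: Literal["yes", "no", "not_sure"]  # the three-way factuality question
    # Populated only when judgement == "no"; entries come from the predefined
    # error-type list shown to annotators (see Figure 4), e.g. "entity error".
    error_types: List[str] = field(default_factory=list)

    def is_factual(self) -> bool:
        return self.judgement == "yes"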