Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text

Sebastian Gehrmann, Elizabeth Clark, Thibault Sellam
Google Research, New York, NY
{gehrmann, eaclark, tsellam}@google.com

arXiv:2202.06935v1 [cs.CL] 14 Feb 2022

Abstract

Evaluation practices in natural language generation (NLG) have many known flaws, but improved evaluation approaches are rarely widely adopted. This issue has become more urgent, since neural NLG models have improved to the point where they can often no longer be distinguished based on the surface-level features that older metrics rely on. This paper surveys the issues with human and automatic model evaluations and with commonly used datasets in NLG that have been pointed out over the past 20 years. We summarize, categorize, and discuss how researchers have been addressing these issues and what their findings mean for the current state of model evaluations. Building on those insights, we lay out a long-term vision for NLG evaluation and propose concrete steps for researchers to improve their evaluation processes. Finally, we analyze 66 NLG papers from recent NLP conferences in how well they already follow these suggestions and identify which areas require more drastic changes to the status quo.

1 Introduction

There are many issues with the evaluation of models that generate natural language. For example, datasets are often constructed in a way that prevents measuring tail effects of robustness, and they almost exclusively cover English. Most automated metrics measure only similarity between model output and references instead of fine-grained quality aspects (and even that poorly). Human evaluations have a high variance and, due to insufficient documentation, rarely produce replicable results.

These issues have become more urgent as the nature of models that generate language has changed without significant changes to how they are being evaluated. While evaluation methods can capture surface-level improvements in text generated by state-of-the-art models (such as increased fluency) to some extent, they are ill-suited to detect issues with the content of model outputs, for example if they are not attributable to input information. These ineffective evaluations lead to overestimates of model capabilities. Deeper analyses uncover that popular models fail even at simple tasks by taking shortcuts, overfitting, hallucinating, and not being in accordance with their communicative goals.

Identifying these shortcomings, many recent papers critique evaluation techniques or propose new ones. But almost none of the suggestions are followed or new techniques used. There is an incentive mismatch between conducting high-quality evaluations and publishing new models or modeling techniques. While general-purpose evaluation techniques could lower the barrier of entry for incorporating evaluation advances into model development, their development requires resources that are hard to come by, including model outputs on validation and test sets or large quantities of human assessments of such outputs. Moreover, some issues, like the refinement of datasets, require iterative processes where many researchers collaborate. All this leads to a circular dependency where evaluations of generation models can be improved only if generation models use better evaluations.

We find that there is a systemic difference between selecting the best model and characterizing how good this model really is. Current evaluation techniques focus on the first, while the second is required to detect crucial issues. More emphasis needs to be put on measuring and reporting model limitations, rather than focusing on producing the highest performance numbers. To that end, this paper surveys analyses and critiques of evaluation approaches (sections 3 and 4) and of commonly used NLG datasets (section 5). Drawing on their insights, we describe how researchers developing modeling techniques can help to improve and subsequently benefit from better evaluations with methods available today (section 6).
[Figure 1 schematic: Train & Test Data, Model Hyperparameters, Automatic Metrics, Human Judgements, Qualitative Analysis, Quantitative Analysis; Evaluation; Model Developer; All Others.]

Figure 1: Even though the evaluation pipeline of a model is complex, with many steps and potential missteps that get "funneled" into the final results, it is often seen as a black box with the purpose of generating numbers that demonstrate superiority over competing approaches. We argue that more attention should be paid to the evaluation process and that the reporting of evaluation results should focus on the characteristics and limitations of a model.

Expanding on existing work on model documentation and formal evaluation processes (Mitchell et al., 2019; Ribeiro et al., 2020), we propose releasing evaluation reports which focus on demonstrating NLG model shortcomings using evaluation suites. These reports should apply a complementary set of automatic metrics, include rigorous human evaluations, and be accompanied by data releases that allow for re-analysis with improved metrics.

In an analysis of 66 recent EMNLP, INLG, and ACL papers along 29 dimensions related to our suggestions (section 7), we find that the first steps toward an improved evaluation are already frequently taken at an average rate of 27%. The analysis uncovers the dimensions that require more drastic changes in the NLG community. For example, 84% of papers already report results on multiple datasets and more than 28% point out issues in them, but we found only a single paper that contributed to the dataset documentation, leaving future researchers to re-identify those issues. We further highlight typical unsupported claims and a need for more consistent data release practices. Following the suggestions and results, we discuss how incorporating the suggestions can improve evaluation research, how the suggestions differ from similar ones made for NLU, and how better metrics can benefit model development itself (section 8).

2 Background

While "natural language generation" used to have a very narrow scope,1 today it is used broadly to refer to the production of natural language in any context, and NLG tasks include summarization, machine translation, paraphrasing, and story generation. For the purpose of this survey, we follow this broader definition, but focus on conditional generation tasks. We define conditional NLG tasks as those in which a machine learning model can be trained to maximize a conditional probability p(y|x) where y is natural language and x is an input that can be structured data or natural language and which provides information about what should be generated.2 The evaluation of conditionally generated text typically involves a comparison to the input and/or a reference text, neither of which is available in an unconditional generation setting. The scope of this survey thus includes tasks such as machine translation, summarization, and data-to-text generation, but excludes language modeling.
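As a point of reference (this is the standard autoregressive formulation rather than anything introduced by the survey itself), such a conditional model is usually factorized over output tokens and trained by maximum likelihood:

    \[
    p_\theta(y \mid x) \;=\; \prod_{t=1}^{|y|} p_\theta\!\left(y_t \mid y_{<t},\, x\right),
    \qquad
    \hat{\theta} \;=\; \arg\max_\theta \sum_{(x,\,y) \in \mathcal{D}} \log p_\theta(y \mid x),
    \]

where y_t is the t-th output token and D is a training set of paired inputs and reference texts.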
1 Reiter and Dale (1997) define NLG as the process of producing text from structured data and thus, text-to-text or unconditional generation tasks would not count as NLG.
2 We omit multimodal tasks like image captioning or speech-to-text, as well as those with non-textual output like sign language or audio from the scope of this survey since those tasks require vastly different evaluation processes.
In addition, we require in-scope NLG tasks to have an explicit communicative goal, which needs to be expressed while also planning the content and structure of the text and actualizing it in fluent and error-free language (Gehrmann, 2020).3 All these aspects need to be captured in the NLG evaluation, making it much more challenging than evaluating other NLP tasks. For an introduction to NLG beyond this survey, we point readers to the overview by Gatt and Krahmer (2018) for a deeper discussion of NLG tasks, and to the survey by Celikyilmaz et al. (2020) of the evaluation approaches and statistical methods that are discussed in Sections 3-4.

3 This requirement excludes most question-answering tasks since they require generating spans or otherwise non-fluent sequences of text.

Evaluation approaches for generated text have traditionally been categorized as intrinsic or extrinsic (Jones and Galliers, 1995). Intrinsic approaches evaluate a text by itself, whereas extrinsic approaches measure how it affects people performing a given task. Intrinsic evaluations include assessments by human ratings and by automatic metrics, which have gained popularity with the advent of statistical NLG (Langkilde and Knight, 1998), which led to the standardization of tasks. While some work exists that aims to standardize extrinsic evaluations (e.g., Mani et al., 1999; Gehrmann et al., 2019a), the design space is much larger. As a result, intrinsic approaches dominate academic publications; Gkatzia and Mahamood (2015) found that about 75% of published NLG systems rely on intrinsic evaluations, with the fraction increasing.4 Since we survey widely used approaches, we mostly cover intrinsic evaluations, but stress the importance of task-specific extrinsic evaluations.

4 Informally surveying recent *CL papers suggests a number of 90% or higher.

As pointed out by Reiter and Belz (2009a), the evaluation meta-evaluations we draw on are most commonly conducted on summarization and machine translation (MT), but there is an implicit assumption that findings translate to other tasks. To avoid this issue, we note the task for each study, but, due to a lack of prior findings, are not able to cover every NLG task. Taking a cautious approach, we make the worst-case assumption that modes of failure likely transfer across tasks.

3 Challenges of Automatic Evaluation

In this section, we provide an overview of common design principles of (intrinsic) automatic evaluation metrics, how these metrics are typically evaluated, what issues are being found, and how newly introduced metrics may overcome these issues in the future. Since not all evaluation strategies are being applied to all metrics and not all metrics are applied to all possible generation tasks, we can only provide an incomplete insight into the metric×task×evaluation method space. Since there currently exists no "perfect" metric, we will not conclude with explicit metric recommendations but rather try to extract successful metric design principles alongside a family of evaluations that together may provide a more complete characterization of a model's performance.

3.1 The Status Quo

Almost all commonly used generation metrics are reference-based: a system output o is compared to one or multiple human-produced references, {r1, ..., rn}. System outputs that are more similar to the references are deemed better. However, there have been many strategies to measure this similarity. The most popular evaluation metrics, BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004), along with many others, measure the lexical overlap between o and r in terms of precision and recall of n-grams. Variants and parameters control tokenization, stemming, or the balancing of precision and recall. With the advent of deep learning, metrics were introduced that instead measure distributional similarity, relying on various ways to measure the distance between two distributed token and sequence representations. Notable examples from this class of metrics are the word mover's distance (Kusner et al., 2015), which relies on non-contextual word embeddings, and BERTScore (Zhang et al., 2020), which aggregates cosine distances between represented tokens in a sequence, among others (Zhao et al., 2019; Clark et al., 2019; Kane et al., 2020; Colombo et al., 2021, inter alia). A related class of automatic evaluations comprises statistical approaches, which focus on the distributions, rather than representations, produced by a model. Saggion et al. (2010) first demonstrated that distributional differences between references and model outputs can be used as a scoring mechanism. Gehrmann et al. (2019b) showed that these differences exist even for large pretrained models, a fact that was used by Zellers et al. (2019) to train a classifier that detects generated text. Hashimoto et al. (2019) used the same foundation to combine human and automatic evaluation in capturing the trade-off between sampling diverse outputs and achieving the highest possible quality. Pillutla et al. (2021) expand on these insights and a framework by Djolonga et al. (2020) to compare the human and model distributions by measuring the extent to which they diverge. An alternative approach by Thompson and Post (2020) uses the probabilities of each model-generated token under a paraphrasing model that uses the human reference as input.
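To make the two families of reference-based metrics discussed above concrete, the sketch below scores an output against a reference with (a) n-gram precision/recall/F1, the core computation behind BLEU- and ROUGE-style metrics, and (b) a BERTScore-style greedy cosine matching over token embeddings. It is illustrative only: the embedding function is a random stand-in rather than a pretrained encoder, and real metrics add tokenization, stemming, smoothing, idf weighting, and multi-reference handling.

    from collections import Counter
    import numpy as np

    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def ngram_overlap(output, reference, n=1):
        """Precision/recall/F1 of n-gram overlap, the core of BLEU- and ROUGE-style metrics."""
        o, r = ngrams(output.split(), n), ngrams(reference.split(), n)
        overlap = sum((o & r).values())            # clipped counts
        precision = overlap / max(sum(o.values()), 1)
        recall = overlap / max(sum(r.values()), 1)
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1

    def greedy_embedding_f1(output, reference, embed):
        """BERTScore-style greedy matching over token embeddings (embed: token -> unit vector)."""
        O = np.stack([embed(t) for t in output.split()])
        R = np.stack([embed(t) for t in reference.split()])
        sim = O @ R.T                              # cosine similarities (vectors are normalized)
        precision = sim.max(axis=1).mean()         # best reference match for each output token
        recall = sim.max(axis=0).mean()            # best output match for each reference token
        return 2 * precision * recall / (precision + recall)

    # Toy usage with a random, deterministic embedding as a stand-in for a contextual encoder.
    rng = np.random.default_rng(0)
    _cache = {}
    def toy_embed(token):
        if token not in _cache:
            v = rng.normal(size=64)
            _cache[token] = v / np.linalg.norm(v)
        return _cache[token]

    print(ngram_overlap("the cat sat on the mat", "a cat sat on a mat", n=1))
    print(greedy_embedding_f1("the cat sat on the mat", "a cat sat on a mat", toy_embed))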
Utilizing existing corpora of human quality judgments of generated text, learned metrics are classifiers that emulate these judgments. Some metrics move beyond reference-based evaluation and instead provide quality estimation scores between an input i and output o. The first metric of this kind was CLASSY, a logistic regression model for summarization evaluation (Rankel et al., 2012). Newer metrics rely on pretrained models, are trained on more human ratings, and introduce initialization and pretraining schemes (Sellam et al., 2020; Rei et al., 2020; Pu et al., 2021; Wegmann and Nguyen, 2021, inter alia), or focus on specific aspects like the faithfulness of generated text (e.g., Kryscinski et al., 2020; Aralikatte et al., 2021). Many of these metrics rely on artificially introduced errors, but Cao et al. (2020) find that moving from artificial to real error detection is challenging, an issue that Zeng et al. (2021) aim to address by using adversarial examples instead.

The metrics mentioned so far operate on text directly, but there has also been a long history of metrics that generate and use intermediate structures. These include accuracy of parse trees (Bangalore et al., 2000), overlap between "basic elements" (Hovy et al., 2005),5 automatically constructed content units (Tauchmann and Mieskes, 2020) using the Pyramid framework by Nenkova and Passonneau (2004), dependency parses (Pratapa et al., 2021), or sequence alignment (Deng et al., 2021). A special case of intermediate structures that recently gained popularity are question-answering metrics that assess information equivalence. Similar to the faithfulness classifiers above, these aim to measure whether generated text contains the same information as a source or reference. Instantiations of these metrics may blank out entities (Eyal et al., 2019; Xie et al., 2021; Scialom et al., 2019), or fully generate questions (Chen et al., 2018; Wang et al., 2020; Durmus et al., 2020; Scialom et al., 2021; Rebuffel et al., 2021; Honovich et al., 2021; Deutsch et al., 2021a, inter alia).

5 ROUGE is a special case of this where basic elements are fixed size n-grams, but other basic element metrics like PARENT (Dhingra et al., 2019) only focus on content words.

This overview already points to the first issue with the state of metrics research: the metrics listed above, except those targeting machine translation, are designed to work only on English. A notable exception is a study by Briakou et al. (2021) which assesses different learned metrics for formality transfer and uses multilingual pre-trained models such as XLM-R (Conneau et al., 2020). While automatic metrics are well-studied, the barrier of entry to developing non-English models is growing.

3.2 Similarity to References is a Red Herring

Many automatic metrics rely on the assumption that NLG system outputs that are more similar to the reference(s) are better, a property commonly referred to as "human-likeness" in the NLG literature (see, e.g., Belz and Gatt (2008)). While the ability to reproduce a reference text sounds like natural evidence of success, relying entirely on it for evaluation is misleading—a caveat pointed out by many evaluation researchers. For instance, Belz and Gatt (2008) investigate the correlation between lexical overlap metrics (such as BLEU and ROUGE) and various measures of success in a Referring Expression Generation context. They find that "a system's ability to produce human-like outputs may be completely unrelated to its effect on human task-performance."

One reason for this discrepancy is that similarity-based evaluations reward surface similarity at the expense of meaning and may be "fooled" by similar-looking, yet semantically different, outputs. NLG tasks have an extensive output space which cannot be captured through a limited number of references, and a comparison to references becomes less reliable the more "open-ended" a task is. For that reason, ROUGE underperforms on non-extractive summaries (Dorr et al., 2005). The problem is especially poignant when the references themselves are flawed. As Dhingra et al. (2019) show, using BLEU and ROUGE is problematic with many table-to-text datasets, because there is a mismatch between the information conveyed by the reference texts and that of the input table. As a result, model outputs that contain similar unsupported information are rewarded by the metric. Similarly, Freitag et al. (2020) show that BLEU, METEOR, and BERTScore may fail to reward good translations when the reference text contains artifacts such as "translationese".

One may wonder whether the problem still exists with learned or embedding-based metrics, since a more flexible notion of similarity should enable metrics to be less reliant on surface-level features or text artifacts in references. However, this argument assumes that the set of references appropriately covers the target domain, and that the metric is flexible enough to "generalize" from an incomplete set of examples. The current empirical evidence for this is negative—in section 3.4 we will present several studies that show that even current metrics break down with simple adversarial examples (Sai et al., 2021; Kaster et al., 2021).

How to Interpret Similarity-Based Metrics?
If similarity to the reference is a flawed proxy for quality, what do automatic metrics tell us? This question can be investigated empirically by measuring the correlation between metric scores and human annotations. In a survey of such studies focused on BLEU, Reiter (2018) concludes that it is useful as a diagnostic tool during the development of MT systems, but not for other tasks, and that it should not be used at the segment level. More recently, Kocmi et al. (2021) assess how well automatic metrics compute pairwise rankings for MT systems, and recommend using a combination of overlap-based and pretraining-based metrics, confirming the previous findings that metrics may be used to rank MT models at the system level.
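The correlation analyses referenced throughout this subsection follow a common recipe; a minimal sketch with made-up numbers is below. Segment-level and system-level correlations are computed separately because they can diverge sharply, which is why a metric that ranks systems reasonably well may still be unusable for judging individual outputs.

    import numpy as np
    from scipy.stats import pearsonr, kendalltau

    # Hypothetical data: one metric score and one human rating per generated segment,
    # plus the id of the system that produced each segment.
    metric_scores = np.array([0.42, 0.55, 0.31, 0.75, 0.62, 0.48])
    human_ratings = np.array([3.0, 4.0, 2.5, 4.5, 3.5, 3.0])
    system_ids = np.array([0, 0, 1, 1, 2, 2])

    # Segment-level correlation: does the metric rank individual outputs like humans do?
    print("segment-level Pearson r:", pearsonr(metric_scores, human_ratings)[0])
    print("segment-level Kendall tau:", kendalltau(metric_scores, human_ratings)[0])

    # System-level correlation: average per system first, then correlate.
    sys_metric = [metric_scores[system_ids == s].mean() for s in np.unique(system_ids)]
    sys_human = [human_ratings[system_ids == s].mean() for s in np.unique(system_ids)]
    print("system-level Pearson r:", pearsonr(sys_metric, sys_human)[0])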
Several authors have tried to introduce finer-grained quality criteria, and attempted to understand which quality dimensions are captured by automatic metrics that measure the similarity to references. In most cases, there is inconclusive evidence. For instance, Reiter and Belz (2009b) find that these metrics may approximate language quality, although with only weak evidence, and that they do not measure content quality at all. In contrast, Stent et al. (2005) evaluate metrics on restructured sentences, showing that lexical-overlap based metrics do measure similarity in meaning, but fail at measuring syntactic correctness. The inconsistency between studies and use cases suggests that overlap-based metrics likely measure neither, which is confirmed by later studies.

In a more recent study, Kryscinski et al. (2019) 5-way annotated system outputs on 100 samples from the test set of the CNN-Dailymail summarization corpus (CNNDM, Hermann et al., 2015; Nallapati et al., 2016) along two measures of content quality (relevance of the content, and faithfulness) and two of linguistic quality (on the sentence- and summary-level) using raters from Mechanical Turk. Consistent with previous findings, they find that ROUGE does not significantly correlate with either of them. Extending the annotations by three expert judgments per data point and extending the analysis to more metrics, Fabbri et al. (2021) find similarly low correlations without significant performance improvements of distributional over lexical similarity metrics. Comparing correlations of these metrics across shared tasks from the Text Analysis Conferences (TAC) and CNN/DM and using a different annotation scheme, Bhandari et al. (2020b) corroborate the very low segment-level correlations and also find that no distributional metric outperforms ROUGE. Reanalyzing the data and addressing issues in the statistical tests, Deutsch et al. (2021b) come to the same conclusion about ROUGE, but note the insights should be carefully assessed since the data selection strategy for annotations, coupled with large confidence intervals, can lead to false results. Beyond summarization, Novikova et al. (2017a) note similarly poor segment-level correlations for data-to-text datasets.

All this shows that it is unclear what the results of embedding-based and lexical metrics represent, and it is questionable whether the numbers they produce can be trusted outside a few cases such as MT system ranking. To better understand their limitations and opportunities, we need large-scale corpora of high-quality human annotations, which do not yet exist for most NLG tasks.

The Myth of the Single Reliable Number
If human-likeness should not be used as a proxy measure for the quality of generated text, what should be used instead? Analyzing DUC 2004 data (Over and Yen, 2004), where human raters annotated the language quality and the coverage of a summary, i.e., how well it covered the meaning of the source, Graham (2015) found that there was almost no correlation between the two measures. However, language quality was a precondition for achieving high coverage, leading to a complex relationship between the two. The lack of correlation between language and content quality was also noted by Pitler et al. (2010) who find correlations between some evaluation categories. These insights, combined with the lack of strong correlations, suggest that a single number, as produced by almost all automatic metrics, cannot fully characterize an NLG system. Similar points are made by Deutsch and Roth (2021) who show that many similarity metrics capture the overlap in topics between two summaries much better than the overlap in their information.

Faithfulness is Not Single Dimensional Either
An aspect of quality mentioned above and which permeates all of NLG is faithfulness, and much recent work has focused on this aspect for abstractive summarization. Maynez et al. (2020) state that a model is not faithful if it hallucinates, that is, it adds information that is not present in the source document. They define multiple categories of hallucinations: Intrinsic hallucinations misrepresent facts in the input, for example turning a "former London mayoral candidate" into a "former London mayor". Extrinsic hallucinations ignore the input altogether, for example generating "President Sara" in the example above. Not all hallucinations are problematic—an extrinsic hallucination can be factual, and may, in fact, be desirable depending on the use case. For system evaluation, it is therefore important to be able to discern between hallucinations of different types, which cannot be done by producing a single number.
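As a rough illustration only (real annotation schemes and learned classifiers are far more involved), one way to operationalize a check for extrinsic hallucinations is to flag entity-like spans of a summary that never appear in the source. Note what such string matching cannot do: it does not catch the intrinsic "former London mayor" distortion, and it cannot decide whether a flagged span is nonetheless factual.

    import re

    def entity_like_spans(text):
        """Very rough stand-in for an NER system: capitalized multi-word spans and numbers."""
        return set(re.findall(r"(?:[A-Z][a-z]+(?:\s[A-Z][a-z]+)*|\d[\d.,]*)", text))

    def unsupported_spans(summary, source):
        """Spans in the summary that never appear in the source: candidate extrinsic hallucinations."""
        return {span for span in entity_like_spans(summary) if span not in source}

    source = "The former London mayoral candidate announced her campaign on Monday."
    summary = "President Sara, a former London mayor, announced her campaign."
    print(unsupported_spans(summary, source))   # -> {'President Sara'}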
Maynez et al. demonstrate that similarity metrics fail to measure faithfulness. The same failure is observed by Pagnoni et al. (2021) who introduce and collect annotations for an alternative typology of factual errors which involves fine-grained categories such as Coreference Error and Out of Article Error. In an alternative approach to measuring correlations with human judgments, Gabriel et al. (2021) inject factual errors in reference summaries, and check whether system rankings produced by metrics correlate with the "level of factuality" of the transformed sentences, among other properties like a metric's value range and generalization. They also identify that standard evaluation metrics (e.g., ROUGE-L and ROUGE-1) oftentimes fail at capturing factuality, but identify question-answering metrics as promising, somewhat contradicting Maynez et al. Similarly, Chen et al. (2021) analyze mispredictions on a set of previously annotated summarization corpora (Kryscinski et al., 2020; Wang et al., 2020; Falke et al., 2019; Maynez et al., 2020). The study identifies common error types (e.g., "Numerical inference") and constructs an adversarial test set with rule-based transformations. The diversity of approaches in the literature shows that evaluating factual truth is (perhaps unsurprisingly) a complex, ill-defined, and unsolved task. Additionally complicating this problem is that artificially introduced errors rarely match errors of real summarization models, which means that metrics trained on synthetic errors may not generalize to real systems (Goyal and Durrett, 2021).

Researchers have studied the validity of faithfulness metrics for other NLG tasks as well. For table-to-text, Thomson and Reiter (2020) report the performance of an information extraction-based metric (Wiseman et al., 2017) given different types of errors, and highlight typically problematic cases such as errors with names and numbers which are not detected by the metric. Taking all these points into consideration, we conclude that there is no consensus on how best to decompose and measure faithfulness and that even the best current approaches are typically flawed. However, we can also see a clear benefit to measuring specific aspects of output quality and thus encourage metric designers to stop treating output quality, and in particular faithfulness, like a one-dimensional problem.

Parameter Choices and Reproducibility
Despite these findings, most publications still use only a single metric to demonstrate improvements over prior systems. For example, 100% of papers introducing new summarization models at *CL conferences in 2021 use ROUGE and 69% use only ROUGE. This warrants a deeper look into how ROUGE and other metrics are used.

The most commonly reported ROUGE configurations are the F1 scores of ROUGE-1, -2, and -L. This choice was initially popularized by Rush et al. (2015), who picked a subset of the options used in DUC 2004 which also included 3, 4, and LW (Over and Yen, 2004). However, this choice was not empirically motivated, and from DUC 2005 onwards, the recall scores of ROUGE-2 and ROUGE-SU4 were even used instead (Dang, 2006).6 On top of the disconnect between the past and present choices, both of them are actually suboptimal. Rankel et al. (2013) find that rarely used configurations of ROUGE outperform the commonly used ones, and in an investigation of all 192 ROUGE configurations, Graham (2015) finds that none of them outperformed BLEU and that best performance was achieved with the precision variant of ROUGE-2. The studies by Kryscinski et al. (2019) and Fabbri et al. (2021) evaluate the F1-variants of multiple ROUGE versions and confirm the suboptimal setting. They find that ROUGE-1, -2, and -L perform strictly worse than ROUGE-3, -4, and -WE-1 across multiple rating dimensions.

6 Note though that DUC 2005 evaluated query-focused summarization instead of sentence compression which was the task studied by Rush et al. (2015).

Beyond using a suboptimal setup, additional parameters are often unclear; the most popular Python implementation, for example, uses a different list of stopwords compared to the original PERL script,7 but implementation details are rarely specified. That means that not only do we rely on a metric that consistently underperforms others, we are not even using it correctly or in a replicable manner. Beyond versioning issues, ROUGE was initially designed to evaluate English text, and it thus uses whitespace tokenization and an English stemmer and stoplist. Yet, it is commonly applied to other languages without mention of the exact changes to get it to run.
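As an illustration of how much configuration hides behind "we report ROUGE", the sketch below computes the common F1 variants with the rouge-score package (one of several Python reimplementations; exact argument names may differ across versions) and spells out, in comments, the choices that should be reported alongside the numbers.

    # pip install rouge-score
    from rouge_score import rouge_scorer

    prediction = "the cat was found under the bed"
    reference = "the cat was under the bed"

    # Every choice below changes the score and should be stated in the paper:
    # which variants are used, whether stemming is applied, how the text is
    # tokenized and sentence-split, and whether F1, recall, or precision is reported.
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    scores = scorer.score(reference, prediction)   # (target, prediction) order in this package

    for variant, result in scores.items():
        print(f"{variant}: P={result.precision:.3f} R={result.recall:.3f} F1={result.fmeasure:.3f}")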
Similar issues exist in modern frameworks as well, especially those that utilize pretrained models (Liao et al., 2021). For example, BERTScore (Zhang et al., 2020) is reported in many recent summarization publications, but the term BERTScore refers to the methodology instead of the underlying model. To combat the confusion between model versions, the library produces a unique hash, inspired by the SacreBLEU framework (Post, 2018). Yet, these hashes are often not reported, or are aggregated in incomparable ways.8

Another example of an often unreported design choice is how to use single-reference metrics in multi-reference setups. While ROUGE explicitly describes how to use it in multi-reference tasks,9 most neural metrics do not. For example, BLEURT (Sellam et al., 2020) only suggests taking the max of multiple scores without discussing tradeoffs compared to computing the mean.10 All these evaluation parameters can have a drastic influence over the validity of scores and can lead to incorrect comparisons or inflated scores.

7 The package can be found here. Anecdotally, wrappers around the original implementation can lead to changes of more than 0.5 points.
8 For example, Papers With Code for WMT 2014 en-de compares models on SacreBLEU score without hashes.
9 The multi-reference version of ROUGE represents a very generous upper bound in which results can only improve by adding a reference, never decrease, which can have other negative implications. Moreover, not all implementations may use the originally recommended method.
10 The alternative approach can be seen on the leaderboard of the ToTTo dataset (Parikh et al., 2020) where the mean of multiple BLEURT scores is reported.
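Two of these under-reported choices, multi-reference aggregation and metric signatures, can be made explicit in a few lines. The aggregation helper below is a generic sketch in which single_ref_metric is a placeholder for any single-reference metric (BLEURT or otherwise), and the SacreBLEU signature call shown in the comment follows the sacrebleu 2.x documentation as we understand it and may differ in other versions.

    import statistics

    def multi_ref_score(output, references, single_ref_metric, reduce="max"):
        """Aggregate a single-reference metric over multiple references.
        'max' rewards matching any one reference; 'mean' also penalizes references
        that are missed. Whichever is used should be reported."""
        scores = [single_ref_metric(output, ref) for ref in references]
        return max(scores) if reduce == "max" else statistics.mean(scores)

    def toy_overlap(output, reference):
        out, ref = set(output.split()), set(reference.split())
        return len(out & ref) / max(len(ref), 1)

    print(multi_ref_score("the cat sat", ["a cat sat", "the cat slept"], toy_overlap, reduce="max"))
    print(multi_ref_score("the cat sat", ["a cat sat", "the cat slept"], toy_overlap, reduce="mean"))

    # Reporting a metric signature, in the spirit of SacreBLEU (API as of sacrebleu 2.x):
    #   from sacrebleu.metrics import BLEU
    #   bleu = BLEU()
    #   result = bleu.corpus_score(system_outputs, [references])
    #   print(result.score, bleu.get_signature())   # publish the signature with the score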
3.3 Do Benchmarks Help?

To develop reliable metrics, it may be helpful to develop benchmarks to collect large-scale annotated evaluation data, which may then be used to train better metrics. This has been the approach in MT for over 15 years (Koehn and Monz, 2006), with metrics shared tasks organized as part of the yearly WMT workshop/conference. They have led to improved human annotation processes and metrics evaluation approaches, as well as almost all the learned metrics listed in section 3.1. As part of these shared tasks, Macháček and Bojar (2014) and Stanojević et al. (2015) used non-expert crowdworkers to perform 5-way comparisons between systems. However, they point out that 5-way comparisons are challenging to interpret as pairwise comparisons, which are required to compute segment-level Kendall-Tau correlations.

Addressing this issue, Bojar et al. (2016) experimented with three measuring techniques: the original 5-way ranking, direct assessments (DA) where outputs are evaluated by themselves, and HUME, a method which aggregates scores for semantic units. After promising results, Bojar et al. (2017) only used DA on a 0-100 scale and HUME. To compute correlations, DA annotations were converted into relative rankings, called DARR. The following year also abandoned HUME and fully relied on DA (Ma et al., 2018), and embedding-based metrics started strongly outperforming other metrics. The 2019 shared task introduced a quality estimation task in accordance with the DA data collection technique, illustrating how the human evaluation techniques can influence the design of metrics (Ma et al., 2019).

However, as metrics and systems improved further, the DA annotations proved insufficient to identify a "best" metric (Mathur et al., 2020), which led to another major change to the methodology (Freitag et al., 2021b). The latest evaluations thus followed the suggestion by Freitag et al. (2021a) to use Multidimensional Quality Metrics (MQM, Lommel et al., 2014), a fine-grained expert-based annotation approach. The results demonstrate that DA is unreliable for high-quality translations, often mistakenly ranking human translations lower than system outputs, whereas human translations are correctly identified as better than system outputs in MQM. Surprisingly, metrics correlate much better with MQM, even those trained on the DA annotations.
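For readers unfamiliar with MQM: instead of a single holistic rating, experts mark individual error spans with a category and severity, and a segment's score is a weighted sum of those errors. The sketch below uses made-up severity weights and error annotations purely for illustration; the exact weighting scheme of Freitag et al. (2021a) differs.

    # Illustrative severity weights only; the published MQM scheme differentiates more categories.
    SEVERITY_WEIGHTS = {"minor": 1.0, "major": 5.0, "critical": 10.0}

    def mqm_style_score(error_annotations):
        """Aggregate expert error annotations (category, severity) into a penalty per segment.
        Lower is better; 0 means no errors were marked."""
        return sum(SEVERITY_WEIGHTS[severity] for _category, severity in error_annotations)

    segment_errors = [("accuracy/mistranslation", "major"), ("fluency/punctuation", "minor")]
    print(mqm_style_score(segment_errors))   # 6.0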
Does this mean that focusing on DA was wrong? No, without many years of (suboptimal) data collection, we would not have learned metrics, and we would not know whether DA worked for MT. However, the progression also teaches the lesson that benchmarks may lead the field down the wrong path. A similar argument is made by Hirschman (1998), who critiques that benchmark evaluations take only a narrow approach, states that evaluation is intrinsically a cost-benefit trade-off, and further argues that we should weigh the divergent needs of stakeholders when designing evaluations, similar to Ethayarajh and Jurafsky (2020), who argue that not everyone may derive the same utility from an improvement on a leaderboard. Scott and Moore (2007) warn that NLG evaluation shared tasks could harm the field, since they may amplify issues with the data, and that benchmarks may lead people to ignore external evaluations and put too much emphasis on metrics that do not measure what we think they measure, both of which also happened. We thus can conclude that benchmarks are necessary, but that they need to be self-critical and explore different evaluation approaches.11

11 We also note that, in addition to DUC/TAC, there has been a long history of shared tasks in the NLG community addressing a much more diverse set of tasks, starting with referring expression generation (Gatt et al., 2008), but which have also covered tasks such as summarization (Syed et al., 2019) and data-to-text generation (Dusek et al., 2020).

3.4 Auditing and Interpreting Metrics

As seen through the WMT metrics shared tasks, machine learning-based metrics are promising, but a common criticism is that they are not transparent; it is often unclear how they operate internally and whether they can deliver high performance consistently across domains, tasks, and systems. Metric developers typically report agreement with human ratings on specific test subsets filtered on the property of interest, or they measure the change in a metric's value when perturbing a reference (e.g., by shuffling words). The idea to write tests for metrics, rather than reporting corpus-wide correlations, may partly be traced back to Lin and Och (2004), who pose that metrics should always rank a human-produced reference first when compared to multiple system outputs and thus measure how far the reference deviates from the first spot.12

12 As we discuss later, this strong assumption is rarely met for NLG datasets.

This section gives an overview of various research efforts that seek to evaluate automatic metrics experimentally, with each focusing on a specific aspect of the metric, such as its sensitivity to sequence length or to lexical overlap between the candidate and the reference.

Perturbation Analysis and Surrogate Models
One common methodology is to apply methods from the interpretability literature to understand what metrics focus on. In one such study, Kaster et al. (2021) measure to what extent several BERT-based metrics correlate with a simple linear model based on hand-crafted features. They find that these metrics are sensitive to lexical overlap despite the fact that the initial motivation for distributional similarity metrics was the over-reliance on lexical overlap of BLEU and ROUGE. The authors craft adversarial examples, and show that metrics can be fooled by lexically similar, non-paraphrase sentences. To the same end, Sai et al. (2021) conduct a correlation analysis after applying 34 perturbations that test the metrics' sensitivity to task-specific criteria (e.g., jumbling word order, introducing spelling errors for fluency, or changing numbers for correctness) using the Checklist method (Ribeiro et al., 2020). The results of this analysis, which covers 18 criteria across six tasks, indicate that trained metrics tend to do better, but tuning towards overall quality across tasks is a poor practice, leading to metrics that evaluate no individual aspect correctly. Sai et al. further report that even metrics that score highly are not entirely robust to simple perturbations, calling for a more widespread use of this type of analysis.
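A perturbation test of this kind can be set up in a few lines; the sketch below applies two of the transformations just mentioned (word-order jumbling and number changes) to a candidate and reports how much a metric's score drops. Any function with the signature metric(candidate, reference) can be substituted; the bag-of-words toy metric used here is, by construction, blind to word order, which is exactly the kind of blind spot such tests expose.

    import random

    def shuffle_words(text, seed=0):
        tokens = text.split()
        random.Random(seed).shuffle(tokens)
        return " ".join(tokens)

    def change_numbers(text):
        return "".join(str((int(c) + 1) % 10) if c.isdigit() else c for c in text)

    def unigram_f1(candidate, reference):
        c, r = set(candidate.split()), set(reference.split())
        return 2 * len(c & r) / max(len(c) + len(r), 1)

    def perturbation_report(metric, candidate, reference):
        """Score drop under each perturbation; a robust metric should penalize both."""
        base = metric(candidate, reference)
        return {
            "base": base,
            "shuffled_drop": base - metric(shuffle_words(candidate), reference),
            "numbers_changed_drop": base - metric(change_numbers(candidate), reference),
        }

    text = "the company hired 120 people in 2020"
    print(perturbation_report(unigram_f1, text, text))  # shuffled_drop is 0: a failed test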
Aside from lexical overlap, another aspect of text that has been shown to confound metrics is length. During the DUC summarization tasks, systems were restricted to a strict number of output bytes and thus were compared at a given length. This is no longer the case in modern datasets, but Sun et al. (2019) show that this can have dire consequences. Specifically, up to a certain length, one can "cheat" ROUGE scores by simply generating longer outputs. Even when the longer outputs are qualitatively worse, scores increase.
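The length effect is easy to reproduce in miniature with a recall-oriented toy metric: appending text to a summary can only add matched content, so a padded, qualitatively worse output never scores lower on recall, and, as Sun et al. (2019) show, even F1-based ROUGE can be gamed this way up to a certain length. An illustrative sketch:

    def unigram_recall(candidate, reference):
        cand, ref = set(candidate.split()), set(reference.split())
        return len(cand & ref) / max(len(ref), 1)

    reference = "profits rose sharply after the merger was approved by regulators"
    short = "profits rose after the merger"
    padded = short + " and the weather in the capital was mild on tuesday regulators said"

    print(unigram_recall(short, reference))    # lower
    print(unigram_recall(padded, reference))   # higher, although the added text is mostly irrelevant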
Impact of the Systems' Quality
As models improve, so should metrics. Yet, many metrics are tuned or benchmarked using previously published system outputs, which cannot be representative of the current and future state-of-the-art. As a result of this, Peyrard (2019) finds that summarization metrics with previously reported high correlations with humans disagree with one another when tasked to compare high-quality summaries, revealing their fragility. Bhandari et al. (2020a) revisit this conclusion, demonstrating that metrics disagree whenever the quality range is narrow, regardless of whether the summaries are good or bad. Bhandari et al. (2020b) also highlight that previously published studies of metrics would yield different conclusions with more recent datasets and top scoring systems, and that the relative performance of metrics varies a lot across datasets. These studies show that it is still unclear how metrics generalize across time, systems, and datasets, and that evaluating these qualities is complicated due to the cost of collecting human annotations, the low diversity of existing datasets, and the impossibility of accessing future systems.

3.5 Takeaways for Metric Developers

Since BLEU was introduced, dozens of papers have shown that automatic metrics have poor correlations with human judgments of quality (in addition to those cited above, see, e.g., Callison-Burch et al. (2006)). We challenge the premise that such a correlation would be desirable, because quality is a vastly under-defined property. Instead, we make the case for multi-dimensional evaluation. This is already common in human evaluations; researchers often collect evaluations for several aspects of a generated text's quality (e.g., in MT, rating both the fluency and adequacy of a translated text). Since a single number cannot give an accurate depiction of a system's performance, we call for the development of metrics with smaller, but better-defined scopes.

Another aspect that does require more attention is robustness. Meta-evaluation studies have shown that metrics can behave vastly differently on different datasets and when tasked to evaluate different NLG systems. Furthermore, multiple studies demonstrate that automatic metrics easily break when the input is subject to simple perturbations. This shows that there is major headroom for improvement: the metrics should be narrower in the phenomenon they try to capture, but broader in the input domain on which they perform well.

Given the results reported on existing benchmarks, we support the view that human evaluation remains an essential component of performance analysis, complementary to automatic metrics. In addition, collected annotations, especially non-English ones, may be used to train future metrics, feeding the positive feedback loop that ties metrics, models, and human evaluation.

4 Challenges of Human Evaluation

The work presented in the previous section concludes that human evaluation is a necessary component of model evaluations since we cannot trust automatic metrics. This conclusion is reached by treating human evaluation annotations as the ground truth to which automatic metrics are compared, and human annotations are also used as training corpora for automatic metrics. We thus rely on human evaluations and often treat them as a panacea that reveals the ultimate truth about NLG system performance. Yet there are deep-running issues with how human evaluations are conducted, which affect these system analyses, metric evaluations, and newly developed metrics.

4.1 What is Measured?

While some work asks evaluators to rate the overall quality of generated text, it is more common to collect evaluations for specific dimensions of text quality. However, there is little consensus on which dimensions to evaluate.

In the human evaluations analyzed in Howcroft et al. (2020)'s study of 165 NLG papers, generated text was evaluated along 204 dimensions of quality, which they mapped to 71 distinct criteria. Some of these criteria are hierarchical, e.g., grammaticality and spelling fall under the more general correctness of surface form criterion. There are also cases where researchers apply the same text quality dimension differently. For example, Howcroft et al. (2020) found that what researchers called fluency could actually be divided into 15 different criteria, depending on how the term was defined and used in the context of the task.

The disparities in how text quality dimensions are applied and defined in human evaluations complicate comparisons across efforts and benchmarking improvements over previous work. This problem is exacerbated by the lack of human evaluation details in NLG papers. Of the 478 quality evaluation questions studied by Howcroft et al. (2020), over 50% did not define the criterion they were evaluating for (279 out of 478), 65% did not report the exact question they gave the evaluators (311/478), and 20% did not even name the criterion being evaluated (98/478). To promote more standardized human evaluations, some researchers have proposed detailed definitions and methodologies for human evaluation for a specific task and/or dimension of text quality. For example, Thomson and Reiter (2020) propose a methodology for evaluating accuracy for data-to-text generation tasks, and Rashkin et al. (2021) define a framework for evaluating whether generated text is attributable to identified sources.
standardized human evaluations, some researchers have proposed detailed definitions and methodologies for human evaluation for a specific task and/or dimension of text quality. For example, Thomson and Reiter (2020) propose a methodology for evaluating accuracy for data-to-text generation tasks, and Rashkin et al. (2021) define a framework for evaluating whether generated text is attributable to identified sources.

While general or vague evaluation criteria can lower reproducibility and lead to low agreement between evaluators, well-specified human evaluation comes at a cost. For example, the human evaluation protocol used in the accuracy shared task at INLG 2021 (Reiter and Thomson, 2020; Thomson and Reiter, 2020) produced high inter-annotator agreement, but Thomson and Reiter (2021) reported that each 300-word text took an annotator 20-30 minutes to evaluate and that the annotation cost for a single generated text was about US$30. However, this detailed human evaluation protocol captured error categories that the automatic metrics were unable to detect.

4.2   How is it Measured?

Previous work indicates that the way questions are framed, the types of text that are being evaluated, and the measurement instruments can all affect the results of human evaluations. Schoch et al. (2020) discuss the role cognitive biases can play in the way researchers elicit human evaluations, such as using positive or negative framing (e.g., How much more fluent is sentence A vs. sentence B?), including text artifacts or study design details that reveal the researchers' hypothesis, and framing instructions and questions around a model's known strengths and weaknesses. Choi and Pak (2005) provide a longer catalogue covering 48 of these biases. However, if researchers do not report the details of their studies, no one can judge whether any of these biases would apply; surveys of NLG papers find that as few as 35% (Howcroft et al., 2020) and 16% (Schoch et al., 2020) of papers share the questions used in their human evaluations.

Aspects of the texts themselves may also unduly affect the evaluators' judgments. For example, Sun et al. (2019) find that several dimensions of summary quality (e.g., informativeness) are correlated with the summary's length and thus suggest normalizing for summary length when evaluating these criteria. Bhandari et al. (2020b) find that the relative quality of the generation models also makes a difference, showing significant differences between older annotations and newly collected human judgments for better models.13 They show that automatic metrics trained on annotations of text generated from older models do not always perform as well when evaluating state-of-the-art generated text. Another confounder, which we point out in section 3, is the correlation between dimensions that should not be correlated. Dusek et al. (2020) demonstrate that the correlation can be avoided by running different annotation tasks in parallel, but this leads to a much higher cost to the evaluators.

13 However, this finding may be confounded by the collection approach as well (Shapira et al., 2019).

Measurement instruments. van der Lee et al. (2021) find that Likert scales were the most popular method for rating generated text, used in 56% of studies (82/147). However, Belz and Kow (2010) argue that rating scales like those used in direct assessments (i.e., evaluating a generated text alone, without referencing other candidates) have many issues: they are unintuitive, agreement numbers are low, and most statistical measures are inappropriate for ordinal data. They find that these issues can be addressed to some extent by switching to preferential judgments. Kiritchenko and Mohammad (2017) demonstrate that best-worst scaling (asking evaluators to choose the best and the worst items in a set) is an efficient and reliable method for collecting annotations, and this approach has been used to collect comparative evaluations of generated text (e.g., Liu and Lapata, 2019; Amplayo et al., 2021).
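The scoring step behind best-worst scaling is simple enough to sketch. The snippet below is only an illustration with made-up judgments (it is not code from any of the cited studies); it uses the common counting score, i.e., the share of times an item is chosen as best minus the share of times it is chosen as worst.

    from collections import Counter

    # Hypothetical best-worst judgments: each annotator saw a tuple of system
    # outputs (identified here by system name) and picked the best and the worst.
    judgments = [
        {"shown": ("A", "B", "C", "D"), "best": "A", "worst": "C"},
        {"shown": ("A", "B", "C", "D"), "best": "B", "worst": "C"},
        {"shown": ("A", "B", "C", "D"), "best": "A", "worst": "D"},
    ]

    appearances, best_counts, worst_counts = Counter(), Counter(), Counter()
    for judgment in judgments:
        appearances.update(judgment["shown"])
        best_counts[judgment["best"]] += 1
        worst_counts[judgment["worst"]] += 1

    # Counting-style best-worst score in [-1, 1]: fraction of appearances
    # as "best" minus fraction of appearances as "worst".
    scores = {
        item: (best_counts[item] - worst_counts[item]) / appearances[item]
        for item in appearances
    }
    print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))

Because every judgment is a forced comparison within a tuple, the resulting scores are comparative by construction, which is exactly the property that direct assessments lack.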
Belz and Kow (2011) further compare continuous and discrete rating scales and find that both lead to similar results, but that raters preferred continuous scales, consistent with prior findings (Svensson, 2000).14 Contrary to these findings, Bojar et al. (2016) and Novikova et al. (2018) compare direct assessments and relative rankings and find that the rankings produced were very similar, but Novikova et al. conclude that relative rankings are best when combined with magnitude estimates. They also find that collecting judgments in separate tasks decorrelates different evaluation criteria, albeit at a higher cost since multiple tasks have to be run.

14 One potential caveat is that these studies were conducted before the wide availability of crowdsourcing platforms and thus relied on small cohorts of raters who had a different motivation.
4.3   Statistical Significance

Human evaluations present yet another issue: how to measure the significance of human evaluation results? van der Lee et al. (2021)'s survey finds that only 23% of NLG papers report statistical analyses to determine the significance of their results, and only 13% explicitly state their hypotheses.

One challenge when testing for significance in human evaluation results is small sample sizes; given that the median number of generated texts in a human evaluation is 100 items (van der Lee et al., 2021), most typical experimental designs for human rating studies will be underpowered to detect small model differences. This problem is not specific to NLG: Card et al. (2020) analyze popular NLP datasets and find that they are not adequately powered (e.g., a typical MT test set of 2,000 sentences would have approximately 75% power to detect differences of 1 BLEU point). Howcroft and Rieser (2021) demonstrate that treating ordinal data as interval data makes tests even more underpowered, which is what most papers do when analyzing rating and Likert scales (68 out of 85 recent papers, according to Amidei et al. (2019b)). Moreover, significance thresholds are not always adjusted (e.g., with a Bonferroni correction) when running multiple significance tests, increasing the likelihood of false positives (van der Lee et al., 2019).
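Both of these pitfalls are inexpensive to avoid. As a minimal sketch (with hypothetical ratings, and assuming scipy and statsmodels are available), one can use a rank-based test rather than a t-test on raw Likert scores and adjust the resulting p-values for the number of criteria tested:

    from scipy.stats import mannwhitneyu
    from statsmodels.stats.multitest import multipletests

    # Hypothetical 1-5 Likert ratings for two systems on three criteria.
    ratings_a = {
        "fluency":      [5, 4, 4, 5, 3, 4, 5, 4, 4, 5],
        "coherence":    [4, 3, 4, 4, 3, 5, 4, 3, 4, 4],
        "faithfulness": [3, 3, 4, 2, 3, 4, 3, 3, 2, 3],
    }
    ratings_b = {
        "fluency":      [4, 4, 3, 4, 3, 4, 4, 3, 4, 3],
        "coherence":    [3, 3, 4, 3, 3, 4, 3, 3, 3, 4],
        "faithfulness": [3, 2, 3, 2, 3, 3, 2, 3, 2, 2],
    }

    # Mann-Whitney U compares rank distributions, so it does not pretend that
    # the gap between "4" and "5" equals the gap between "1" and "2".
    # (For paired ratings of the same items, scipy.stats.wilcoxon is the analogue.)
    p_values = [mannwhitneyu(ratings_a[c], ratings_b[c]).pvalue for c in ratings_a]

    # One test per criterion means multiple comparisons; adjust the threshold.
    reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
    for criterion, p, significant in zip(ratings_a, p_adjusted, reject):
        print(f"{criterion}: adjusted p = {p:.3f}, reject H0 = {significant}")

The Bonferroni adjustment is conservative; procedures such as Holm's method are uniformly more powerful while still controlling the family-wise error rate.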
Improvements in NLG models also make detecting statistically significant differences more challenging. Text generated by high-quality models may differ less often or in more subtle ways, which requires more human judgments to detect. Wei and Jia (2021) show that the requirement for more judgments can quickly become prohibitive: to detect a difference of 1 point on a 1-100 scale in WMT, we need 10,000 perfect annotator judgments. As a result, they suggest that automatic metrics may actually be more reliable than human annotations if the annotations are insufficiently powered. The number of required annotations can potentially be decreased by not uniformly sampling examples to annotate and instead biasing the sampling toward those where models differ. However, this process can lead to artificially high correlation of the results with automatic metrics, which could overstate their effectiveness and the quality of the human annotations (Deutsch et al., 2021b). Moreover, since NLG models may only differ on very few examples, statistical analyses should also handle ties, as discussed by Dras (2015) for pairwise rankings.
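A back-of-the-envelope power analysis illustrates how quickly the required number of judgments grows as model differences shrink. The sketch below uses statsmodels and generic Cohen's d effect sizes; the numbers are illustrative only and are not the WMT-specific estimates from Wei and Jia (2021).

    from statsmodels.stats.power import TTestIndPower

    # How many ratings per system does an 80%-powered, two-sided test at
    # alpha = 0.05 need as the true difference between systems shrinks?
    power_analysis = TTestIndPower()
    for effect_size in (0.5, 0.2, 0.1, 0.05):  # Cohen's d, from medium to tiny
        n = power_analysis.solve_power(effect_size=effect_size, alpha=0.05, power=0.8)
        print(f"d = {effect_size}: ~{round(n):,} ratings per system")

    # The required sample size scales with 1/d^2, so halving the detectable
    # difference roughly quadruples the number of judgments needed.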
Aside from the parameters of the study, there are also confounding factors in the evaluation of the annotation quality itself. To demonstrate that the annotations are of sufficient quality, reporting inter-annotator agreement is the most common method. However, Amidei et al. (2019a) survey 10 years of annotation agreement measures and show that almost all studies fail reliability tests. They argue that a substantial amount of the variability cannot and should not be eliminated, since evaluation of generated text is intrinsically subjective and relies on many different factors including rater experience, motivation, knowledge, or education. As a remedy, they suggest using additional correlation measures alongside kappa statistics.
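Reporting such complementary statistics adds little overhead. The sketch below pairs a weighted kappa with a rank correlation for two hypothetical raters (using scikit-learn and scipy); Krippendorff's alpha would be an equally reasonable agreement coefficient here.

    from scipy.stats import spearmanr
    from sklearn.metrics import cohen_kappa_score

    # Hypothetical 1-5 ratings from two annotators on the same ten outputs.
    rater_1 = [5, 4, 4, 3, 2, 4, 5, 3, 4, 2]
    rater_2 = [4, 4, 5, 3, 2, 3, 5, 2, 4, 3]

    # Quadratic weights penalize a 5-vs-1 disagreement more heavily than a
    # 5-vs-4 one, which better matches the ordinal nature of the scale.
    kappa = cohen_kappa_score(rater_1, rater_2, weights="quadratic")
    rho, p_value = spearmanr(rater_1, rater_2)

    print(f"weighted kappa = {kappa:.2f}")
    print(f"Spearman rho   = {rho:.2f} (p = {p_value:.3f})")
    # A modest kappa next to a high correlation suggests a systematic offset
    # between raters (one rater is stricter), not random disagreement.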
4.4   Who is Measuring?

In many human evaluations, a small number of evaluators judge the generated text: 39% of papers in van der Lee et al. (2021)'s survey use between one and five evaluators. However, it is becoming increasingly common to collect judgments from a large number of evaluators using crowdsourcing platforms like Amazon Mechanical Turk (MTurk), Appen, Prolific Academic, and Upwork.

In particular, MTurk has a long history in NLP, with early claims stating that a small number of crowdworkers can replace a single expert rater (Snow et al., 2008). Similar claims were made in other communities, stating that, while individual judgments are not as high-quality, overall data quality can actually be improved by collecting more redundant annotations (Sheng et al., 2008). However, later studies find that this point is more nuanced: some dimensions of text quality may be easier than others to rate with crowdsourced evaluators instead of experts. Gillick and Liu (2010) find that MTurk judges were better at measuring generated summaries' linguistic quality than their content or overall quality and had a much higher correlation between linguistic and overall quality than experts. Clark et al. (2021) find that MTurk evaluators are more likely to base judgments of generated text on the text's form rather than its content. In their work on German summarization evaluation, Iskender et al. (2020) find that non-redundancy and usefulness are very hard to assess using crowdworkers and suggest that experts should be used for them, while crowdworkers are suitable for other dimensions of text quality as long as results are carefully interpreted.
Analyzing DUC annotations between 2001 and 2004, Harman and Over (2004) find that averaged human ratings can yield meaningful insights, but also note that there is very high variance both within and between human raters and that it is unclear whether the source of the variance is intrinsic to the humans or to the models. This variance may be even higher in crowdsourcing scenarios compared to expert raters. Karpinska et al. (2021) report that the outcomes of running the same MTurk evaluation on different days of the week can vary enough to produce different results. When analyzing evaluations of MT systems, Freitag et al. (2021a) find that agreement between ratings produced by linguists and those from crowdworkers can be extremely low; in fact, they find that automatic metrics can have higher agreement with high-quality annotations than human crowdworkers do. Some tasks, like multi-document summarization, are especially challenging and time-consuming for people to evaluate. Observations like these have led to work proposing evaluation methods that combine the advantages of human and automatic evaluation (e.g., Hashimoto et al., 2019; Zhang and Bansal, 2021).

The increasing quality of generated text has led some researchers to move away from crowdsourcing platforms. For example, expert evaluators like English teachers (Karpinska et al., 2021) or trained, in-person evaluators (Ippolito et al., 2020) were needed to distinguish between human-authored text and text generated by today's generation models (an evaluation most commonly found in dialogue generation). Similarly, Freitag et al. (2021a) demonstrate that non-expert annotations often lead to mistaken claims of super-human model performance, when expert annotators correctly identify issues in the generated texts.

It is unclear whether these issues are specific to the fact that non-expert annotators are being used, or whether they may be overcome by improving the quality of the study and the working conditions of raters. Investigating the use of MTurk for NLP, Huynh et al. (2021) find that about 25% of studies have technical issues, 28% have flawed, vague, or insufficient instructions, and 26% of study creators were rated as having poor communication. Notably, they also find that 35% of requesters pay poorly or very badly according to MTurk raters. To that end, many have questioned whether the treatment evaluators receive and the structure of crowdsourcing platforms provide ethical working conditions for evaluators. The most basic of these considerations is payment: does the low-pay, small-batch format of crowdsourcing actually provide evaluators with a fair wage? Fort et al. (2011) discuss the low wages MTurk workers receive, along with concerns about the data quality issues that the platform incentivizes. These concerns are not unique to MTurk; Schmidt (2013) argues that there are ethical concerns across crowdsourcing platforms, regardless of how they incentivize workers. Shmueli et al. (2021) cover a broader set of ethical considerations for crowdsourcing work, including potential psychological harms, exposing sensitive information about workers, and breaching workers' anonymity. Despite these concerns, Shmueli et al. report that only 14 out of 703 NLP papers that used crowdsourcing mention IRB review.

4.5   Subjectivity and User Satisfaction

Most of the human evaluations in this section are intrinsic evaluations, asking evaluators to rate the quality of the generated text. However, the more valuable question is answered with extrinsic evaluation: how well does the generated text serve its intended purpose? These evaluations measure how useful a text generation model is and indicate whether real-world users would be satisfied with the generated texts. Evaluations focused on intrinsic qualities of the text fail to capture dimensions of NLG systems that practitioners care about, e.g., how trustworthy a generated text is or how well it performs in human-in-the-loop settings.15

15 See, for example, Ehud Reiter's summary of a panel on NLG in industry at INLG 2021.

Another related aspect that is rarely considered in human evaluations is the subjectivity of text evaluation. People may value certain text qualities more highly than others or be working from a different point of reference. Even the more "objective" aspects of text quality, like grammatical correctness, may depend on the evaluators' dialect, the perceived formality of the text, the context or style of the generated text, etc. Disagreement in evaluators' ratings does not always indicate evaluator error; rather, it may be a signal that there is more complexity to the text or dimension of quality. While it has been shown that increasing the number of annotations per example can decrease the overall bias (Artstein and Poesio, 2009), this finding assumes that the population of annotators is somehow representative of the whole world.
Prabhakaran et al. (2021) find that aggregating annotator responses results in under-representation of groups of annotators' opinions, and they recommend releasing annotator-level annotations and collecting annotators' socio-demographic information to prevent the exclusion of minority perspectives. We should thus be careful of results such as those that suggest excluding data with low agreement scores with other annotators (Owczarzak et al., 2012), unless we know that the source of the disagreement is not subjectivity. Even well-established NLG tasks have aspects of subjectivity that are usually ignored. For example, the goal of a summarization task is to generate the important points from a document, but Kryscinski et al. (2019) find that when annotators select which sentences in a document are the most important to include in a summary, the majority of evaluators agree on an average of only 0.6 sentences per document.

While the majority of evaluation criteria are by definition subjective, there is an opportunity for hybrid approaches with the help of standardized measures (van der Lee et al., 2021). One such dimension that could be useful for tasks like simplification is the readability of text, which could be measured using scales such as the ones proposed by Kincaid et al. (1975) or Ambati et al. (2016). van der Lee et al. point out that the relationship between these objective measures and subjective readability assessments is not currently being studied, although a strong objective measure could lead to a higher degree of standardization. Similarly, one can imagine human-in-the-loop approaches for measuring faithfulness that focus on claims that are challenging to verify using only automatic approaches, enabling the collection of a much larger quantity of judgments.
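As a concrete example of such a standardized measure, the grade-level formula of Kincaid et al. (1975) requires only sentence, word, and syllable counts. The sketch below is a rough, self-contained approximation: the syllable counter is a crude vowel-group heuristic, and dedicated libraries implement more careful versions.

    import re

    def count_syllables(word: str) -> int:
        # Crude heuristic: count groups of consecutive vowels. Dedicated tools
        # use dictionaries and better rules; treat this as an approximation.
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def flesch_kincaid_grade(text: str) -> float:
        sentences = max(1, len(re.findall(r"[.!?]+", text)))
        words = re.findall(r"[A-Za-z']+", text)
        syllables = sum(count_syllables(word) for word in words)
        # Kincaid et al. (1975): 0.39 * words-per-sentence
        #                        + 11.8 * syllables-per-word - 15.59
        return 0.39 * len(words) / sentences + 11.8 * syllables / len(words) - 15.59

    print(flesch_kincaid_grade("The cat sat on the mat. It was warm."))
    print(flesch_kincaid_grade(
        "Notwithstanding considerable methodological heterogeneity, the "
        "evaluation protocols demonstrated substantial convergence."
    ))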
5   Challenges with Datasets

A component mostly kept apart from evaluation analyses is the data, even though NLG tasks are embodied through datasets; for example, claims about performance on CNN/DM may be used as a proxy for performance on all summarization tasks. Issues with datasets are widely studied in the general machine learning literature, which we heavily draw on in this section, with anecdotal evidence for NLG tasks when available. In a recent survey of datasets and benchmarks in machine learning, Liao et al. (2021) point out that the lack of differentiation between tasks and the datasets that aim to capture them can lead to harmful over-generalization. They argue that choosing to evaluate on a dataset reinforces design decisions taken during its construction and focuses the evaluation on the specific distributions represented in the data.

Collectively, the research community could select for a more diverse language representation and decide to replace older, flawed datasets with newly developed ones. Unfortunately, the collective choices also reinforce suboptimal design decisions. Analyzing a sample of 20 papers that proposed summarization approaches in 2021, we find 27 datasets that models were being evaluated on. The most popular ones, CNN/DM and XSum (Narayan et al., 2018), were used five and four times respectively, despite their issues, which we explore in section 5.2. Additionally, only two of the 27 datasets were non-English, despite much recent work that introduces multilingual summarization corpora (Giannakopoulos et al., 2015; Scialom et al., 2020; Ladhak et al., 2020; Hasan et al., 2021; Perez-Beltrachini and Lapata, 2021).

These findings lead to three questions. First, how can we as a research field measure summarization improvements on disjoint datasets? Second, how can we claim that we are making progress if we only focus on a single language? And third, given the significant issues with popular benchmark datasets, what do improvements even mean? Throughout this section, we analyze typical design choices made during NLG data construction and how they influence the insights derived from evaluations.16

16 We point to Paullada et al. (2020) for a more in-depth survey of general issues in data creation, including those of benchmarking and data maintenance practices, to Bender et al. (2021) for a survey of issues with using large web-scraped datasets, and to Luccioni and Viviano (2021) and Dodge et al. (2021) for analyses of such large-scale web-scraped corpora and their representational, legal, consent, and PII issues.

5.1   Representation in Performance Numbers

Dataset creation is a value-laden process, yet those values are rarely made explicit (Hutchinson et al., 2021). The choices of dataset creators have significant impact, for example on who is represented in the data and on the language(s) of a dataset. Joshi et al. (2020) assess the language diversity in NLP, showing that very few languages beyond English are being studied, regardless of the number of their speakers. A similar argument can be made for dialects; focusing on African American Vernacular English (AAVE), Blodgett et al. (2020) describe multiple studies showing a drop in performance on popular NLU tasks when applied to