When Combating Hype, Proceed with Caution

                                            Samuel R. Bowman
                                            New York University
                                            bowman@nyu.edu

                     Abstract

In an effort to avoid reinforcing widespread hype about the capabilities of state-of-the-art language technology, researchers have developed practices in framing and citation that serve to deemphasize the field’s successes. Though well-meaning, these practices often yield misleading or even false claims about the limits of our best technology. This is a problem, and it may be more serious than it looks: It limits our ability to mitigate short-term harms from NLP deployments and it limits our ability to prepare for the potentially enormous impacts of more distant future advances. This paper urges researchers to be careful about these claims and suggests some research directions and communication strategies that will make it easier to avoid or rebut them.

Figure 1: Hype is a problem. The opposite of hype isn’t necessarily better.

1 Introduction

Over the last few years, natural language processing has seen a wave of surprising negative results overturning previously-reported success stories about what our models can do and showing that widely-used models are surprisingly brittle (Jia and Liang, 2017; Niven and Kao, 2019; McCoy et al., 2019). This shows that many of our standard practices for evaluation and reporting often lead to unrealistically positive initial claims about what we can do. The resulting hype and overclaiming, whether intentional or not, is a problem. Hype and overclaiming can encourage the reckless deployment of NLP systems in high-stakes settings where they can do significant harm. They also threaten the health and credibility of NLP as a research field, and thereby threaten our ability to influence applied stakeholders or attract funding.

Fortunately, these results have led to a surge of research and writing that proposes more thorough and cautious practices for the evaluation of model ability (Ribeiro et al., 2020; Gardner et al., 2020; Kiela et al., 2021; Bowman and Dahl, 2021). While we have only a limited ability to control the public narrative taking place through industry PR and the media, there’s reason to be hopeful that we researchers are getting much better at avoiding the worst forms of overconfidence about our systems.

Less fortunately, these discouraging results seem to have produced a trend toward pessimism about model performance that is often ungrounded from real empirical results. This leaves room for the research community’s consensus about a model’s capabilities to fall short of its actual capabilities.

This is more dangerous than it might seem. It risks our credibility and thereby limits our ability to influence stakeholders in cases where our current systems are doing real harm. It also limits our ability to accurately forecast and plan for the impacts that may result from the deployment of more capable systems in the future. If we can truly reach near-human-level performance on many of the core problems of NLP, we should expect these impacts to be enormous, and potentially catastrophic if not planned for.

I call this issue underclaiming, for lack of a better term.[1] In this paper, I lay out four types of underclaiming, focusing especially on writing and citation practices in NLP.
I then argue that this is a problem: It hurts the short-term interests of NLP researchers, it limits our ability to mitigate the current real-world harms of language technology deployments, and it limits our ability to adequately prepare for longer-term harms involving economic disruption or catastrophic alignment failures. I close by sketching some ways of reducing the prevalence of this kind of underclaiming, including straightforward best practices in writing and evaluation, a proposed new rule of thumb for writing and reviewing, improvements to tooling for analysis and benchmarking, and more ambitious research agendas in model performance forecasting and test set design.

[1] While overclaiming generally refers to overstating the effectiveness of one’s own methods or ideas, the phenomenon that I call underclaiming often involves downplaying the effectiveness of preexisting methods or ideas.

Figure 2: Jia and Liang (2017) remains widely cited according to Google Scholar. The original work pointed out major unexpected limitations in neural networks trained from scratch on the SQuAD reading comprehension task. However, many of these citing works make the groundless implication that modern pretrained systems—developed more recently than 2017—show these same limitations.

2 Underclaiming: Case Studies

This paper addresses the possibility that in our effort to avoid overconfidence, we have started to err in the opposite direction: Published work increasingly claims or implies that state-of-the-art systems are less capable than they actually are. This has taken several forms, including misleading presentations of valid negative results from weak or dated baseline models, misleading claims about the limits of what is conceptually possible with machine learning, and misleading reporting of results on adversarially collected data.

2.1 Negative Results with Weak or Dated Baselines

In spite of many surprises and setbacks, NLP research seems to have made genuine progress on many problems over the last few years. In light of this, discussions about the limitations of systems from past years don’t straightforwardly apply to present systems. The first two cases that I present involve failures to contextualize claims about the failures of weaker past systems:

Adversarial Examples for SQuAD  Jia and Liang (2017) published one of the first prominent demonstrations of adversarial failures in neural-network-based NLP, showing that a simple algorithm could automatically augment examples from the SQuAD benchmark (Rajpurkar et al., 2016) in a way that would fool many state-of-the-art systems, but not humans. This work prompted a wave of much-needed analysis and a corresponding lowering of expectations about the effectiveness of neural network methods.

However, citations to this work have since taken on a strange role in the literature. In particular, the results in this paper predate the development of modern pretraining methods in NLP (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2019). The best systems studied in Jia and Liang have more than twice the error rate of current state-of-the-art systems, and while I am not aware of any results from current state-of-the-art systems on this data, results from 2019 systems suggest that we are making substantial progress (Table 1). We have no reason to expect, then, that the failures documented in this work are quantitatively or qualitatively similar to the failures of current systems.

Model               Year   SQuAD   AS   AOS
ReasoNet Ensemble   2017      81   39    50
BERT-Base           2018      87   64    72
XLNet-Base          2019      89   69    77

Table 1: F1 results for the best-performing SQuAD model studied by Jia and Liang—ReasoNet (Shen et al., 2017)—and for the newer BERT and XLNet models (Devlin et al., 2019; Yang et al., 2019) reported by Zhou et al. (2020) on the original SQuAD development set and the two Jia and Liang adversarial evaluation sets. While I am not aware of results from more recent models on this data, progress through 2019 had already cut error rates in half.

However, papers that cite these results often present them with no discussion of the model under study, yielding misleading implications. For example, the award-winning work of Linzen (2020) uses the Jia and Liang result to justify this claim:
    [F]or current deep learning systems: when tested on cases sampled from a distribution that differs from the one they were trained on, their behavior is unpredictable and inconsistent with that of humans

The chief concern in this context is the claim that this failure applies to current deep learning systems in general, and the corresponding unjustified implication that these failures are a fundamental or defining feature of neural network language models. Looking only to highly-cited works from the last two years that cite Jia and Liang, similar statements can be found in Xu et al. (2020), Zhang et al. (2020), and others.

The Long Shadow of BERT  While the case of Jia and Liang is especially striking since it deals with models that predate pretraining entirely, a similar effect is much more common in a subtler form: Most analysis papers that identify limitations of a system come out well after the system description paper that claims the initial (typically positive) results. BERT, first released in Fall 2018, has been a major locus for this kind of analysis work, and continues to be a focus of such work long after its release. Looking to a random sample of ten papers from the NAACL 2021 analysis track that study existing systems,[2] none of them analyze models that have come out since summer 2019, and five only study BERT, representing a median lag of nearly three years from the release of a model to the publication of the relevant analysis.

[2] Papers studying only BERT: White et al. (2021); Slobodkin et al. (2021); Bian et al. (2021); Cao et al. (2021); Pezeshkpour et al. (2021). Papers studying other models predating Fall 2019: Wallace et al. (2021); Hall Maudslay and Cotterell (2021); Hollenstein et al. (2021); Bitton et al. (2021); Du et al. (2021).

This trend has consequences for the conclusions that one would draw from a broad review of the recent literature on some problem: A review of that literature will contrast the successes of the best current systems against the weaknesses of the best systems from an earlier period. In many cases, these weaknesses will be so severe as to challenge the credibility of these successes if they are not properly recognized as belonging to different model generations.

This analysis work is nonetheless often valuable and these long timelines are sometimes justifiable: Good analysis work takes time, and researchers doing analysis work often have an incentive to focus on older models to ensure that they can reproduce previously observed effects. Even so, this three-year lag makes it easy to seriously misjudge our progress.

The BERT-only results, though, represent a clear missed opportunity: There exist newer models like RoBERTa and DeBERTa (Liu et al., 2019; He et al., 2020) which follow nearly identical APIs and architectures to BERT, such that it should generally be possible to reuse any BERT-oriented analysis method on these newer models without modification. In many cases, these newer models are different enough in their performance that we should expect analyzing them to yield different conclusions: For example, BERT performs slightly worse than chance on the few-shot Winograd Schema commonsense reasoning test set in SuperGLUE (Levesque et al., 2011; Wang et al., 2019), while the newer DeBERTa reaches a near-perfect 96% accuracy.
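As a concrete, and purely hypothetical, illustration of how little adaptation this reuse typically requires, the sketch below runs the same representation-extraction step, the kind of building block many probing analyses start from, across BERT-era and newer checkpoints by changing only a model-name string. The checkpoint names and the toy probe are examples chosen for illustration, not a reference to any particular published analysis.

```python
# Minimal sketch of the interchangeable-API point, using the Hugging Face
# transformers Auto* classes. The checkpoint names below are illustrative.
import torch
from transformers import AutoModel, AutoTokenizer

SENTENCES = ["The trophy didn't fit in the suitcase because it was too big."]

for checkpoint in ["bert-base-uncased", "roberta-base", "microsoft/deberta-base"]:
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModel.from_pretrained(checkpoint)
    inputs = tokenizer(SENTENCES, return_tensors="pt", padding=True)
    with torch.no_grad():
        hidden_states = model(**inputs).last_hidden_state  # (batch, tokens, dim)
    # Any probe or diagnostic built on these representations (e.g., a linear
    # classifier over the first-token vector) can be reused unchanged here.
    sentence_reprs = hidden_states[:, 0]
    print(checkpoint, tuple(sentence_reprs.shape))
```

Since the loop body never mentions a specific architecture, swapping in a newer checkpoint is a one-line change, which is the sense in which the three-year analysis lag is a missed opportunity rather than a technical necessity.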
2.2 Strong Claims about Limits to Understanding

The influential work of Bender and Koller (2020) is centered on the argument that:

    [T]he language modeling task, because it only uses form as training data, cannot in principle lead to learning of meaning.

The proof of this claim is straightforward and convincing under some (but not all) mainstream definitions of the word meaning in the context of NLP: If meaning deals with the relationship between language and some external nonlinguistic reality, then a system that can only interact with the world through language cannot manipulate meanings.

This argument does not on its own make any prediction about the behavior of these models on tasks that take place entirely through the medium of language. On the definitions of meaning that must be assumed for this argument, tasks like summarization or translation can in principle be solved without reference to meaning: Under this definition, a translation model would be acting without reference to meaning even if it has a rich, structured internal model of the world, and it interprets sentences with reference to that model when translating: As long as that model of the world was developed solely using language, and does not otherwise ground out to perception or some other nonlinguistic modality, no meaning has been used. See Merrill et al. (2021) for some limits on how closely such a model can correspond to the real world and Bommasani et al. (2021, §2.6.3) for further discussion of the implications of Bender and Koller’s arguments for NLP.
In addition, this argument does not justify any strong prediction about the behavior of models which are trained primarily, but not exclusively, on a language modeling objective, as with models that are fine-tuned to produce non-textual outputs like labels, or models which are trained in a multimodal language-and-vision regime.

While this core claim is sound and important, public discussion of the paper has often repeated the claim in ways that imply stronger conclusions about model behavior. Utama et al. (2020), for example, write that:

    Researchers have recently studied more closely the success of large fine-tuned LMs in many NLU tasks and found that models are simply better in leveraging biased patterns instead of capturing a better notion of language understanding for the intended task (Bender and Koller, 2020).

In another vein, Jang and Lukasiewicz (2021) make the straightforward claim that

    Bender and Koller (2020) show that it is impossible to learn the meaning of language by only leveraging the form of sentences.

but they then use that claim to motivate a new regularization technique for language models, which does nothing to change the fact that they are trained on form alone. In this context, it is hard to avoid the inference that Bender and Koller must have shown something specific and contingent about recent language models. In fact, they make a more general point about the nature of meaning that applies equally to any purely form-based language model, no matter what objective it uses or how well it performs.

Similar claims can be found in many other citing works (Utama et al., 2020; van Noord et al., 2020; Hovy and Yang, 2021; Sayers et al., 2021; Peti-Stantić et al., 2021; Jang and Lukasiewicz, 2021). While Bender and Koller raise important points for discussion, these strong inferences in citing works are misleading and potentially harmful.

2.3 Adversarially Collected Data

Adversarially collected test sets (Bartolo et al., 2020; Nie et al., 2020; Kiela et al., 2021)—or test sets composed of examples that some target system gets wrong—have recently become a popular tool in the evaluation of NLP systems. Datasets of this kind are crowdsourced in a setting where an example-writer can interact with a model (or ensemble) in real time and is asked to explore possible ways of writing examples until they find an example on which the model fails. Writers are generally paid or rewarded only for these failure cases, and the test section(s) of the final released dataset will generally consist exclusively of these failure cases.

This process produces difficult test sets, and it can be a useful tool in understanding the limits of existing training sets or models (Williams et al., 2020). However, the constraint that a specified system must fail on the test examples makes it difficult to infer much from absolute measures of test-set performance. In particular, as long as a model makes any errors at all on some possible inputs, then we expect it to be possible to construct an adversarial test set against it, and we expect it to achieve zero test accuracy on that test set. We can further infer that any models that are sufficiently similar to the adversary should also perform very poorly on this test set, regardless of their ability. Neither of these observations would tell us anything non-trivial about the actual abilities of the models.
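To make that selection effect concrete, the toy sketch below simulates the filtering step in purely hypothetical code: the “model” and the randomly sampled “writer” proposals are invented stand-ins, not the crowdsourcing setup used in any of the cited collection efforts. Even though the stand-in model is right on the large majority of randomly sampled inputs, it scores exactly 0% on the set built from its own failures.

```python
"""Toy illustration of the selection effect in adversarially collected test sets.
Hypothetical sketch only: the 'writer' is simulated by random sampling and the
'model' is a trivial rule-based classifier."""
import random

random.seed(0)

def target_model(x: int) -> int:
    """Stand-in adversary model: predicts whether x is odd, but is wrong
    whenever x is a multiple of 7, giving it roughly 86% accuracy overall."""
    if x % 7 == 0:
        return (x + 1) % 2  # systematically wrong on these inputs
    return x % 2

def true_label(x: int) -> int:
    return x % 2

def collect_adversarial_test_set(n_examples: int):
    """Keep proposing candidate inputs until the target model fails on them;
    only those failure cases enter the released test set."""
    test_set = []
    while len(test_set) < n_examples:
        candidate = random.randint(0, 10_000)  # a 'writer' proposes an example
        if target_model(candidate) != true_label(candidate):
            test_set.append((candidate, true_label(candidate)))
    return test_set

adversarial_set = collect_adversarial_test_set(100)
accuracy = sum(target_model(x) == y for x, y in adversarial_set) / len(adversarial_set)
print(f"Adversary accuracy on its own adversarial test set: {accuracy:.0%}")  # 0%
```

The 0% figure is guaranteed by the collection procedure itself, so it says nothing about how capable the stand-in model is on naturally sampled inputs, which is exactly the interpretive caveat raised above.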
What’s more, in many NLU data collection efforts, a significant fraction of cases of annotator disagreement represents subjective judgment calls, rather than annotator errors (Pavlick and Kwiatkowski, 2019). This means that even a perfectly careful and perfectly well-qualified human annotator should be expected to disagree with the majority judgment on some examples, and will thereby be coded as having made errors. It is therefore possible to create an adversarial test set for which our perfect human annotator would reliably achieve 0% accuracy.

Adversarially collected datasets are increasingly commonly used as test sets in standard experimental paradigms, and these caveats about the interpretation of results are not always clear when numbers are presented.
I focus on Nie et al. (2020) as a case study: Sampling papers that cite this work, it is easy to find references that do not mention the adversarial design of the data and that make claims that are hard to justify:[3] Talmor et al. (2020) use the results from Nie et al. to claim that “LMs do not take into account the presence of negation in sentences”, and Hidey et al. (2020) use them to justify the claim that “examples for numerical reasoning and lexical inference have been shown to be difficult.” Liu et al. (2020) similarly use absolute results on the adversary models to back up the trivial but easily-misread claim that BERT-style models “may still suffer catastrophic failures in adversarial scenarios.” McClelland et al. (2019) use absolute results on Nie et al.’s ANLI data to make similar claims about the limitations of GPT-3 (Brown et al., 2020).

[3] I focus here on claims about the absolute performance level of models. Whether adversarially collected test sets are appropriate for comparing the relative effectiveness of models is a largely orthogonal issue (Bowman and Dahl, 2021; Kaushik et al., 2021).

3 A Word on Hype

The previous section has laid out some ways in which the mainstream NLP research community often makes unjustifiable claims about the limitations of state-of-the-art methods. These claims do not make the opposite phenomenon, hype, any less real or any less harmful.

While hype is likely most severe in industry PR and in the media,[4] it is nonetheless still prevalent in the research literature. In one especially clear example, a prominent paper claiming human parity in machine translation performance (Hassan et al., 2018) severely overstates what has been accomplished relative to commonsense intuitions about what a human-level translation system would do (Toral et al., 2018; Läubli et al., 2018; Zhang and Toral, 2019; Graham et al., 2020).

[4] Consider the 2017 Huffington Post headline “Facebook Shuts Down AI Robot After It Creates Its Own Language.”

I do not aim to argue that overclaiming or hype is acceptable or safe. However, unlike the underclaiming that this paper focuses on, hype is receiving substantial attention, and we have some established tools and norms within peer review to combat it. Finally, combating hype should be fully compatible with the goals laid out in this paper, and broad-based efforts to improve our practices in evaluation, analysis, writing, and forecasting should help reduce both underclaiming and hype.

4 Why Underclaiming is Harmful

Research papers are generally most useful when they’re true and informative, and a research field that allows misleading claims to go unchallenged is likely to lose credibility with serious funders, reporters, or industry stakeholders. This is the most obvious reason that we should be concerned about underclaiming, but it is not the whole story.

This loss of insight and loss of credibility can seriously challenge our ability to anticipate, understand, and manage the impacts of deploying NLP systems. This is especially true of impacts that are contingent on NLP technologies actually working well, which we should expect will become more substantial as time goes on.

4.1 Present-Day Impact Mitigation

The deployment of modern NLP systems has had significant positive and negative impacts on the world. Technical researchers in NLP have an ethical obligation to inform (and if necessary, pressure) stakeholders about how to avoid or mitigate the negative impacts while realizing the positive ones.

Most prominently, most applied NLP models show serious biases with respect to legally protected attributes like race and gender (Bolukbasi et al., 2016; Rudinger et al., 2018; Nangia et al., 2020; Bender et al., 2021). We have no reliable mechanisms to mitigate these biases and no reason to believe that they will be satisfactorily resolved with larger scale. Worse, it is not clear that even superhuman levels of fairness on some measures would be satisfactory: Fairness norms can conflict with one another, and in some cases, a machine decision-maker will be given more trust and deference than a human decision-maker would in the same situation (see, e.g., Rudin et al., 2020; Fazelpour and Lipton, 2020). We are standing on shaky moral and political grounds when we deploy present systems in high-impact settings, but they are being widely deployed anyway (e.g. Dastin, 2018; Nayak, 2019; Dansby et al., 2020). Beyond bias, similar present-day concerns can be seen around issues involving minority languages and dialects, deceptive design, and the concentration of power (Joshi et al., 2020; Bender et al., 2021; Kenton et al., 2021, §3.3).

Persuading the operators of deployed systems to take these issues seriously, and to mitigate harms or scale back deployments when necessary, will be difficult.
Intuitively, researchers concerned about these harms may find it appealing to emphasize the limitations of models in the hope that this will discourage the deployment of harmful systems. This is appropriate when done carefully, and could be a way in which the field’s increasing focus on model analysis could reduce some of these harms.

However, this kind of strategic underclaiming can backfire if done sloppily: Models are often both useful and harmful, especially when the operator of the system is not the one being harmed. If the operator of some deployed system sees firsthand that the system is effective for their purposes, they have little reason to trust researchers who argue that that same system does not understand language, or who argue something similarly broad and negative. They will then be unlikely to listen to those researchers’ further claims that such a system is harmful, even if those further claims are accurate.

4.2 Longer-Term Risks

We can reasonably expect NLP systems to improve over the coming decades. Even if intellectual progress from research were to slow, the dropping price of compute should allow us to continue to reap the benefits of larger-scale training (Kaplan et al., 2020; Brown et al., 2020). This improvement in capabilities is likely to amplify both the harms and benefits of language technology.

We have good reason to expect that this further progress in NLP will lead to upheavals in areas like education, medicine, law, and the service sector more broadly, as well as making mass surveillance and misinformation campaigns far more effective, and opening up additional new use cases that will be hard for us to foresee (Brundage et al., 2018; Tamkin et al., 2021; Bommasani et al., 2021). One can reasonably expect that the positive and negative impacts of these upheavals will far exceed the impacts that our technologies have produced to date. In turn, NLP researchers who want to ensure that their career has a net-positive impact on the world should be concerned with these longer-term developments.

It will be difficult to do the necessary technical, social, and governance work to prepare for these advances if we do not have a clear picture of our current capabilities, and it will be difficult to convince outside stakeholders to act appropriately to mitigate these risks if we don’t acknowledge that we have made, and are making, real progress toward effective language technology.

Looking somewhat further into the future, a substantial community of philosophers, economists, and general ML researchers are concerned that highly-capable AI systems, of the kind that could plausibly be developed through existing ML research paradigms, are catastrophically dangerous by default (Bostrom, 2012; Critch and Krueger, 2020; Christian, 2020; Ord, 2020; Russell and Norvig, 2020). Expert forecasts suggest that this could take place within a few decades (Grace et al., 2018). If these hypotheses hold, and if we are poorly prepared for these developments, the worst-case outcomes could be catastrophic, even threatening the continued existence of human civilization on some views.

Investments in research into these potential catastrophic risks from advanced machine learning systems have become substantial: Funding from one charitable foundation alone has totaled over $75M USD.[5] Concerns about risks from AI have also been a major stated motivation for a significant fraction of the work from DeepMind and OpenAI, which both have access to even greater amounts of private funding.

[5] https://www.openphilanthropy.org/giving/grants

Spurred on in particular by the shift in emergent capabilities from GPT-2 to GPT-3, the attention of these AI risk researchers has also been increasingly centered on language models and similar multimodal models (Irving et al., 2018; Stiennon et al., 2020; Hendrycks et al., 2020; Kenton et al., 2021; Wu et al., 2021; Bommasani et al., 2021, §4.9). Despite the scale of this research, and its recent shift of focus toward language models, there has been little interaction between the AI risk research community and mainstream academic NLP to date.

To the extent that these concerns are valid, they represent an urgent call for reprioritization within NLP research to favor safety-relevant areas like interpretability, control, and evaluation over scaling, and to push for better oversight and regulation of large-scale research (Dafoe, 2018): Even a small risk of a globally significant catastrophe warrants a dramatic response.

On the other hand, to the extent that these concerns are unfounded or are built on misunderstandings about the possible trajectories of ML research, it would be quite valuable to correct this misunderstanding.
Correcting the record could redirect these resources and, more significantly, reduce the risk of popular or regulatory pressure that limits large-scale ML research so severely or clumsily as to snuff out the positive potential of NLP technologies.

5 Why Worry?

Because these more speculative concerns about AI risk are rarely discussed in the NLP literature, I will here offer a brief overview of that work. Concerns around dangers from advanced artificial intelligence tend to focus on four clusters of hypotheses:

Powerful Unaccountable Organizations  Highly-capable AI is likely to lead to highly-profitable applications. Highly-capable AI should also be able to displace human labor in technical fields to a large extent, increasing the relative value of capital over labor, and making it easier for the leaders of profitable organizations to take unilateral unpopular actions. In the longer term, highly-capable AI may also contribute to the effectiveness of persuasion campaigns, further insulating these organizations from political or popular pressure. Combined with significant further technological progress, these forces could conspire to make the companies or governments that first produce highly-capable AI almost entirely unaccountable, allowing their decisions to play a major role in the trajectory of humanity as a whole (Ord, 2020).

Alignment and Robustness  Even if a system is deployed by an actor with good intentions and appropriate oversight, good outcomes are not guaranteed. As AI systems become more capable, it becomes more likely that they will be trusted to make high-stakes decisions autonomously. In these cases, it becomes crucial that they behave in ways that we would endorse, even when they are pushed into unfamiliar new situations. This requires both that the systems be optimized for the right objectives and that the systems actually internalize and generalize those objectives correctly.

Specifying and using safe objectives, such that aggressively optimizing them does not produce catastrophic outcomes, is difficult (Critch and Krueger, 2020). Human preferences are complex, making the problem of specifying an objective that rules out unintended bad behavior non-trivial. Goodhart’s law[6] means that many objectives that serve as good proxies for what we want in familiar situations can break down in new situations.

[6] In the formulation of Strathern (1997): “when a measure becomes a target, it ceases to be a good measure.”

Further, optimizing complex systems is difficult, such that even a genuinely safe objective, if optimized powerfully by a system with an imperfect learned representation of the objective, could produce dangerous failure modes (Hubinger et al., 2019).

Finally, we should expect highly-capable AI systems to be useful in the short term, giving potential users a strong incentive to deploy them as soon as they are affordable, even if their safety is not guaranteed. This means that it is not enough that it simply be possible for us to develop safe systems, it is additionally necessary that it be nearly as easy and nearly as affordable as developing unsafe systems (Irving et al., 2018).

Risks May Be Difficult to Spot  Human-level capabilities are likely to emerge first from large machine learning models that, like modern neural networks, are not directly interpretable. This means that it may be difficult to spot ways in which a model is unsafe or forecast ways in which its behavior might change in novel settings (Critch and Krueger, 2020).

Instrumentally-Convergent Subgoals  The instrumental convergence hypothesis holds that systems that are optimizing for benign objectives, once they become sufficiently capable, have a predictable reason to take on dangerous subgoals—like accumulating large amounts of computational, economic, or political power—to maximize the odds that their primary objectives are achieved (Bostrom, 2003; Omohundro, 2008; Bostrom, 2012).[7] Even with merely near-human-like levels of performance, the ability of computational models to be copied and accelerated gives them considerable leeway to act in un-human-like ways. Systems that interact with humans only through text, or systems whose goals are circumscribed to a well-defined task like question answering, are not exempt from this concern (Armstrong et al., 2012).

[7] This is exemplified by the thought experiment of the paperclip maximizer, which points out that a machine that is tasked with manufacturing as many paperclips as possible, if sufficiently capable, can be expected to turn nearly all matter on earth into paperclips. While this vision of a single system acting alone on a trivial objective is unrealistic, it demonstrates the key hypothesis that almost any reasonable-sounding goal starts to conflict with basic human needs if a sufficiently capable system pursues it single-mindedly.
So What?  None of these hypotheses is simple to prove, but as far as I am aware, all have resisted straightforward attempts at falsification. All four are potentially applicable to neural network-based models and to models which operate primarily through language.

While the nascent field of AI alignment has proposed some mechanisms by which we might mitigate these risks, work in this area is still largely exploratory, with no clear research agenda in place to ensure that powerful ML models will be safe (Hadfield-Menell et al., 2016; Irving et al., 2018; Critch and Krueger, 2020; Kenton et al., 2021). If these hypotheses hold, significant further work is needed to ensure a safe long-term outcome. This will be difficult to achieve without a clear accounting of the abilities and limitations of current and plausible near-future systems. In particular, we will need enough foresight to be able to see substantial progress of this kind coming well in advance, to avoid the complacency that comes with the perception that worrying about impacts from powerful AI is like worrying about “overpopulation on Mars” (Garling, 2015, quoting Andrew Ng).

6 Ways to Do Better

The core issue in this paper is one of sloppy communication about results. The most straightforward step that we can take to remedy underclaiming is to simply use the same practices that we already use to avoid overclaiming: The peer-review process already polices overclaiming to a significant extent, and most researchers have learned to be careful about overclaiming in their writing. We should apply high standards of evidence to our own empirical claims and those of others, both in peer-reviewed venues and in more informal scientific communication, even when those claims are negative and cloaked in a frame of individual or field-level modesty.

In addition, there are specific best practices or research directions that can help make these mistakes harder to make:

A New Rule?  In light of the issues with negative results on older models discussed in Section 2.1, it could be productive to introduce a new heuristic when reviewing or evaluating papers. The Bender Rule (Bender, 2019) asks authors to explicitly name the natural languages that they study, so as to avoid misleading claims about language or languages on the basis of evidence from one language (usually English) alone. In that spirit, I propose another rule of thumb:

    When describing the failure of a machine learning model on some empirical evaluation, make it clear
       i. what kind of model has failed,
      ii. whether the model is significantly less capable than the current state of the art in the domain, and
     iii. whether the evaluation was deliberately set up to trick that model or another model like it.
    The latter two points can be left unstated when they do not apply.

While a corresponding rule could be helpful in the context of results describing the success of a machine learning system on some evaluation, the asymmetry here is intentional: Successes are likely to be deliberately replicated from one generation of models to the next, creating a trend for a positive finding about older systems to remain true of newer systems in most cases. The opposite is true of model failures.

Better Evaluation  The rise of underclaiming can likely be attributed, at least in large part, to the ineffectiveness of current evaluation practices in many areas of NLP. If encouraging numbers on widely-used benchmarks are frequently met with disappointing subsequent evidence, suggesting that good evaluation numbers don’t translate to effective systems, it is rational to treat new encouraging results with extreme skepticism.

Better benchmarks and evaluation practices could help mitigate this by providing a firmer ground on which to make positive claims about system capacities.[8] In practice, research into more effective crowdsourcing and benchmark design (Kiela et al., 2021; Bowman and Dahl, 2021; Nangia et al., 2021; Gehrmann et al., 2021) and research into better statistical reporting and publication norms (Dodge et al., 2019; Card et al., 2020; Rogers and Augenstein, 2020; van Miltenburg et al., 2021) seem especially high-impact under this lens.

[8] Though Raji et al. (2021) point out ways in which better benchmarking alone is unlikely to be fully satisfactory.
Better Analysis  We can help address the time-lag issue discussed in Section 2.1 by building tooling to make it easier to adapt existing analysis techniques to new models seamlessly. Leaderboards that integrate conventional benchmarking with analysis can be especially helpful by making this largely automatic (Wang et al., 2018; Dua et al., 2019; Gehrmann et al., 2021; Ma et al., 2021).

More broadly, careful work on model robustness and interpretation, targeted at capable models, will be valuable in helping to forecast and mitigate the worst risks from future systems.

Better Forecasting  Scaling laws results in NLP (Hestness et al., 2017; Kaplan et al., 2020; Brown et al., 2020; Zhang et al., 2021) offer the promise that we can predict the performance of future larger-scale machine learning models on at least some metrics. This line of work is still nascent, and successes to date have largely focused on loss values rather than more interpretable measures of capability. Further developing these methods, as well as others that allow us to better forecast near-future progress, should be helpful. Better forecasting will provide a useful way to sanity-check future claims (DellaVigna et al., 2019) and will help improve the responsiveness of model analysis by enabling us to prepare analysis methods and datasets that anticipate future capabilities.
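As a minimal illustration of what this kind of forecasting looks like in practice, the sketch below fits a Kaplan-style power law to a handful of small-scale runs and extrapolates it one order of magnitude. The numbers are invented for illustration and the functional form is only one of several used in the scaling-laws literature; nothing here reproduces a result from the cited papers.

```python
"""Hypothetical sketch: fit a power law L(N) = a * N**(-alpha) to small-scale
(model size, validation loss) measurements and extrapolate to a larger model."""
import numpy as np

# Invented (parameter count, validation loss) pairs standing in for pilot runs.
n_params = np.array([1e7, 3e7, 1e8, 3e8, 1e9])
val_loss = np.array([4.20, 3.85, 3.52, 3.26, 3.01])

# A power law is a straight line in log-log space, so a least-squares line fit
# on the logs recovers the exponent alpha and the scale coefficient a.
slope, intercept = np.polyfit(np.log(n_params), np.log(val_loss), deg=1)
alpha, a = -slope, float(np.exp(intercept))

# Extrapolate to a model ten times larger than the largest pilot run.
predicted_loss = a * (1e10) ** (-alpha)
print(f"alpha = {alpha:.3f}; predicted loss at 1e10 parameters = {predicted_loss:.2f}")
```

A fit like this only speaks to loss, not to the more interpretable capability measures discussed above, which is exactly the gap that further forecasting research would need to close.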
7 Additional Related Work

While much of this paper discusses the state of the NLP literature, there are some related works that warrant emphasis as starting points for further reading:

Bender and Koller (2020), Bender et al. (2021), and Raji et al. (2021) discuss the role of hype in driving bad outcomes from the development of language technology. Jin et al. (2021) and Rogers (2021) offer broader discussion of how to ensure that the net impact of near-future NLP deployments on the world is positive. Looking to the longer term, Bommasani et al. (2021, §4.9) provides an introduction to the AI risk and AI alignment literature from a perspective that emphasizes NLP and language. Welty et al. (2019), Linzen (2020), Ribeiro et al. (2020), Raji et al. (2021), Bowman and Dahl (2021), and Dehghani et al. (2021), among many others, discuss the challenges involved in designing evaluations that yield trustworthy and accurate depictions of the capabilities of ML models.

8 Conclusion

Like many research fields that have a tight connection to a consumer products industry, NLP has long struggled to avoid hype and inflated expectations about the capabilities of state-of-the-art technology. This remains a serious issue. However, this paper argues that our attempts to avoid hype have begun to overshoot: Instead of merely correcting overly optimistic claims about our capabilities, we have in some cases started to replace them with overly pessimistic claims.

Making misleading claims is generally a bad sign for the health and credibility of a scientific field, and the stakes are high: NLP technologies are implicated in a range of serious real-world harms, and plausible future elaborations of these technologies are potentially much more dangerous still. Our ability to mitigate existing harms will depend on our ability to make reliably credible claims about the limitations of our systems. Our ability to mitigate future harms will depend on our ability to accurately anticipate, recognize, agree upon, and report upon emerging capabilities. Both of these goals are seriously hampered by claims that current technologies are less capable than they in fact are.

Better evaluation, better tooling for model analysis, and better mechanisms for technical forecasting should all contribute to making these pessimistic claims easier to avoid or debunk. However, this problem is ultimately one of scientific communication, and to solve it fully, we will need to use the tools and norms of science to better police false or misleading claims.

Acknowledgements

The ideas and arguments in this work were developed through conversation (live or on Twitter) with more researchers than I can name, including Emiel van Miltenburg, Deb Raji, Paul Christiano, Geoffrey Irving, Rohin Shah, Jacob Steinhardt, Catherine Olsson, Nick Beckstead, Alex Tamkin, Daniel Dewey, Alex Ray, and Robin Jia. Anna Rogers, Owain Evans, Alex Wang, Jacob Steinhardt, Jason Phang, Jared Kaplan, Alex Cristia, Tal Linzen, and Jonathan Uesato provided feedback on drafts. Any errors or dubious rhetoric are my own.

This project has benefited from financial support by Eric and Wendy Schmidt (made by recommendation of the Schmidt Futures program) and Apple. This material is based upon work supported by the National Science Foundation under Grant Nos. 1922658 and 2046556. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
essarily reflect the views of the National Science         Nick Bostrom. 2003. Ethical issues in advanced arti-
Foundation.

References

Stuart Armstrong, Anders Sandberg, and Nick Bostrom. 2012. Thinking inside the box: Controlling and using an oracle AI. Minds and Machines, 22(4):299–324.

Max Bartolo, Alastair Roberts, Johannes Welbl, Sebastian Riedel, and Pontus Stenetorp. 2020. Beat the AI: Investigating adversarial human annotation for reading comprehension. Transactions of the Association for Computational Linguistics, 8:662–678.

Emily Bender. 2019. The #benderrule: On naming the languages we study and why it matters. The Gradient.

Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the dangers of stochastic parrots: Can language models be too big? [parrot emoji]. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT '21, pages 610–623, New York, NY, USA. Association for Computing Machinery.

Emily M. Bender and Alexander Koller. 2020. Climbing towards NLU: On meaning, form, and understanding in the age of data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5185–5198, Online. Association for Computational Linguistics.

Yuchen Bian, Jiaji Huang, Xingyu Cai, Jiahong Yuan, and Kenneth Church. 2021. On attention redundancy: A comprehensive study. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 930–945, Online. Association for Computational Linguistics.

Yonatan Bitton, Gabriel Stanovsky, Roy Schwartz, and Michael Elhadad. 2021. Automatic generation of contrast sets from scene graphs: Probing the compositional consistency of GQA. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 94–105, Online. Association for Computational Linguistics.

Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. Advances in Neural Information Processing Systems, 29:4349–4357.

Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. 2021. On the opportunities and risks of foundation models. arXiv preprint 2108.07258.

Nick Bostrom. 2003. Ethical issues in advanced artificial intelligence. Science fiction and philosophy: from time travel to superintelligence, pages 277–284.

Nick Bostrom. 2012. The superintelligent will: Motivation and instrumental rationality in advanced artificial agents. Minds and Machines, 22(2):71–85.

Samuel R. Bowman and George Dahl. 2021. What will it take to fix benchmarking in natural language understanding? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4843–4855, Online. Association for Computational Linguistics.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.

Miles Brundage, Shahar Avin, Jack Clark, Helen Toner, Peter Eckersley, Ben Garfinkel, Allan Dafoe, Paul Scharre, Thomas Zeitzoff, Bobby Filar, et al. 2018. The malicious use of artificial intelligence: Forecasting, prevention, and mitigation. arXiv preprint 1802.07228.

Steven Cao, Victor Sanh, and Alexander Rush. 2021. Low-complexity probing via finding subnetworks. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 960–966, Online. Association for Computational Linguistics.

Dallas Card, Peter Henderson, Urvashi Khandelwal, Robin Jia, Kyle Mahowald, and Dan Jurafsky. 2020. With little power comes great responsibility. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9263–9274, Online. Association for Computational Linguistics.

Brian Christian. 2020. The Alignment Problem: Machine Learning and Human Values. WW Norton & Company.

Andrew Critch and David Krueger. 2020. AI research considerations for human existential safety (ARCHES). arXiv preprint 2006.04948.

Allan Dafoe. 2018. AI governance: A research agenda. Governance of AI Program, Future of Humanity Institute, University of Oxford: Oxford, UK, 1442:1443.
Ryan Dansby, Han Fang, Hao Ma, Chris Moghbel, Umut Ozertem, Xiaochang Peng, Ves Stoyanov, Sinong Wang, Fan Yang, and Kevin Zhang. 2020. AI advances to better detect hate speech. Facebook AI Blog.

Jeffrey Dastin. 2018. Amazon scraps secret AI recruiting tool that showed bias against women. Reuters.

Mostafa Dehghani, Yi Tay, Alexey A Gritsenko, Zhe Zhao, Neil Houlsby, Fernando Diaz, Donald Metzler, and Oriol Vinyals. 2021. The benchmark lottery. arXiv preprint 2107.07002.

Stefano DellaVigna, Devin Pope, and Eva Vivalt. 2019. Predict science to improve science. Science, 366(6464):428–429.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Jesse Dodge, Suchin Gururangan, Dallas Card, Roy Schwartz, and Noah A. Smith. 2019. Show your work: Improved reporting of experimental results. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2185–2194, Hong Kong, China. Association for Computational Linguistics.

Mengnan Du, Varun Manjunatha, Rajiv Jain, Ruchi Deshpande, Franck Dernoncourt, Jiuxiang Gu, Tong Sun, and Xia Hu. 2021. Towards interpreting and mitigating shortcut learning behavior of NLU models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 915–929, Online. Association for Computational Linguistics.

Dheeru Dua, Ananth Gottumukkala, Alon Talmor, Sameer Singh, and Matt Gardner. 2019. ORB: An open reading benchmark for comprehensive evaluation of machine reading comprehension. In EMNLP 2019 MRQA Workshop, page 147.

Sina Fazelpour and Zachary C Lipton. 2020. Algorithmic fairness from a non-ideal perspective. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, pages 57–63.

Matt Gardner, Yoav Artzi, Victoria Basmov, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hannaneh Hajishirzi, Gabriel Ilharco, Daniel Khashabi, Kevin Lin, Jiangming Liu, Nelson F. Liu, Phoebe Mulcaire, Qiang Ning, Sameer Singh, Noah A. Smith, Sanjay Subramanian, Reut Tsarfaty, Eric Wallace, Ally Zhang, and Ben Zhou. 2020. Evaluating models' local decision boundaries via contrast sets. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1307–1323, Online. Association for Computational Linguistics.

Caleb Garling. 2015. Andrew Ng: Why 'deep learning' is a mandate for humans, not just machines. Reuters.

Sebastian Gehrmann, Tosin Adewumi, Karmanya Aggarwal, Pawan Sasanka Ammanamanchi, Aremu Anuoluwapo, Antoine Bosselut, Khyathi Raghavi Chandu, Miruna Clinciu, Dipanjan Das, Kaustubh D Dhole, et al. 2021. The GEM benchmark: Natural language generation, its evaluation and metrics. arXiv preprint 2102.01672.

Katja Grace, John Salvatier, Allan Dafoe, Baobao Zhang, and Owain Evans. 2018. When will AI exceed human performance? Evidence from AI experts. Journal of Artificial Intelligence Research, 62:729–754.

Yvette Graham, Barry Haddow, and Philipp Koehn. 2020. Statistical power and translationese in machine translation evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 72–81, Online. Association for Computational Linguistics.

Dylan Hadfield-Menell, Stuart J Russell, Pieter Abbeel, and Anca Dragan. 2016. Cooperative inverse reinforcement learning. Advances in Neural Information Processing Systems, 29:3909–3917.

Rowan Hall Maudslay and Ryan Cotterell. 2021. Do syntactic probes probe syntax? Experiments with jabberwocky probing. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 124–131, Online. Association for Computational Linguistics.

Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, et al. 2018. Achieving human parity on automatic Chinese to English news translation. arXiv preprint 1803.05567.

Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2020. DeBERTa: Decoding-enhanced BERT with disentangled attention. arXiv preprint 2006.03654.

Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. 2020. Aligning AI with shared human values. In Proceedings of ICLR.

Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md Patwary, Mostofa Ali, Yang Yang, and Yanqi Zhou. 2017. Deep learning scaling is predictable, empirically. arXiv preprint 1712.00409.
Christopher Hidey, Tuhin Chakrabarty, Tariq Alhindi, Siddharth Varia, Kriste Krstovski, Mona Diab, and Smaranda Muresan. 2020. DeSePtion: Dual sequence prediction and adversarial examples for improved fact-checking. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8593–8606, Online. Association for Computational Linguistics.

Nora Hollenstein, Federico Pirovano, Ce Zhang, Lena Jäger, and Lisa Beinborn. 2021. Multilingual language models predict human reading behavior. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 106–123, Online. Association for Computational Linguistics.

Dirk Hovy and Diyi Yang. 2021. The importance of modeling social factors of language: Theory and practice. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 588–602, Online. Association for Computational Linguistics.

Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. 2019. Risks from learned optimization in advanced machine learning systems. arXiv preprint 1906.01820.

Geoffrey Irving, Paul Christiano, and Dario Amodei. 2018. AI safety via debate. arXiv preprint 1805.00899.

Myeongjun Jang and Thomas Lukasiewicz. 2021. NoiER: An approach for training more reliable fine-tuned downstream task models. arXiv preprint 2110.02054.

Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2021–2031, Copenhagen, Denmark. Association for Computational Linguistics.

Zhijing Jin, Geeticka Chauhan, Brian Tse, Mrinmaya Sachan, and Rada Mihalcea. 2021. How good is NLP? A sober look at NLP tasks through the lens of social impact. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 3099–3113, Online. Association for Computational Linguistics.

Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2020. The state and fate of linguistic diversity and inclusion in the NLP world. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6282–6293, Online. Association for Computational Linguistics.

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint 2001.08361.

Divyansh Kaushik, Douwe Kiela, Zachary C Lipton, and Wen-tau Yih. 2021. On the efficacy of adversarial data collection for question answering: Results from a large-scale randomized study. arXiv preprint 2106.00872.

Zachary Kenton, Tom Everitt, Laura Weidinger, Iason Gabriel, Vladimir Mikulik, and Geoffrey Irving. 2021. Alignment of language agents. arXiv preprint 2103.14659.

Douwe Kiela, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger, Zhengxuan Wu, Bertie Vidgen, Grusha Prasad, Amanpreet Singh, Pratik Ringshia, Zhiyi Ma, Tristan Thrush, Sebastian Riedel, Zeerak Waseem, Pontus Stenetorp, Robin Jia, Mohit Bansal, Christopher Potts, and Adina Williams. 2021. Dynabench: Rethinking benchmarking in NLP. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4110–4124, Online. Association for Computational Linguistics.

Samuel Läubli, Rico Sennrich, and Martin Volk. 2018. Has machine translation achieved human parity? A case for document-level evaluation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4791–4796, Brussels, Belgium. Association for Computational Linguistics.

Hector J Levesque, Ernest Davis, and Leora Morgenstern. 2011. The Winograd schema challenge. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, volume 46, page 47.

Tal Linzen. 2020. How can we accelerate progress towards human-like linguistic generalization? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5210–5217, Online. Association for Computational Linguistics.

Xiaodong Liu, Hao Cheng, Pengcheng He, Weizhu Chen, Yu Wang, Hoifung Poon, and Jianfeng Gao. 2020. Adversarial training for large neural language models. arXiv preprint 2004.08994.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint 1907.11692.

Zhiyi Ma, Kawin Ethayarajh, Tristan Thrush, Somya Jain, Ledell Wu, Robin Jia, Christopher Potts, Adina Williams, and Douwe Kiela. 2021. Dynaboard: An evaluation-as-a-service platform for holistic next-generation benchmarking. arXiv preprint 2106.06052.
James L McClelland, Felix Hill, Maja Rudolph, Jason Baldridge, and Hinrich Schütze. 2019. Extending machine language models toward human-level language understanding. arXiv preprint 1912.05877.

Tom McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3428–3448, Florence, Italy. Association for Computational Linguistics.

William Merrill, Yoav Goldberg, Roy Schwartz, and Noah A. Smith. 2021. Provable limitations of acquiring meaning from ungrounded form: What will future language models understand? Transactions of the Association for Computational Linguistics, 9:1047–1060.

Nikita Nangia, Saku Sugawara, Harsh Trivedi, Alex Warstadt, Clara Vania, and Samuel R. Bowman. 2021. What ingredients make for an effective crowdsourcing protocol for difficult NLU data collection tasks? In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1221–1235, Online. Association for Computational Linguistics.

Nikita Nangia, Clara Vania, Rasika Bhalerao, and Samuel R. Bowman. 2020. CrowS-Pairs: A challenge dataset for measuring social biases in masked language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1953–1967, Online. Association for Computational Linguistics.

Pandu Nayak. 2019. Understanding searches better than ever before. The Keyword blog.

Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. 2020. Adversarial NLI: A new benchmark for natural language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4885–4901, Online. Association for Computational Linguistics.

Timothy Niven and Hung-Yu Kao. 2019. Probing neural network comprehension of natural language arguments. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4658–4664, Florence, Italy. Association for Computational Linguistics.

Stephen M Omohundro. 2008. The basic AI drives. In Artificial General Intelligence 2008, pages 483–492. IOS Press.

Toby Ord. 2020. The Precipice: Existential Risk and the Future of Humanity. Hachette Books.

Ellie Pavlick and Tom Kwiatkowski. 2019. Inherent disagreements in human textual inferences. Transactions of the Association for Computational Linguistics, 7:677–694.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Anita Peti-Stantić, Maja Anđel, Vedrana Gnjidić, Gordana Keresteš, Nikola Ljubešić, Irina Masnikosa, Mirjana Tonković, Jelena Tušek, Jana Willer-Gold, and Mateusz-Milan Stanojević. 2021. The Croatian psycholinguistic database: Estimates for 6000 nouns, verbs, adjectives and adverbs. Behavior Research Methods, pages 1–18.

Pouya Pezeshkpour, Sarthak Jain, Byron Wallace, and Sameer Singh. 2021. An empirical comparison of instance attribution methods for NLP. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 967–975, Online. Association for Computational Linguistics.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Unpublished ms. available through a link at https://blog.openai.com/language-unsupervised/.

Inioluwa Deborah Raji, Emily Denton, Emily M Bender, Alex Hanna, and Amandalynne Paullada. 2021. AI and the everything in the whole wide world benchmark. In Proceedings of The ML-Retrospectives, Surveys & Meta-Analyses @ NeurIPS 2020 Workshop.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.

Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond accuracy: Behavioral testing of NLP models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4902–4912, Online. Association for Computational Linguistics.

Anna Rogers. 2021. Changing the world by changing the data. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2182–2194, Online. Association for Computational Linguistics.