Evaluating Machines by their Real-World Language Use

Rowan Zellers♠  Ari Holtzman♠  Elizabeth Clark♠  Lianhui Qin♠  Ali Farhadi♠  Yejin Choi♠♥
♠Paul G. Allen School of Computer Science & Engineering, University of Washington
♥Allen Institute for Artificial Intelligence
rowanzellers.com/advice

arXiv:2004.03607v1 [cs.CL] 7 Apr 2020

[Figure 1: An advice-seeking situation, with a human-written and a T5-written response. Situation: "I have to do a dissection for my high school class, but I'm distressed by dead animals. Last time we dissected an animal in class, I had a panic attack. I asked my teacher for another assignment, but she refused. I don't want to play a 'victim' card, but I don't know what to do. Help!" Human advice (rated Helpful): "I'd send a short email to the next higher-up authority figure, ideally a counselor. Be forthright; it's the best approach when self-advocating as a student." T5 advice (rated Not helpful): "Go to your teacher and say 'I'm asking you to do a project that requires me to see dead animals. This is a dealbreaker.' If she doesn't concede, tell your principal about your trauma."]

Figure 1: TuringAdvice. Humans are natural experts at using language to successfully address situations that arise, such as giving advice to a friend (shown above). We introduce a new framework, dataset, and leaderboard to generatively evaluate real-world language use. Today's most powerful models – which obtain near-human or superhuman performance on core NLP benchmarks for reading comprehension, natural language inference, and commonsense reasoning – struggle with all of these capabilities when generating advice.

Abstract

There is a fundamental gap between how humans understand and use language – in open-ended, real-world situations – and today's NLP benchmarks for language understanding. To narrow this gap, we propose to evaluate machines by their success at real-world language use – which greatly expands the scope of language tasks that can be measured and studied. We introduce TuringAdvice, a new challenge for language understanding systems. Given a complex situation faced by a real person, a machine must generate helpful advice. We make our challenge concrete by introducing RedditAdvice, a dataset and leaderboard for measuring progress. Though we release a training set with 600k examples, our evaluation is dynamic, continually evolving with the language people use: models must generate helpful advice for recently-written situations.

Empirical results show that today's models struggle at our task, even those with billions of parameters. The best model, a finetuned T5 (Raffel et al., 2019), writes advice that is at least as helpful as human-written advice in only 9% of cases. This low performance reveals language understanding errors that are hard to spot outside of a generative setting, showing much room for progress.

1 Introduction

In October 2019, a team from Google surprised many in the natural language processing (NLP) community by announcing T5, a new 11-billion parameter model pretrained on hundreds of gigabytes of language text (Raffel et al., 2019). Like many other large models released in the last few years, T5 showed impressive gains on a variety of NLP benchmarks, adding to a growing list of "solved" datasets on which machines outperform humans.

Yet, when T5 generates language, we observe clear gaps between machine-level and human-level language understanding. Consider the example in Figure 1, in which a woman asks for advice. She is assigned to dissect an animal for her class project, but has extreme anxiety about dead animals – and her teacher refused to give her another assignment. Humans can respond with helpful advice, reflecting our unique ability of real-world language use: to communicate and tackle open-ended issues.
The helpful advice in this example – but not the only one possible – suggests that she escalate the situation slightly by sending a short email to her guidance counselor.

On the other hand, not only is T5's advice unhelpful, it also reveals key misunderstandings of the situation. It seems to believe that the student is asking the teacher to do a class project involving dead animals. This reading comprehension error is particularly strange, as T5 outperforms humans on a variety of reading comprehension benchmarks. Others in the community have observed similar issues, raising concerns about what today's benchmark datasets measure (Yogatama et al., 2019; Kryscinski et al., 2019; McClelland et al., 2019; Gardner et al., 2019).

We argue that there is a deep underlying issue: a gap between how humans use language in the real world, and what our evaluation methodology can measure. Today's dominant paradigm is to study static datasets, and to grade machines by the similarity of their output with predefined correct answers. For example, we score multiple-choice exams by how often the correct answers are chosen, and evaluate generative tasks like machine translation by similarity with respect to correct translations. However, when we use language in the real world to communicate with each other – such as when we give advice, or teach a concept to someone – there is rarely a universal correct answer to compare with, just a loose goal we want to achieve.

We introduce a framework to narrow this gap between benchmarks and real-world language use. We propose to evaluate machines by their success in using language to (1) communicate with humans in (2) tackling complex, open-ended, real-world situations. Our goal is a machine that, like a human, can generate language that is useful and helpful. Doing so necessarily requires a deep understanding of language and the world, as per a line of thought that the complete meaning representation is one that suffices to complete a task (Artzi et al., 2013).

As a case study of our framework, we introduce TuringAdvice as a new grand challenge for AI systems. A machine reads a situation written by a person seeking advice, like Figure 1, and must then write advice that is helpful to the advice-seeker. Like a Turing Test (Turing, 1950), we establish a simple condition required for a model to 'pass': model-generated advice must be at least as helpful to the advice-seeker as human-written advice.

We make our challenge concrete by introducing a new dataset, RedditAdvice, and accompanying leaderboard. We tie our dataset to the Reddit community, which resolves two additional sources of bias. First, Reddit users are intrinsically motivated, seeking advice about highly complex real issues – which past work suggests differ from hypothetical issues that crowd workers might come up with (e.g. Kwiatkowski et al., 2019; Gurari et al., 2018). Second, we make our dataset and leaderboard dynamic, rather than static – evaluating models on Reddit situations posted over the previous two weeks, at the time of submission. Models therefore must tackle the same language task as humans, generalizing to new situations and patterns of language.

Experimental results show that RedditAdvice is incredibly challenging for today's machines. Today's largest model, T5, with 11 billion parameters (Raffel et al., 2019), produces advice that is preferable to human-written advice only 9% of the time – after being finetuned for our task on a training dataset with 600k pieces of advice. What's more, our experimental setup finds statistically significant differences between current models, allowing us to meaningfully grade varying levels of performance.

We also study our task from the perspective of today's standard 'core' NLP tasks. Broadly, we find that machines frequently confuse who is who, are self-contradictory, or seem to miss important world knowledge. However, these mistakes tend not to fall into the neat categories defined by standard task definitions. We address this by introducing diagnostic questions, which systematically measure these language understanding errors.

In summary, our paper makes three major contributions. First, we introduce a new framework for measuring language understanding through directly tackling real-world language problems. Second, we introduce TuringAdvice as a new challenge for AI systems, along with a dynamic dataset and leaderboard. Third, we connect our task to existing atomic language understanding tasks, introducing a new setting that reveals areas where progress is still needed.

2 Real world language use

Our key proposal is to evaluate machines by their success at real-world language use: using language to communicate with a human, in response to a naturally occurring situation, in order to achieve a desired outcome.
Our approach is inspired by Wittgenstein's notion of semantics, that "meaning is use": language is grounded in our desire to make sense of one another and cooperate to meet our needs (Wittgenstein, 1953).

As machines do not have humanlike needs or desires, we propose to evaluate a machine's success at a task by how well it serves a human who is interested in the outcome. For example, if a machine orders food on my behalf, then I can evaluate it based on whether I enjoy the dish it ordered. Though this requires careful task selection in order to make things feasible for current models, as we will show in Section 3, it results in a powerful and reliable human evaluation.

2.1 Related work

2.1.1 Pragmatics in NLP

Our evaluation relates to pragmatics in NLP, where communication is modeled also through listeners and speakers (Golland et al., 2010; Frank and Goodman, 2012). One approach is to introduce a communication game with an explicit objective. For example, Wang et al. (2016) study a blocks world where humans must build a structure by giving commands to a block-placing machine. The machine is then graded on accuracy. Our proposed evaluation instead covers complex everyday scenarios faced by a human, where the objective is to help them as much as possible.

Pragmatics can also be studied through machine-machine communication, e.g., through emergent language (Lazaridou et al., 2017). Recent work uses pretrained question-answering models to evaluate summarization models (Chen et al., 2018; Scialom et al., 2019; Eyal et al., 2019; Vasilyev et al., 2020). However, ensuring that machines communicate in standard English is difficult, as there is usually a more efficient machine-language coding scheme for the task (Kottur et al., 2017).

2.1.2 Two major approaches for evaluation

Today, we see two major approaches for model evaluation, which we discuss below.

Quality of generations. The first approach studies generative tasks like chit-chat dialogue or story-writing, and measures the inherent quality of generations, often through individual attributes such as "sensibleness" and "specificity" (e.g., Venkatesh et al., 2018; Hashimoto et al., 2019; Adiwardana et al., 2020). This approach is orthogonal to ours: though these attributes might be desirable, they are often not sufficient to guarantee task success.

Correctness. The second (and perhaps more common) approach is to evaluate tasks through correctness over static datasets. For example, machines can be graded by the similarity of their generated translation to correct translations,[1] or by how often they choose the correct answer on a multiple-choice exam. Many goal-oriented dialogue and semantics tasks are also evaluated in this way, as a model is evaluated by whether it makes the correct API call or produces a correct parse.

[1] Models submitted to the 2019 Conference on Machine Translation were evaluated (by humans) on how well the model's translations agreed with either (1) human-written translations, or (2) the original source text (Barrault et al., 2019).

Since many language tasks cannot be evaluated through correctness, researchers often introduce proxy tasks that are easy to evaluate, while (hopefully) correlating with the underlying true task. For example, SWAG (Zellers et al., 2018) is a multiple-choice proxy task and dataset introduced to study the true task of commonsense reasoning.

However, there are gaps between datasets for proxy tasks (e.g. multiple choice) and the core tasks they seek to represent (e.g. commonsense reasoning), which we discuss in the next sections.

2.2 Can language use really be measured through correctness over proxy tasks?

When we reduce a complex language task to a simplified setup, with a small label space (like multiple-choice classification), we run the risk of introducing artifacts and biases: patterns that can be exploited in the simplified setup, but that are not representative of the true task (Gururangan et al., 2018; Zellers et al., 2019a). Artifacts can enable machines to even outperform humans at the final benchmark, without solving the underlying task.

While the problem of artifacts has recently taken the spotlight in the NLP community, partially because large Transformer models (Vaswani et al., 2017) are very good at picking up on artifacts, there is a deeper underlying issue. The key assumption behind a simplified language task is that, by correctly mapping from inputs X to labels Y, a machine must necessarily learn a set of attributes A that are also representative of the 'true' task. We can upper-bound the information contained by A through the information bottleneck principle of Tishby et al. (1999). An efficient model minimizes
the following equation, for some β > 0:

    min_{p(a|x)} I(X; A) − β I(A; Y),    (1)

where I is mutual information. In other words, the model will learn attributes A that maximally compress the inputs X (minimizing I(X; A)), while also remaining good predictors of the labels Y (maximizing I(A; Y)). However, the label prediction term is bounded by the information (or entropy, H) of the label space:

    I(A; Y) = H(Y) − H(Y|A) ≤ H(Y).    (2)

This means that an efficient model, trained on a task with a small label space, might have attributes with low information content. This phenomenon has been observed empirically, with deep models iteratively discarding information at each layer (Tishby and Zaslavsky, 2015).
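To make the bound in Equation 2 concrete: on a balanced 4-way multiple-choice task, H(Y) = log2(4) = 2 bits, so an efficient model only needs attributes A carrying at most 2 bits of task-relevant information per example, no matter how rich the input is. Below is a minimal sketch (ours, not from the paper) that computes this bound for a few label spaces:

```python
import math

def label_entropy_bits(num_labels: int) -> float:
    """Entropy H(Y) in bits of a balanced label space with `num_labels` classes.

    By Equation 2, I(A; Y) <= H(Y): this is the most task-relevant
    information an efficient model is forced to retain.
    """
    return math.log2(num_labels)

for n, name in [(2, "binary classification"), (4, "4-way multiple choice")]:
    print(f"{name}: H(Y) = {label_entropy_bits(n):.1f} bits")

# Loose upper bound for open-ended generation (assumes all sequences are
# equally likely, which they are not, but the contrast is stark):
# a 100-token answer over a 50k-token vocabulary.
print(f"open-ended, 100 tokens over a 50k vocab: "
      f"H(Y) <= {100 * label_entropy_bits(50_000):.0f} bits")
```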
If we wish to evaluate language understanding via proxy tasks, then the information-discarding strategy of efficient models poses an issue. Models are encouraged to forget linguistically useful information that is not directly relevant to predicting Y (Pereira, 2000). In fact, models might exclusively learn the artifacts in the data, provided that these artifacts are enough to solve the task.

An alternate approach is to make datasets harder adversarially, so as to have fewer artifacts (Zellers et al., 2018, 2019a; Le Bras et al., 2019). However, it might be impossible to make a dataset with no artifacts, or to know if one has been created.

Our proposal, to evaluate models through their real-world usage of language, addresses the information-discarding issue in two ways. First, by using real-world language over open-ended tasks, the mapping between possible inputs and outputs is allowed to be highly complex. For example, the space of possible advice is vast, and many pieces of advice might be equally helpful given a situation. Second, our proposal tackles language problems directly, without introducing a correctness-based proxy that machines might overfit to.

2.3 Static datasets in a dynamic world

To evaluate performance on a real-world task by means of a dataset, we must (implicitly) assume that the dataset is a good representation of the world (Torralba and Efros, 2011). This assumption might be questionable when it comes to real-world language use, as static datasets necessarily capture historic patterns of language. For instance, in our field, we commonly evaluate syntactic understanding using the Penn Treebank dataset, which contains news articles from 1989 (Marcus et al., 1993). However, the world is constantly evolving, along with the language that we use.

To bridge this gap, we propose to evaluate machines by their interactions with humans in the present. Models therefore must learn to perform the underlying language tasks, even for novel situations, rather than fitting to the historic distribution of a fixed test set. We make this notion concrete in the next section, where we introduce a dynamic dataset and leaderboard for evaluating advice.

3 TuringAdvice: a new challenge for natural language understanding

As a case study of our framework, we introduce TuringAdvice, a new challenge task for AI systems to test language understanding. The format is simple: given a situation expressed in natural language, a machine must respond with helpful advice. To pass the challenge, machine-written advice must be at least as helpful to the advice-seeker as human-written advice, in aggregate.

We choose to focus on advice for a few reasons. First, people ask for and give advice as a part of their daily lives, encompassing settings as diverse as relationship advice and tech support (Bonaccio and Dalal, 2006). This means that we as humans have inherent familiarity with the task, and what it means for advice to be helpful. Thus, as we will later show empirically, advice is easy to evaluate through how much it helps the advice-seeker – even though it is highly diverse.[2]

[2] An advice-giver can recommend or advise against a particular action, they can provide information about options, and they can offer support (Dalal and Bonaccio, 2010).

Second, giving advice overlaps with core NLP tasks, such as reading comprehension and natural language inference (Section 5.4). We hypothesize that generating advice that truly helps someone requires a deep understanding of their situation.

Third, good advice is important to people. Its importance has even led to the creation of internet communities oriented around advice, making data plentiful. Likewise, we hypothesize that progress on TuringAdvice might have high impact for good. An AI capable of writing consistently helpful advice – perhaps, a virtual therapist (DeVault et al., 2014) – could greatly help people in need.
[Figure 2: An example situation from r/relationship_advice (13k points), along with two pieces of top-scoring, community-authored advice (9k and 6k points). The poster's boyfriend of eight months unveiled, as a birthday present, a massive half-finished back tattoo of the poster's face captioned "Mine forever"; the poster asks what to do. A machine must generate advice that is at least as helpful to the advice-seeker as the reference advice.]

3.1 RedditAdvice: A dynamic dataset for evaluating advice

We propose to evaluate models dynamically, through new situations and advice that are posted to Reddit. We call our dynamic dataset RedditAdvice. Many of Reddit's subcommunities (or 'subreddits') are devoted to asking for and giving advice, with subreddits for legal, relationship, and general life advice.[3] At evaluation time, we retrieve new situations from Reddit as a new test set for models. Workers on Mechanical Turk then grade the model-written advice against the Reddit-endorsed, human-written advice.

[3] We use advice from the following subreddits: Love, Relationships, Advice, NeedAdvice, Dating_Advice, Dating, Marriage, InternetParents, TechSupport, and LegalAdvice.

3.1.1 How advice-giving works on Reddit

Suppose a Reddit user faces an issue that they are seeking advice about. First, they write up their situation. The writing is typically detailed, and usually includes a question (often implicitly). They then post their situation to an advice-oriented subreddit. Users who follow that subreddit then reply to the situation, offering advice.[4]

[4] Users on Reddit can also reply to the replies themselves, in a hierarchical way. For simplicity, however, we don't incorporate these nested replies in our dataset.

Importantly, any user can 'upvote' or 'downvote' the advice, as well as the situation itself, changing its score slightly. Top-scoring advice is deemed by the wisdom of the crowd to be the most helpful, while top-scoring situations are often the most detailed.[5] See Figure 2 for an example. Key to the functioning of this online advice community is that users want to participate and are thus intrinsically motivated – situation posters need advice, repliers desire to have their advice recognized, and readers enjoy passing judgement on such advice (Chiu et al., 2006; Wasko and Faraj, 2005).

[5] This is somewhat of a simplification, as other factors also influence what gets upvoted (Anderson et al., 2012; Lakkaraju et al., 2013; Muchnik et al., 2013; Jaech et al., 2015).

3.1.2 The ideal evaluation – through Reddit?

In a sense, human advice-givers are 'evaluated' on Reddit by the score of their advice, representing how well their advice has been received by the community. Similarly, the ideal model evaluation might be to post advice on Reddit directly. If the model consistently understands written situations – enough to produce helpful advice – then its advice will in turn be consistently upvoted.

However, there is a significant ethical problem with this approach. The users who post advice questions are real people, with real problems. A user might read advice that was originally written by a machine, think it was human-endorsed, and do something harmful as a result.[6] For this reason, we take an alternate crowdsourcing approach.

[6] One alternative might be to post advice, but add a disclaimer that the advice was AI-written, and so should be taken with a grain of salt. We tried posting advice in this way, for a filtered set of non-volatile situations where no one was at imminent risk, but subreddit moderators banned our account.

3.1.3 A crowdsourced, hybrid evaluation – through Mechanical Turk

We propose a hybrid approach for dynamic evaluation of models. While the situations and reference advice come from Reddit, we hire workers on Mechanical Turk to rate the relative helpfulness of machine-written advice. Not only is this format more ethical, it also lets us collect diagnostic ratings, allowing us to quantitatively track the natural language understanding errors made by machines.

One possible concern, however, is that crowd workers might be more extrinsically motivated – performing our task to earn income, as opposed to the Reddit users who are intrinsically motivated. To address this, we made our crowdsourcing task as fulfilling as possible: using popular situations from Reddit, and pitching the work in terms of helping people. We received feedback from many workers that our tasks were entertaining and fun. This suggests that our workers are intrinsically motivated, and thus are good judges of language use – a finding we confirm empirically in Section 4.1.1.
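Concretely, each evaluation round's test set can be refreshed from Reddit's public JSON listings. Below is a minimal sketch of one way to pull recent, popular situations; this is our illustration only – the paper's actual selection criteria are in Appendix A.1, and the score threshold here is made up:

```python
import requests

# Subreddits used for RedditAdvice (footnote 3).
SUBREDDITS = ["Love", "Relationships", "Advice", "NeedAdvice", "Dating_Advice",
              "Dating", "Marriage", "InternetParents", "TechSupport", "LegalAdvice"]

def fetch_recent_situations(subreddit: str, limit: int = 100) -> list[dict]:
    """Fetch top posts from the past week via Reddit's public JSON listing."""
    url = f"https://www.reddit.com/r/{subreddit}/top.json"
    resp = requests.get(url,
                        params={"t": "week", "limit": limit},
                        headers={"User-Agent": "turingadvice-sketch/0.1"})
    resp.raise_for_status()
    posts = [child["data"] for child in resp.json()["data"]["children"]]
    # Keep popular self-text situations (the threshold is illustrative).
    return [{"subreddit": subreddit,
             "title": p["title"],
             "situation": p["selftext"],
             "score": p["score"]}
            for p in posts if p.get("selftext") and p["score"] >= 100]
```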
[Figure 3: Crowdsourcing workflow. Workers on Mechanical Turk are given a situation and two pieces of advice (A and B). First, they choose which piece of advice is more helpful: Definitely A, Slightly A, Slightly B, or Definitely B. Second, they rate how helpful the worse advice is to the question-asker: Slightly helpful, Not helpful, or Dangerous. Last, they answer an additional diagnostic question that depends on whether the worse advice was rated Slightly helpful (is it worse mainly due to its meaning, or its writing?) or not (could it be applicable to, and helpful in, a different situation?).]

3.1.4 Mechanical Turk annotation setup

In a single round of evaluation, we retrieve 200 popular Reddit situations that were posted in the last two weeks.[7] For each situation, we retrieve the top-rated human-written advice, and generate one piece of advice per model. Workers on Mechanical Turk then compare the helpfulness of the model-generated advice with the human-written advice, and provide diagnostic ratings.

[7] See Appendix A.1 for information about the selection.

We show an overview of our Mechanical Turk task in Figure 3. A worker is given a situation, as well as two pieces of advice: A and B. One is the top-scoring advice from Reddit, and the other is model-generated advice; the worker is not told which is which. The worker first chooses the more helpful piece of advice, then provides diagnostic information for the less helpful advice – rating it Slightly helpful, Not helpful, or Dangerous. If the worse piece of advice was Slightly helpful, they choose whether it is worse than the better advice due to a Meaning problem or a Writing problem. Otherwise, they choose whether the worse advice could be Possibly helpful in some other situation, or Never helpful in any situation.

Overall, three workers rate each model-situation pair, and their ratings are combined using a majority vote. We follow best practices for Mechanical Turk, including using a qualification exam, and disqualifying workers who tend to choose machine-written over human-written advice.

3.2 A large static dataset for training

We additionally present RedditAdvice2019, a large static dataset for training advice-giving models. Because today's models rely heavily on data for finetuning, we collect data in the exact same format as RedditAdvice, yet we expand our selection criteria, optimizing for recall rather than precision (Appendix A.2). In total, we extract 616k pieces of advice, over 188k situations.

To mirror the dynamic nature of the evaluation, in which models are evaluated on situations posted in 2020 and beyond, we split our dataset into static training and validation sets by date.[8] We trained models on the training set, and chose hyperparameters using perplexity over the validation set.

[8] Our training set contains 600k pieces of advice from July 2009 to June 14, 2019; the validation set contains 8k from June 14 to July 9, 2019.
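As a small illustration of the date-based split in footnote 8 (our sketch; `created_utc` is Reddit's standard post-timestamp field, but the example schema is otherwise hypothetical):

```python
from datetime import datetime, timezone

# Cutoffs from footnote 8.
TRAIN_END = datetime(2019, 6, 14, tzinfo=timezone.utc)
VAL_END = datetime(2019, 7, 9, tzinfo=timezone.utc)

def split_by_date(examples):
    """Split advice examples into train/val by post date, mirroring the
    dynamic evaluation: validation is strictly later than training.
    Each example is assumed to carry a `created_utc` UNIX timestamp."""
    train, val = [], []
    for ex in examples:
        posted = datetime.fromtimestamp(ex["created_utc"], tz=timezone.utc)
        if posted < TRAIN_END:
            train.append(ex)
        elif posted < VAL_END:
            val.append(ex)
    return train, val
```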

4 Experimental results on RedditAdvice

In this section, we report results from one round of dynamic evaluation on RedditAdvice. We evaluate the following selection of NLP models:

a. Grover (Zellers et al., 2019b): a left-to-right transformer model. Grover was pretrained on news articles with multiple fields, perhaps making it a good fit for our task, which has multiple fields of context (like the subreddit, date, and title). Sizes: We study the two largest Grover models: Grover-Large, with 0.3 billion parameters, and Grover-Mega, with 1.5 billion parameters.

b. T5 (Raffel et al., 2019): a sequence-to-sequence model with a bidirectional encoder and a left-to-right generator. T5 was trained on a large dataset of cleaned web text. At the time of writing, T5 is the top-scoring model on the GLUE and SuperGLUE benchmarks (Wang et al., 2019b,a), scoring above human performance on GLUE (90 vs. 87) and near human performance on SuperGLUE (89.3 vs. 89.8). Sizes: We study the two largest T5 models. As their names suggest, T5-3B has 3 billion parameters, and T5-11B has 11 billion parameters.

c. TF-IDF retrieval: we additionally consider a simple baseline built around retrieval, not generation. We first precompute bag-of-words TF-IDF vectors for all situations in the training set. Given a new situation, we compute its TF-IDF vector and retrieve the most similar situation from the training set. We then reply with the top-scoring advice for that situation (see the sketch after this list).

Last, to quantify the measurement error of our evaluation, we additionally evaluate:

d. the second-highest rated Reddit advice for each situation. We send this advice through the same pipeline as machine-written advice.
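A minimal sketch of the retrieval baseline in (c), using scikit-learn; this is our illustration, not the paper's exact implementation:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class TfIdfAdviceBaseline:
    """Reply to a new situation with the top-scoring advice from the
    most TF-IDF-similar training situation."""

    def __init__(self, train_situations: list[str], train_advice: list[str]):
        self.vectorizer = TfidfVectorizer()
        self.train_vecs = self.vectorizer.fit_transform(train_situations)
        self.train_advice = train_advice  # top-scoring advice per situation

    def give_advice(self, situation: str) -> str:
        query = self.vectorizer.transform([situation])
        sims = cosine_similarity(query, self.train_vecs)[0]
        return self.train_advice[sims.argmax()]
```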
We train our models using cross-entropy, and generate using Nucleus Sampling (Holtzman et al., 2020). We provide additional training and generation details for our models in Appendix B.
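Nucleus Sampling truncates the next-token distribution to the smallest set of tokens whose cumulative probability exceeds a threshold p, renormalizes, and samples from that set. A minimal sketch (ours, following Holtzman et al., 2020; see Appendix B for the settings used in the paper):

```python
import torch

def nucleus_sample(logits: torch.Tensor, p: float = 0.95) -> int:
    """Sample one token id from the smallest 'nucleus' of tokens whose
    cumulative probability mass exceeds p (Holtzman et al., 2020)."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep tokens up to and including the first one whose inclusion
    # pushes the cumulative mass past p.
    cutoff = int(torch.searchsorted(cumulative, p).item()) + 1
    nucleus_probs = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
    choice = torch.multinomial(nucleus_probs, num_samples=1)
    return int(sorted_ids[choice].item())
```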
In our study, we do not consider purely bidirectional models, such as BERT (Devlin et al., 2019). While these models can be adapted to generate text, their generations are generally worse than those of left-to-right models (Wang and Cho, 2019); moreover, T5 tends to outperform these models even on discriminative tasks. We also do not consider GPT (Radford et al., 2018), another left-to-right model, as making it controllably generate advice would require more finetuning changes than Grover.

4.1 Quantitative results

In Figure 4, we show overall results for one evaluation trial, which featured 200 situations posted on Reddit from February 1st to February 12, 2020. As a key metric for measuring the relative usefulness of model-written advice, we evaluate the frequency with which workers prefer the model-written advice over the Reddit-written reference advice. If a model's advice were just as helpful as human advice in aggregate, then that model would score 50%.

Model performance is quite low. The best model, T5 with 11 billion parameters, scores 9%. Other models, with fewer parameters, do worse, with Grover-Large (0.3B parameters) scoring 3.5%. In comparison, the second-highest scoring Reddit advice scores 40%, and the highest scoring advice is (by definition) 50%. However, in theory, a model could score above 50%, if it writes advice that is truly helpful and thus gets consistently chosen.

[Figure 4: Helpfulness of evaluated models, relative to top-scoring Reddit advice, over 200 shared situations, with bootstrapped 95% confidence intervals. Preference rates: TF-IDF Retrieval 2.0%, Grover-Large (0.3B) 3.5%, Grover-Mega (1.5B) 4.0%, T5-3B 6.0%, T5-11B 9.0%, second-best Reddit advice 40.0%. Advice from the biggest model, T5-11B, is preferred 9% of the time over Reddit advice.]

4.1.1 Measurement error

To investigate the measurement error of our evaluation, in Figure 5 we report the statistical significance between pairs of models; details about how this is computed are in Appendix C. For pairs of models with a greater difference in parameter count, we similarly see a large (and statistically significant) difference in performance. For instance, while the improvement of T5-11B over T5-3B was 3%, and not found to be statistically significant, the improvement of T5-11B over Grover-Mega was 5% and highly statistically significant.

[Figure 5: Improvement (in absolute percentage points) between pairs of models, along with statistical significance as measured by a paired t-test. The improvement of large models over smaller ones is highly significant, such as T5-11B over Grover-Mega (5% gap, p < .01). Improvement of each column model over each row model:

                  Grover-Large  Grover-Mega  T5-3B   T5-11B  2nd-best Reddit
  TF-IDF Retrieval    1.5%         2%*        4%*     7%*       38%*
  Grover-Large                     0.5%*      2.5%*   5.5%*     36.5%*
  Grover-Mega                                 2%      5%*       36%*
  T5-3B                                               3%        34%*
  T5-11B                                                        31%*

  *: significant with p < .01. The original figure distinguished p < .05 from not-significant entries by color, which does not survive extraction; per the text, the 3% gap of T5-11B over T5-3B was not significant.]

Overall, the statistical significance results suggest that our evaluation can stably rank model performance. This, along with the finding that model performance on our task is low (≤9%), suggests that there is ample room for growth on RedditAdvice.
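The pairwise comparison in Figure 5 can be reproduced in spirit with a paired t-test over per-situation outcomes. The sketch below is our illustration (not the Appendix C code): it treats each situation's majority-vote preference as a binary outcome per model, and uses made-up data.

```python
import numpy as np
from scipy.stats import ttest_rel

def compare_models(prefs_a: np.ndarray, prefs_b: np.ndarray):
    """Paired t-test over per-situation outcomes.

    prefs_a[i] / prefs_b[i] are 1 if model A's (resp. B's) advice was
    preferred over the top Reddit advice on situation i, else 0.
    Pairing by situation controls for situation difficulty."""
    gap = prefs_a.mean() - prefs_b.mean()
    t_stat, p_value = ttest_rel(prefs_a, prefs_b)
    return gap, p_value

# Illustrative use with 200 situations of fake data:
rng = np.random.default_rng(0)
a = rng.binomial(1, 0.09, size=200)  # a model preferred ~9% of the time
b = rng.binomial(1, 0.04, size=200)  # a model preferred ~4% of the time
gap, p = compare_models(a, b)
print(f"absolute gap: {gap:+.1%}, p = {p:.3f}")
```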
5 Analysis and discussion

So far, we have shown that we are able to reliably evaluate models in our dynamic setup, and that doing so results in model performance that is significantly lower than human performance.

To break down what this gap in performance means, we show a qualitative example in Figure 6, describing a wage theft situation.

[Figure 6: A qualitative example; more are in Appendix E. Though machine-generated advice matches keywords from the situation, upon a close read, it is frequently not helpful or even self-contradictory. The issues are due to critical errors in natural language understanding, such as reading comprehension, entailment, and coreference.

Situation (legaladvice): Employer just informed me I was never put on payroll and "don't have a w2". Very unusual situation. My former employer who laid all of us off suddenly in mid November is just today telling me that I was never "entered into payroll" and don't have a w2. He is suggesting that instead he can write me a check for the taxes already withheld from my pay. This immediately rang alarm bells because I'm not an idiot. I replied neutrally that I needed time to make a decision. Here's the thing, throughout my employment there I was not provided with paystubs with each paycheck. I repeatedly asked for my paystubs and it was always blamed on "the bookkeeper". I looked it up and the state I'm in does not require employers to provide paystubs. Fine. But when I was laid off I wanted to apply for unemployment and needed a final paystub. After almost a month of daily texts I finally got a final paystub from him; reviewing it now, I had almost $1500 withheld in taxes from a gross pay of around $2500 over a period of about 3.5 months. This amount of tax now seems excessively high AND if I "have no w2" does that mean my employer withheld taxes from my pay and did not actually pay that money forward as taxes? What do I need to do here to both stay in the clear tax wise AND potentially recover the money withheld perhaps wrongly?

Top Reddit advice: "If you live in a state with a labor board or labor commission start there. This is a big deal and they will be able to provide the most assistance. If your state does not have that, you really might want to call your local district attorney and see if anyone does payroll fraud. After that, the IRS might actually be able to help, they have records on the other side for you pay, if your boss withheld and sent in the tax payment which I doubt. You are actually the victim of a crime here, not just a lazy accounting. Your employer could feed all the payroll info into business software and get it all straightened out, but they aren't. I am not recommending a CPA or tax person because you're going to spend half your total pay on them, as your total income is well below the free tax filing software threshold."

TF-IDF retrieval: "This is a company that is big enough to have a payroll department and they did this? Jesus - I don't know where to even start with this. Swalsh is right - it's not only illegal it's potentially jail time illegal." (Rated: Definitely less helpful; Not helpful; Possibly helpful in a different situation)

Grover-Large: "Is there a method by which you can 'dismiss' this information as false? If so, dispute it with the IRS." (Rated: Definitely less helpful; Not helpful; Possibly helpful in a different situation)

Grover-Mega: "If you can prove that they never actually paid you for work you did, you can sue them for that money." (Rated: Definitely less helpful; Not helpful; Possibly helpful in a different situation)

T5-3B: "Is your employer US based? Because if they were (you'll need to make a claim to the IRS), then you would have had a W2 issued to you with the taxes withheld from your pay, at minimum. If you were truly not on payroll, you still have a W2 (since you were still earning money), but that's probably for taxes withheld from your paychecks as well as any state and federal income taxes that were withheld. Neither of those numbers look good." (Rated: Definitely less helpful; Not helpful; Never helpful for any situation)

T5-11B: "> I have no w2 does that mean my employer withheld taxes from my pay and did not actually pay that money forward as taxes? Yes, your employer should have withheld your taxes, reported your income correctly, and paid them into the IRS, if you are required to do so. Here's the thing though: If you get a W2 from your employer, then it's time to file your taxes for 2019. You should try and use the W2 you just got from your former employer to file." (Rated: Definitely less helpful; Not helpful; Possibly helpful in a different situation)

Second-best Reddit advice: "So it definitely is not quite adding up. Talk with a CPA, they can probably guide you through dealing with this. They would be able to keep you in the clear. It probably won't be free but shouldn't be too costly. Do this soon because they get bombarded with work between now and April so keep that in mind. Hiring a lawyer or going after them legally doesn't seem worth it. And CPA's are certified to deal with the IRS on your behalf if they decide to come calling or asking questions." (Rated: Slightly less helpful; Slightly helpful; Meaning problem)]
[Figure 7: bar chart of the frequency (%) of advice ratings for TF-IDF Retrieval, Grover-Mega (1.5B), T5-11B, and the second-best rated Reddit advice, across five rating categories: Preferred over top-rated Reddit advice; Slightly helpful, writing problem; Slightly helpful, meaning problem; Not helpful/Dangerous, possibly helpful; Not helpful/Dangerous, never helpful.]

Figure 7: Distribution of ratings for three evaluated models (the retrieval baseline, Grover-Mega, and T5-11B), along with ratings for the second-best rated Reddit advice. Though generative models like Grover and T5 are preferred more often than the TF-IDF retrieval baseline, they also often struggle to generate coherent advice. Of note, 31% of the advice from T5 would never be helpful in any situation, versus 4% from the retrieval model.

describing a wage theft situation. The top-rated Reddit advice understands the situation, and then offers helpful assistance. It recommends the advice-seeker contact their state labor board or district attorney, noting that they are “the victim of a crime.” Meanwhile, machine advice often misses the heart of the issue: T5-11B, for example, suggests that the advice-seeker simply file their taxes. Even worse, T5-3B is self-contradictory, saying that “if you were truly not on payroll, you still have a W2.”
                                                                       guage understanding errors during generation, in-
5.1   Problems with machine-written advice

As part of our evaluation, we wish to quantitatively measure problems with machine-written advice. Recall that in our crowdsourcing setup (Section 3.1.3), we ask workers not only to select which advice is better, but also to annotate problems with the worse piece of advice. We find workers have high agreement throughout the diagnostic annotation process; moreover, we use three workers per advice for additional consistency.9

9 For classifying machine-written advice as ‘helpful’ versus ‘not helpful’ or ‘dangerous’ (combining the two latter categories into one), we have κ = 0.689. For breaking down helpful advice into a ‘meaning problem’ versus a ‘writing problem’, we have Cohen’s κ = 0.646; for rating unhelpful advice as ‘possibly helpful’ versus ‘never helpful,’ we have κ = 0.636.
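To make the agreement numbers above concrete, here is a minimal sketch of how Cohen’s κ can be computed over paired worker judgments; the labels and the use of scikit-learn are our own illustration, not the authors’ code:

    # A minimal sketch (not the authors' code): Cohen's kappa for the
    # binary 'helpful' vs. 'not helpful / dangerous' split, given two
    # workers' labels for the same pieces of advice.
    from sklearn.metrics import cohen_kappa_score

    # Hypothetical labels: 1 = helpful, 0 = not helpful or dangerous.
    worker_a = [1, 0, 0, 1, 0, 1, 0, 0]
    worker_b = [1, 0, 1, 1, 0, 1, 0, 0]

    kappa = cohen_kappa_score(worker_a, worker_b)
    print(f"Cohen's kappa: {kappa:.3f}")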
In Figure 7, we show the distribution of ratings for model-written versus human-written advice. Machine-written advice that was not preferred over human-written advice can receive the following ratings. It can be rated as Slightly helpful (but worse mainly due to a Meaning problem or a Writing problem), or as either Not helpful or Dangerous (and, in turn, as Possibly helpful in some other situation or Never helpful in any situation).10

10 We found workers rarely chose Dangerous (2%), so for ease of visualization, we combined it with Not helpful.

The diagnostics show several interesting patterns. First, stronger generators improve over weaker generators: in comparing T5-11B to the weaker machine model, Grover-Mega, we find it tends to produce less ‘Not helpful/Dangerous’ advice. 26% of T5-11B’s advice is Never helpful in any situation, versus 29% for Grover-Mega; 21% is unhelpful but could be Possibly helpful in another situation, versus 32% for Grover-Mega.

Second, and perhaps most surprising, we find that all generators frequently commit natural language understanding errors during generation, including internal contradiction. Because of this, we find that our simple baseline – TF-IDF bag-of-words retrieval – is competitive with deep generators with billions of parameters. While its advice is often irrelevant (84% of the time), it is almost never complete gibberish, since it is retrieved from top-scoring advice. In fact, very few (3%) of workers rated this advice as Never helpful in any situation, versus 26% for T5.
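As an illustration of this kind of TF-IDF bag-of-words retrieval baseline (a sketch under our own assumptions, not the paper’s exact implementation), one plausible version indexes previously posted situations and returns the top-rated advice from the most similar one:

    # Sketch (our assumptions, not the paper's exact baseline):
    # retrieve advice attached to the most TF-IDF-similar situation.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Hypothetical index of (situation, top-rated advice) pairs.
    train_situations = ["My employer never gave me a W2 ...",
                        "My landlord kept my security deposit ..."]
    train_advice     = ["Contact your state labor board ...",
                        "Send a written demand letter ..."]

    vectorizer = TfidfVectorizer(stop_words="english")
    train_matrix = vectorizer.fit_transform(train_situations)

    def retrieve_advice(new_situation: str) -> str:
        # Score the new situation against every indexed situation.
        sims = cosine_similarity(vectorizer.transform([new_situation]),
                                 train_matrix)
        return train_advice[sims.argmax()]

    print(retrieve_advice("I think my employer committed wage theft"))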
5.2   A Leaderboard for Advice Evaluation

So far, we have presented the results from one round of a dynamic evaluation. We propose to keep that evaluation ongoing, through a dynamic leaderboard at rowanzellers.com/advice.

At the time of writing, the leaderboard works as follows. Users submit a model API to be dynamically evaluated. The new model, along with the highest-rated previously evaluated model, will be evaluated for an additional round on a new set of 200 situations posted over the preceding two weeks, using the same approach as in Section 3.1.3.

One potential concern, however, is price. In our Mechanical Turk workflow, we paid workers 52 cents per HIT.11

11 We chose this to pay workers at least $15 per hour.

After the Mechanical Turk fee, and using 3 workers per piece of advice, this costs $1.86 per piece of advice, or $372 for 200 pieces of advice. We argue that this cost should not prohibit a human evaluation, particularly when compared with the electricity cost of model development (Strubell et al., 2019). To ensure that the cost is fairly distributed among the community, we propose that submitters to our leaderboard pay the Mechanical Turk bill.12

12 This model is used for the HYPE leaderboard in computer vision (Zhou et al., 2019).
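Concretely, a submitted model API might look something like the following minimal sketch; the /advise endpoint, the JSON fields, and the use of Flask are our own assumptions for illustration, not the leaderboard’s actual specification:

    # A hypothetical model API for leaderboard evaluation (our
    # assumptions, not the actual submission spec): the server
    # receives a situation and returns machine-written advice.
    from flask import Flask, request, jsonify

    app = Flask(__name__)

    def generate_advice(situation: str) -> str:
        # Placeholder for a real model, e.g. a fine-tuned generator.
        return "..."

    @app.route("/advise", methods=["POST"])
    def advise():
        situation = request.get_json()["situation"]
        return jsonify({"advice": generate_advice(situation)})

    if __name__ == "__main__":
        app.run(port=8000)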
[Figure 8: two histograms of length in spaCy tokens, one for RedditAdvice situations and one for RedditAdvice advice, with HellaSWAG, GLUE, and SuperGLUE shown for comparison.]

Figure 8: Length distribution of RedditAdvice, compared with other common NLU benchmarks (HellaSWAG, Zellers et al. (2019a); GLUE, Wang et al. (2019b); SuperGLUE, Wang et al. (2019a)). The examples in RedditAdvice are significantly longer, representing highly complex situations.

5.3   Length and complexity

One interesting aspect of RedditAdvice is that its situations are long and complex. We show a distribution of lengths in Figure 8. Not only are the inputs and outputs complex, but we argue that the mapping between them is complex as well: it is necessarily one-to-many, as there might be many possible kinds of good advice for a situation.

We believe evaluating by task success – how much the advice helps a user – is the key reason why a task like advice-giving can be scaled up. First, intrinsic motivation rewards high data quality: users who post situations are motivated to add relevant details, and advice-givers are motivated to help the user. Second, task success can stably evaluate long passages. Two pieces of advice rarely mean exactly the same thing, but this does not mean we cannot evaluate which is more helpful.
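As a rough sketch of how such length statistics can be computed with spaCy (our own illustration; the pipeline name en_core_web_sm is an assumption, not necessarily what was used here):

    # Sketch: measuring example lengths in spaCy tokens.
    import spacy

    # Load a standard English pipeline; disable components we don't
    # need, since only tokenization matters for counting.
    nlp = spacy.load("en_core_web_sm", disable=["parser", "ner", "tagger"])

    situations = ["I think my employer committed wage theft ..."]  # hypothetical
    lengths = [len(nlp(text)) for text in situations]
    print(f"mean length: {sum(lengths) / len(lengths):.1f} tokens")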
5.4   Relation to existing NLP tasks

Shared “core” tasks such as reading comprehension and natural language inference are of considerable interest to the NLP community. Many datasets have been proposed for these tasks, and progress on them is often measured through auto-gradeable correctness metrics. However, large models have started to outperform humans on these datasets, raising doubt that further progress on them brings us closer to human-level language understanding.

We argue two things: first, that many NLP tasks are necessary components of giving advice, and second, that because giving advice remains far from solved, these tasks are also far from solved. In Appendix E, we study problems with advice from T5-11B from the point of view of existing NLP tasks. For instance, machine advice often contradicts itself, suggesting that today’s systems struggle with the general task of natural language inference.

One interesting line of research would be to transfer knowledge from supervised NLP datasets into a generative setting.13 However, one difficulty is that datasets are necessarily curated – in terms of both the label space and the data distribution. Paragraphs of machine-written advice, which exhibit many kinds of language understanding errors, might be significantly out-of-distribution.

13 Another idea is crowdsourcing data specifically to improve generation models, e.g. dialogue (Welleck et al., 2019).

We propose another way forward. Predicting the advice ratings themselves (such as whether advice could ever be helpful) is itself a language task, one that might provide signal for better generation.14

14 To allow for study of this problem, we make the full evaluation results – including the advice ratings – public.

Overall, this suggests that evaluating machines by their language use might lead to progress on existing NLP tasks in two ways. First, by studying a generative setting, we necessarily adopt a broad and inclusive definition of the task at hand; second, we can turn common challenges into small discriminative tasks to study further.
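As one illustration, a simple first cut at this discriminative task might pair situations and advice with the released helpfulness ratings; everything below (the data format, features, and classifier) is our own assumption, not a released baseline:

    # Sketch of the proposed discriminative task (our assumptions):
    # predict whether advice could ever be helpful, from a
    # bag-of-words representation of the situation and advice.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Hypothetical training pairs distilled from the released ratings.
    texts  = ["SITUATION: no W2 from employer ... ADVICE: file your taxes",
              "SITUATION: no W2 from employer ... ADVICE: call the labor board"]
    labels = [0, 1]  # 0 = never helpful, 1 = possibly/actually helpful

    clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
    clf.fit(texts, labels)
    print(clf.predict(["SITUATION: wage theft ... ADVICE: contact a CPA"]))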
5.5   How can we build models that are better at giving advice?

Over the last few years, a major trend in NLP has been towards developing bigger models, while making fewer changes to neural architecture and the training objective. Almost all of today’s leaderboard models are Transformers (Vaswani et al., 2017) trained through the maximum-likelihood training objective of predicting masked-out (or next) words. Our experimental results suggest that even though these models struggle with RedditAdvice, we are still able to measure small gains.

At the same time, our results suggest that scaling up parameter counts might not be enough. With 11 billion parameters, machines score 9% on our benchmark, versus 50% for humans (Figure 4).

We hypothesize that a key challenge for our field will be to move away from training models through word prediction objectives. We suspect that word prediction makes a model uniquely suited for correctness-based tasks, as word prediction itself is a task with a single correct answer. However, word prediction might not necessarily lead to the formation of mental models about the world. Other ideas include different modeling choices (Bengio, 2017) and learning paradigms (Mao et al., 2019). These and other directions seem promising for building better models that are not just better advice-givers, but better at real-world language use in general.
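For reference, the left-to-right variant of the word-prediction objective discussed above is the standard maximum-likelihood loss over token sequences (our notation, not the paper’s):

    \mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(w_t \mid w_{<t})

where w_{<t} denotes the tokens preceding w_t; the masked variant instead predicts masked-out tokens given the visible context.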
                                                               et al. 2020. Towards a human-like open-domain
5.6   Ethical implications; possible dual use

One benefit of our proposal is that evaluating machines by their language use, on tasks with intrinsic motivation, might enable progress towards social good applications. For example, machines might one day help people who need advice – potentially on sensitive topics.

However, we do not claim that our approach is a panacea. We should approach purely technological solutions to societal problems (such as mental health care) with a grain of salt. Moreover, progress on using language effectively might yield models that cause harm, such as through generating disinformation (Zellers et al., 2019b). We as a community should (continue to) study and be mindful of these kinds of dual use issues (Hovy and Spruit, 2016; Green and Viljoen, 2020).

6   Conclusion

In our work, we introduced new methodology for evaluating language tasks, reducing the gap between our benchmarks and the real world. We also introduced a new challenge for the community, TuringAdvice, with an accompanying dataset and dynamic leaderboard, RedditAdvice. Today’s largest models struggle on RedditAdvice, so we are excited to see what new models get developed.

Acknowledgements

Thanks to the Reddit users who participate in its advice subreddits – from asking for help, to writing (and voting on) helpful advice. Thanks to the Mechanical Turk workers who performed the annotation for our experiments. Thanks also to Katharina Reinecke, Oren Etzioni, Hannah Rashkin, Maarten Sap, Maxwell Forbes, Jesse Thomason, Daniel Khashabi, Gabriel Ilharco, Swabha Swayamdipta, and Yonatan Bisk for feedback. This research was supported in part by NSF (IIS-1524371, IIS-1714566), DARPA under the CwC program through the ARO (W911NF-15-1-0543), DARPA under the MCS program through NIWC Pacific (N66001-19-2-4031), and the NSF-GRFP No. DGE-1256082.

References

Daniel Adiwardana, Minh-Thang Luong, David R So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, et al. 2020. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977.

Ashton Anderson, Daniel P. Huttenlocher, Jon M. Kleinberg, and Jure Leskovec. 2012. Effects of user similarity in social media. In WSDM ’12.

Yoav Artzi, Nicholas FitzGerald, and Luke S Zettlemoyer. 2013. Semantic parsing with combinatory categorial grammars. ACL (Tutorial Abstracts), 3.

Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 1–61, Florence, Italy. Association for Computational Linguistics.

Yoshua Bengio. 2017. The consciousness prior. arXiv preprint arXiv:1709.08568.

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. 2020. Piqa: Reasoning about physical commonsense in natural language. In Thirty-Fourth AAAI Conference on Artificial Intelligence.

Silvia Bonaccio and Reeshad S. Dalal. 2006. Advice taking and decision-making: An integrative literature review, and implications for the organizational sciences. Organizational Behavior and Human Decision Processes, 101(2):127–151.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 632–642.

Ping Chen, Fei Wu, Tong Wang, and Wei Ding. 2018. A semantic qa-based approach for text summarization evaluation. In Thirty-Second AAAI Conference on Artificial Intelligence.

Chao-Min Chiu, Meng-Hsiang Hsu, and Eric TG Wang. 2006. Understanding knowledge sharing in virtual communities: An integration of social capital and social cognitive theories. Decision Support Systems, 42(3):1872–1888.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The pascal recognising textual entailment challenge. In Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment, pages 177–190. Springer.

Reeshad S Dalal and Silvia Bonaccio. 2010. What types of advice do decision-makers prefer? Organizational Behavior and Human Decision Processes, 112(1):11–23.

David DeVault, Ron Artstein, Grace Benn, Teresa Dey, Ed Fast, Alesia Gainer, Kallirroi Georgila, Jon Gratch, Arno Hartholt, Margaux Lhommet, et al. 2014. Simsensei kiosk: A virtual human interviewer for healthcare decision support. In Proceedings of the 2014 International Conference on Autonomous Agents and Multi-agent Systems, pages 1061–1068.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Matan Eyal, Tal Baumel, and Michael Elhadad. 2019. Question answering as an automatic evaluation metric for news article summarization. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3938–3948.

Michael C Frank and Noah D Goodman. 2012. Predicting pragmatic reasoning in language games. Science, 336(6084):998–998.

Matt Gardner, Jonathan Berant, Hannaneh Hajishirzi, Alon Talmor, and Sewon Min. 2019. On making reading comprehension more comprehensive. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pages 105–112, Hong Kong, China. Association for Computational Linguistics.

Dave Golland, Percy Liang, and Dan Klein. 2010. A game-theoretic approach to generating spatial descriptions. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 410–419. Association for Computational Linguistics.

Ben Green and Salomé Viljoen. 2020. Algorithmic realism: Expanding the boundaries of algorithmic thought. In Proceedings of the ACM Conference on Fairness, Accountability, and Transparency (FAT*).

Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. 2018. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3608–3617.

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In Proc. of NAACL.

Tatsunori Hashimoto, Hugh Zhang, and Percy Liang. 2019. Unifying human and statistical evaluation for natural language generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1689–1701.

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration. In ICLR.

Dirk Hovy and Shannon L Spruit. 2016. The social impact of natural language processing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 591–598.

Aaron Jaech, Victoria Zayats, Hao Fang, Mari Ostendorf, and Hannaneh Hajishirzi. 2015. Talking to the crowd: What do people react to in online discussions? In EMNLP.

Satwik Kottur, José Moura, Stefan Lee, and Dhruv Batra. 2017. Natural language does not emerge ‘naturally’ in multi-agent dialog. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2962–2967, Copenhagen, Denmark. Association for Computational Linguistics.

Wojciech Kryscinski, Nitish Shirish Keskar, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. Neural text summarization: A critical evaluation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 540–551, Hong Kong, China. Association for Computational Linguistics.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural questions: a benchmark for question answering research. Transactions of the Association of Computational Linguistics.

Himabindu Lakkaraju, Julian J. McAuley, and Jure Leskovec. 2013. What’s in a name? understanding the interplay between titles, content, and communities in social media. In ICWSM.

Angeliki Lazaridou, Alexander Peysakhovich, and Marco Baroni. 2017. Multi-agent cooperation and the emergence of (natural) language. ICLR.

Ronan Le Bras, Swabha Swayamdipta, Chandra Bhagavatula, Rowan Zellers, Matthew E. Peters, Ashish Sabharwal, and Yejin Choi. 2019. Adversarial filters of dataset biases. ArXiv, abs/2002.04108.

Jiayuan Mao, Chuang Gan, Pushmeet Kohli, Joshua B Tenenbaum, and Jiajun Wu. 2019. The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision. In ICLR.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

James L McClelland, Felix Hill, Maja Rudolph, Jason Baldridge, and Hinrich Schütze. 2019. Extending machine language models toward human-level language understanding. arXiv preprint arXiv:1912.05877.

Lev Muchnik, Sinan Aral, and Sean J. Taylor. 2013. Social influence bias: a randomized experiment. Science, 341(6146):647–651.

Fernando Pereira. 2000. Formal grammar and information theory: together again? Philosophical Transactions of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences, 358(1769):1239–1253.

Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang. 2012. Conll-2012 shared task: Modeling multilingual unrestricted coreference in ontonotes. In Joint Conference on EMNLP and CoNLL-Shared Task, pages 1–40. Association for Computational Linguistics.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Technical report, OpenAI.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv e-prints.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392.

Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. 2019. Social iqa: Commonsense reasoning about social interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4453–4463.

Thomas Scialom, Sylvain Lamprier, Benjamin Piwowarski, and Jacopo Staiano. 2019. Answers unite! unsupervised metrics for reinforced summarization models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3246–3256, Hong Kong, China. Association for Computational Linguistics.

Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 464–468.

Noam Shazeer and Mitchell Stern. 2018. Adafactor: Adaptive learning rates with sublinear memory cost. In International Conference on Machine Learning, pages 4603–4611.

Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and policy considerations for deep learning in nlp. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3645–3650.

Naftali Tishby, Fernando C. Pereira, and William Bialek. 1999. The information bottleneck method. In Proc. of the 37-th Annual Allerton Conference on Communication, Control and Computing, pages 368–377.

Naftali Tishby and Noga Zaslavsky. 2015. Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), pages 1–5. IEEE.

Antonio Torralba and Alexei A Efros. 2011. Unbiased look at dataset bias. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1521–1528. IEEE.

Alan M. Turing. 1950. Computing Machinery and Intelligence. Mind, LIX(236):433–460.

Oleg Vasilyev, Vedant Dharnidharka, and John Bohannon. 2020. Fill in the blanc: Human-free quality estimation of document summaries. arXiv preprint arXiv:2002.09836.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 6000–6010. Curran Associates Inc.
