TuringAdvice: A Generative and Dynamic Evaluation of Language Use


Rowan Zellers♠  Ari Holtzman♠  Elizabeth Clark♠  Lianhui Qin♠  Ali Farhadi♠  Yejin Choi♠♥
♠Paul G. Allen School of Computer Science & Engineering, University of Washington
♥Allen Institute for Artificial Intelligence
rowanzellers.com/advice

Abstract

We propose TuringAdvice, a new challenge task and dataset for language understanding models. Given a written situation that a real person is currently facing, a model must generate helpful advice in natural language. Our evaluation framework tests a fundamental aspect of human language understanding: our ability to use language to resolve open-ended situations by communicating with each other.
Empirical results show that today's models struggle at TuringAdvice, even multibillion parameter models finetuned on 600k in-domain training examples. The best model, a finetuned T5, writes advice that is at least as helpful as human-written advice in only 14% of cases; a much larger non-finetunable GPT3 model does even worse at 4%. This low performance reveals language understanding errors that are hard to spot outside of a generative setting, showing much room for progress.

Figure 1: TuringAdvice. Humans are natural experts at using language to successfully address situations that arise, such as giving advice. We introduce a new framework, dataset, and leaderboard to generatively evaluate real-world language use. Today's most powerful models – which obtain near-human or superhuman performance on core NLP benchmarks for reading comprehension, natural language inference, and commonsense reasoning – struggle with all of these capabilities when generating advice, as highlighted in red.
  Situation: "I have to do a dissection for my high school class, but I'm distressed by dead animals. Last time we dissected an animal in class, I had a panic attack. I asked my teacher for another assignment, but she refused. I don't want to play a 'victim' card, but I don't know what to do. Help!"
  Helpful advice: "I'd send a short email to the next higher-up authority figure, ideally a counselor. Be forthright; it's the best approach when self-advocating as a student."
  T5's advice (not helpful): "Go to your teacher and say "I'm asking you to do a project that requires me to see dead animals. This is a dealbreaker." If she doesn't concede, tell your principal about your trauma."

1 Introduction

Language models today are getting ever-larger, and are being trained on ever-increasing quantities of text. For an immense compute cost, these models like T5 (Raffel et al., 2019) and GPT3 (Brown et al., 2020) show gains on a variety of standard NLP benchmarks – often even outperforming humans.
Yet, when a giant model like T5 generates language, we observe clear gaps between machine-level and human-level language understanding – even after it has been finetuned for the task at hand. Consider Figure 1, in which a woman asks for advice. She is assigned to dissect an animal for her class project, but has extreme anxiety about dead animals – and her teacher refused to give her another assignment. Humans can respond with helpful advice, reflecting our unique ability of real-world language use: to communicate and tackle open-ended issues. The helpful advice in this example - but not the only one possible - suggests that she send a short email to her guidance counselor.
On the other hand, not only is T5's advice unhelpful, it also reveals key misunderstandings of the situation. It seems to believe that the student is asking the teacher to do a class project involving dead animals. This reading comprehension error is particularly strange, as T5 outperforms humans on a variety of reading comprehension benchmarks. Others in the community have observed similar issues, raising concerns about what today's benchmark datasets measure (Yogatama et al., 2019; Kryscinski et al., 2019; McClelland
et al., 2019; Gardner et al., 2019).
We argue that there is a deep underlying issue: a gap between how humans use language in the real world, and what benchmarks today can measure. Today's dominant paradigm is to study static datasets, and to grade machines by the similarity of their output with predefined correct answers. For example, we score multiple choice exams by how often the correct answers are chosen, and evaluate generative tasks like machine translation by similarity with respect to correct translations. However, when we use language in the real world to communicate with each other – such as when we give advice, or teach a concept to someone – there is rarely a universal correct answer to compare with, just a loose goal we want to achieve.
We introduce a framework to narrow this gap between benchmarks and real-world language use. We propose to evaluate machines by their success in using language to (1) communicate with humans in (2) tackling complex, open-ended, real-world situations. Our goal is a machine that, like a human, can generate language that is useful and helpful. Doing so necessarily requires a deep understanding of language and the world, as per a line of thought that the complete meaning representation is one that suffices to complete a task (Artzi et al., 2013).
As a case-study of our framework, we introduce TuringAdvice as a new grand challenge for AI systems. A machine reads a situation written by a person seeking advice, like Figure 1, and must then write advice that is helpful to the advice-seeker. Like a Turing Test (Turing, 1950), we establish a simple condition required for a model to 'pass': model-generated advice must be at least as helpful to the advice-seeker as human-written advice.
We make our challenge concrete by introducing a new dataset, RedditAdvice, and accompanying leaderboard. We tie our dataset to the Reddit community, which resolves two additional sources of bias. First, Reddit users are intrinsically motivated, seeking advice about highly complex real issues – which past work suggests differ from hypothetical issues that crowd workers might come up with (e.g. Kwiatkowski et al., 2019; Gurari et al., 2018). Second, we make our dataset dynamic, not static – models are evaluated over Reddit situations posted over the previous two weeks at the time of submission. Models therefore, like humans, must generalize to new situations and patterns of language.
Experimental results show that TuringAdvice is incredibly challenging for NLP models. Today's largest finetunable model, T5 with 11 billion parameters, produces advice that is preferable to human-written advice 14.5% of the time – after being finetuned on 600k examples. GPT3, an even larger model with 175 billion parameters that was not released for finetuning, does even worse at 4%. Even more concerning, our evaluation finds that it often generates hateful and toxic language.
We also study our task from the perspective of today's standard 'core' NLP tasks. Broadly, we find that machines frequently confuse who is who, are self-contradictory, or seem to miss important world knowledge. However, these mistakes tend not to fall into the neat categories defined by standard task definitions. We address this by introducing diagnostic questions, which systematically measure these language understanding errors.
In summary, our paper makes three contributions. First, we introduce a new framework for measuring language understanding through directly tackling real-world language problems. Second, we introduce TuringAdvice as a new challenge for AI systems, along with a dynamic dataset and leaderboard. Third, we connect our task to existing atomic NLP tasks, introducing a new setting that reveals where progress is still needed.

2 Real World Language Use

We propose to evaluate machines by their success at real-world language use: using language to communicate with a human, in response to a naturally occurring situation, in order to achieve a desired outcome. This is how educators often measure (human) language understanding of a second language – by how well the learner can use the language (Council of Europe, 2001). Our approach is also inspired by Wittgenstein's notion of semantics, that "meaning is use:" language is grounded in our desire to make sense of one another and cooperate to meet our needs (Wittgenstein, 1953).
As machines do not have humanlike needs or desires, we propose to evaluate machines' success at a task by how well it serves a human who is interested in the outcome. For example, if a machine orders food on my behalf, then I can evaluate it based on whether I enjoy the dish it ordered. Though this requires careful task selection in order to make things feasible for current models, as we will show in Section 3, it results in a powerful and reliable human evaluation.
2.1 Related work

2.1.1 Pragmatics in NLP
Our evaluation relates to pragmatics in NLP, where communication is modeled also through listeners and speakers (Golland et al., 2010; Frank and Goodman, 2012). One approach is to introduce a communication game, with an explicit objective. For example, Wang et al. (2016) study a blocks world where humans give commands to a block-placing machine. The machine is then graded on accuracy. Our proposed evaluation instead covers complex everyday scenarios faced by a human, where the objective is to help them as much as possible.
Pragmatics can also be studied through machine-machine communication; e.g., through emergent language (Lazaridou et al., 2017). Recent work uses pretrained question-answering models to evaluate summarization models (Chen et al., 2018; Scialom et al., 2019; Eyal et al., 2019; Vasilyev et al., 2020). However, ensuring that machines communicate in standard English is difficult, as there is usually a more efficient machine-language coding scheme for the task (Kottur et al., 2017).

2.1.2 Two major approaches for evaluation
Today, we see two major approaches for NLP evaluation, which we discuss below.
Quality of generations. The first approach studies generative tasks like chit-chat dialogue or story-writing, and measures the inherent quality of generations, often through attributes such as "sensibleness" and "specificity" (e.g., Venkatesh et al., 2018; Hashimoto et al., 2019; Adiwardana et al., 2020). This approach is orthogonal to ours: though these attributes might be desirable, they are often insufficient to guarantee success at a task.
Correctness. The second (and perhaps more common) approach is to evaluate models through correctness over static datasets. For example, machines can be graded by the similarity of their generated translation to correct translations,¹ or, by how often they choose the correct answer on a multiple choice exam. Many goal-oriented dialogue and semantics tasks are also evaluated in this way, as a model is evaluated by whether it makes the correct API call, or produces a correct parse.
    ¹ Models submitted to the 2019 Conference on Machine Translation were evaluated (by humans) on how well the model's translations agreed with either (1) human-written translations, or, (2) original source text (Barrault et al., 2019).
Since many language tasks cannot be evaluated through correctness, researchers often introduce proxy tasks that are easy to evaluate, while (hopefully) correlating with the underlying true task. For example, SWAG (Zellers et al., 2018) is a multiple-choice proxy task and dataset introduced to study the true task of commonsense reasoning.
However, there are gaps between datasets for proxy tasks (e.g. multiple choice), and the core tasks they seek to represent (e.g. commonsense reasoning), which we discuss in the next sections.

2.2 Can language use really be measured through correctness over proxy tasks?
When we reduce a complex language task to a simplified setup, with a small label space (like multiple-choice classification), we run the risk of introducing artifacts and biases: patterns that can be exploited in the simplified setup, but that are not representative of the true task (Gururangan et al., 2018; Zellers et al., 2019a). Artifacts can enable machines to even outperform humans at the final benchmark, without solving the underlying task.
While the problem of artifacts has recently taken the spotlight in the NLP community, partially because large Transformers (Vaswani et al., 2017) excel at picking up on artifacts, there is a deeper underlying issue. One way to view simplified tasks is that in order to correctly map inputs X to labels Y, a machine must learn a set of attributes A that are representative of the 'true' task. We can upper-bound the information contained by A through the information bottleneck principle of Tishby et al. (1999). An efficient model minimizes the following, for some β > 0:

    min_{p(a|x)}  I(X; A) − β I(A; Y),                          (1)

where I is mutual information. In other words, the model will learn attributes A that maximally compress the inputs X (minimizing I(X; A)), while also remaining good predictors of the labels Y (maximizing I(A; Y)). However, the label prediction term is bounded by the information (or entropy, H) of the label space:

    I(A; Y) = H(Y) − H(Y|A) ≤ H(Y).                             (2)

Thus, for a task with a small label space, there is no guarantee that a model will learn high-information content attributes. Models are in fact encouraged to overfit to dataset artifacts, and to unlearn linguistically useful information that is not directly relevant to predicting Y (Pereira, 2000).
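To make the bound in Equation (2) concrete, here is a minimal, purely illustrative sketch (not part of the paper's experiments; the toy joint distribution and all names are made up) that computes H(Y) and I(A; Y) for a discrete attribute A and a two-way label space, confirming that no learned attribute can carry more than H(Y) bits about the labels:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in bits) of a discrete distribution p."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def mutual_information(joint):
    """I(A; Y) in bits, given a joint probability table joint[a, y]."""
    pa = joint.sum(axis=1, keepdims=True)   # marginal p(a)
    py = joint.sum(axis=0, keepdims=True)   # marginal p(y)
    mask = joint > 0
    return float((joint[mask] * np.log2(joint[mask] / (pa @ py)[mask])).sum())

# Toy setup: 8 possible attribute values A, 2 labels Y (a binary-choice task).
rng = np.random.default_rng(0)
joint = rng.random((8, 2))
joint /= joint.sum()                        # normalize into a joint p(a, y)

h_y = entropy(joint.sum(axis=0))            # H(Y): at most 1 bit for 2 labels
i_ay = mutual_information(joint)            # I(A; Y) = H(Y) - H(Y|A)
print(f"H(Y)    = {h_y:.3f} bits")
print(f"I(A; Y) = {i_ay:.3f} bits  <=  H(Y)")
```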
An alternate approach is to make datasets harder adversarially, so as to have fewer artifacts (Zellers et al., 2018, 2019a; Le Bras et al., 2020). However, it might be impossible to make a dataset with no artifacts, or to know if one has been created.
Our proposal, to evaluate models by their real-world language use, addresses the information bottleneck issue in two ways. First, when we use language in the real world, the mapping between possible inputs and outputs is often highly complex. For example, the space of possible advice is vast, and many pieces of advice might be equally helpful given a situation. Second, we directly tackle language problems, without introducing a correctness-based proxy that machines might overfit to.

2.3 Static datasets in a dynamic world
To evaluate performance on a real-world task by means of a dataset, we (implicitly) assume that the dataset is a good representation of the world (Torralba and Efros, 2011). This might be questionable when it comes to real-world language use, as static datasets necessarily capture historic patterns of language. For instance, syntactic understanding is often evaluated using the Penn Treebank, with news articles from 1989 (Marcus et al., 1993). However, the world is constantly evolving, along with the language that we use.
To bridge this gap, we propose to evaluate machines by their interactions with humans in the present. Models therefore must learn to perform the underlying language task, even for novel situations, rather than fitting to the historic distribution of a fixed test set. We make this notion concrete in the next section, where we introduce a dynamic dataset and leaderboard for evaluating advice.

3 TuringAdvice: a New Challenge for Natural Language Understanding

As a case study of our framework, we introduce TuringAdvice, a new challenge task for AI systems to test language understanding. The format is simple: given a situation expressed in natural language, a machine must respond with helpful advice. To pass the challenge, machine-written advice must be at least as helpful to the advice-seeker as human-written advice, in aggregate.
We focus on advice for a few reasons. First, advice-giving is both an important and an everyday task. People ask for and give advice in settings as diverse as relationship advice and tech support (Bonaccio and Dalal, 2006). Thus, we as humans have inherent familiarity with the task, and what it means for advice to be helpful – making it easy to evaluate, as we later show empirically. Moreover, because there are many internet communities devoted to advice-giving, training data is plentiful.
Second, the framework of advice-giving allows us to study subtasks such as reading comprehension and natural language inference (Section 5.3); we argue both of these are needed to consistently give good advice. Learning to recognize advice has recently been studied as an NLP task on its own (Govindarajan et al., 2020), though we are not aware of past work in learning to generate advice.

3.1 RedditAdvice: A dynamic dataset for evaluating advice
We propose to evaluate models dynamically, through new situations and advice that are posted to Reddit. We call our dynamic dataset RedditAdvice. Many of Reddit's subcommunities (or 'subreddits') are devoted to asking for and giving advice, with subreddits for legal, relationship, and general life advice.² During evaluation time, we will retrieve new situations from Reddit as a new test set for models. Workers on Mechanical Turk then grade the model-written advice versus the Reddit-endorsed human-written advice.
    ² We use advice from the following subreddits: Love, Relationships, Advice, NeedAdvice, Dating_Advice, Dating, Marriage, InternetParents, TechSupport, and LegalAdvice.

3.1.1 How advice-giving works on Reddit
Suppose a Reddit user faces an issue that they are seeking advice about. First, they write up their situation and post it to an advice-oriented subreddit. Users then reply to the situation, offering advice.
Importantly, any user can 'upvote' or 'downvote' the advice as well as the situation itself - changing its score slightly. Top-scoring advice is deemed by the wisdom of the crowd as being the most helpful.³
    ³ This is somewhat of a simplification, as other factors also influence what gets upvoted (Anderson et al., 2012; Lakkaraju et al., 2013; Muchnik et al., 2013; Jaech et al., 2015).

3.1.2 The ideal evaluation - through Reddit?
In a sense, human advice-givers are 'evaluated' on Reddit by the score of their advice – representing how well their advice has been received by the community. Similarly, the ideal model evaluation might be to post advice on Reddit directly. If the model writes helpful advice, it should be upvoted.
However, there is a significant ethical problem with this approach. The users who post advice questions are real people, with real problems. A user might read advice that was originally written by a machine, think it was human-endorsed, and do something harmful as a result. For this reason, we take an alternate crowdsourcing approach.

3.1.3 A crowdsourced, hybrid evaluation – through Mechanical Turk
We propose a hybrid approach for dynamic evaluation of models. While the situations and reference advice come from Reddit, we hire workers on Mechanical Turk to rate the relative helpfulness of machine-written advice. Not only is this format more ethical, it also lets us collect diagnostic ratings, allowing us to quantitatively track the natural language understanding errors made by machines. We made our crowdsourcing task as fulfilling as possible - using popular situations from Reddit, and pitching the work in terms of helping people. We received feedback from many workers that our tasks were entertaining and fun, suggesting that our workers are to some degree intrinsically motivated.

3.1.4 Mechanical Turk annotation setup
In a single round of evaluation, we retrieve 200 popular Reddit situations that were posted in the last two weeks. For each situation, we retrieve the top-rated advice from Reddit, and generate one piece of advice per model. Workers on Mechanical Turk then compare the helpfulness of the model-generated advice with human-written advice, and provide diagnostic ratings.
We show an overview of our Mechanical Turk task in Figure 2. A worker is given a situation and two pieces of advice. One is the top-scoring advice from Reddit, and the other is model-generated advice; the worker is not told which is which.

Figure 2: Crowdsourcing workflow. Mechanical Turk Workers are given a situation, and two pieces of advice. First, they choose which is more helpful (here, B). Second, they rate the helpfulness of the worse advice (A); last, they answer a diagnostic question.
  Given: Situation, Advice A, Advice B
  1. Which piece of advice is more helpful? [Definitely A | Slightly A | Slightly B | Definitely B]
  2. How helpful is the worse advice (A) to the question-asker? [Slightly helpful | Not helpful | Dangerous]
  3. (if Slightly helpful) Is Advice A worse mainly due to its meaning, or its writing? [Meaning | Writing]
  3. (otherwise) Could Advice A be applicable to (and helpful in) a different situation? [Possibly helpful | Never helpful]

The worker first chooses the more helpful piece of advice, then provides diagnostic information for the less helpful advice – rating it Slightly helpful, Not helpful, or Dangerous. If the worse piece of advice was Slightly helpful, they choose whether it is worse due to a Meaning problem or a Writing problem. Otherwise, they choose if the worse advice could be Possibly helpful in some other situation, or Never helpful in any situation.
Three workers rate each model-situation pair, and ratings are combined using a majority vote. We follow best practices on Mechanical Turk, using a qualification exam, paying workers at least $15 per hour, and giving feedback to workers. Still, evaluation is highly economical at $1.86 per example-model pair, or roughly $400 per model evaluated.

3.2 A large static dataset for training
We present RedditAdvice2019, a large static dataset for training advice-giving models. Because today's models have extreme reliance on data for finetuning, we collect data that is in the exact same format as RedditAdvice, yet we expand our selection criteria, optimizing for recall rather than precision (Supp A.2). In total, we extract 616k pieces of advice, over 188k situations.
To mirror the dynamic nature of the evaluation, in which models are evaluated on situations posted in 2020 and beyond, we split our dataset into static training and validation sets by date.⁴
    ⁴ Our training set contains 600k pieces of advice from July 2009 to June 14, 2019; validation contains 8k from June 14 to July 9th 2019.

4 Experimental Results on RedditAdvice

In this section, we report results from one round of dynamic evaluation on RedditAdvice. We evaluate the following strong NLP models and baselines:
a. Rule-based: a templated system to give legal, relationship, or life advice. The system first randomly chooses an empathetic sentence from ten choices, for example "I'm sorry you're facing this." It then chooses a random piece of advice that is loosely related to the situation's topic; we infer this from the subreddit the situation was posted on. For example, for
LegalAdvice the model might write "I'd suggest getting a lawyer immediately."
b. TF-IDF retrieval: for a new situation, we compute its TF-IDF bag-of-word vector and use it to retrieve the most similar situation from the training set. We then reply with the top-scoring advice for that situation.
c. Grover-Mega (Zellers et al., 2019b): a left-to-right transformer model with 1.5 billion parameters. Grover was pretrained on news articles with multiple fields, perhaps making it a good fit for our task, with multiple fields of context (like the subreddit, date, and title). Our situation-advice pairs are often quite long, so we adapt Grover for length; pretraining it on sequences of up to 1536 characters.
d. T5 (Raffel et al., 2019): a sequence-to-sequence model with a bidirectional encoder and a left-to-right generator, with 11 billion parameters. T5 was trained on a large dataset of cleaned web text. At the time of writing, T5 is the top-scoring model on the GLUE and SuperGLUE benchmarks (Wang et al., 2019b,a), scoring above human performance on GLUE and near human-performance on SuperGLUE.
e. GPT3 (Brown et al., 2020): a left-to-right transformer model with 175 billion parameters. GPT3 must be "prompted" to generate advice since it has not been released for finetuning. We cannot provide few-shot examples in the prompt due to the length of situation-advice pairs; we instead mimic the formatting of a website quoting from Reddit (Appendix B.5).
Last, to quantify the measurement error of our evaluation, we additionally evaluate:
f. the second-highest rated Reddit advice for each situation. We send this advice through the same pipeline as machine-written advice.
We finetune all models (except GPT3) and generate using Nucleus Sampling (Holtzman et al., 2020); more details in Appendix B.
In our study, we exclude purely bidirectional models, such as BERT (Devlin et al., 2019). While these models can be made to generate text, these generations are usually worse than those of left-to-right models (Wang and Cho, 2019). T5 also tends to outperform them, even on discriminative tasks.

4.1 Quantitative results
In Figure 3, we show overall results for one evaluation trial, which featured 200 situations posted on Reddit from October 28 to November 7, 2020. As a key metric for measuring the relative usefulness of model-written advice, we evaluate the frequency by which workers prefer the Reddit-written reference advice over the model-written advice. If a model's advice was just as helpful as human advice in aggregate, then that model would score 50%.

Figure 3: Helpfulness of models relative to top-scoring Reddit advice. We show results over 200 shared situations; we also show bootstrapped 95% confidence intervals. Advice from the best-scoring model, T5-11B, is preferred 14.5% over top-scoring Reddit advice. We also compare the second-top scoring piece of Reddit advice, which scores 41% – worse than the best advice (50% by definition), but better than any model.

Figure 4: Improvement (in absolute percentage %) between pairs of models, along with statistical significance from a paired t-test. The improvement of T5-11B over smaller models like Grover-Mega is highly statistically significant (10% gap, p < .01), while being far worse than human performance. Our evaluation thus meaningfully grades varying levels of performance.

Model performance is quite low. The best model, T5-11B, scores 14.5%, outperforming a smaller Grover-Mega (4.5%); GPT3 does worse at 4.0%.
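To illustrate how this headline preference number and its bootstrapped confidence interval can be computed (an assumed sketch, not the authors' evaluation code; the vote format is hypothetical), each situation's three worker judgments are first combined by majority vote:

```python
import numpy as np

def preference_rate(votes, n_boot=10_000, seed=0):
    """
    votes: per-situation worker judgments, e.g. [[1, 0, 1], ...], where 1 means
           the worker preferred the model's advice over the top-scoring Reddit
           advice. Three workers rate each model-situation pair.
    Returns the majority-vote preference rate and a bootstrapped 95% CI.
    """
    wins = np.array([int(sum(v) >= 2) for v in votes])   # majority of 3 workers
    rate = wins.mean()

    rng = np.random.default_rng(seed)
    resampled = rng.choice(wins, size=(n_boot, len(wins)), replace=True).mean(axis=1)
    lo, hi = np.percentile(resampled, [2.5, 97.5])
    return rate, (lo, hi)

# Toy example with 200 situations (random placeholder data, not real ratings).
rng = np.random.default_rng(1)
fake_votes = rng.integers(0, 2, size=(200, 3)).tolist()
rate, (lo, hi) = preference_rate(fake_votes)
print(f"model preferred {rate:.1%} of the time (95% CI {lo:.1%}-{hi:.1%})")
```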
The rule-based and TF-IDF baselines are competitive at 2.5% and 4.0% accuracy respectively.
As additional comparison to the 50% upper bound, the second-highest scoring Reddit advice scores 41%. This suggests that our workers often prefer the same advice as Reddit users.

4.1.1 Measurement error
To investigate the measurement error of our evaluation, in Figure 4 we report the statistical significance between pairs of models; details about how this is computed are in Appendix C. We observe a large gap in performance between T5 and the other baselines. For example, its improvement over Grover-Mega is 10%, which is highly statistically significant. On the other hand, the differences in performance between other models are more minor – GPT3 does not outperform TF-IDF, and though it outperforms the rule-based system by 1.5%, it is only somewhat statistically significant.
Overall, the statistical significance results suggest that our evaluation can stably rank model performance. This, along with the finding that model performance is low on our task, suggests that there is ample room for growth on RedditAdvice.

5 Analysis and Discussion

So far, we have shown that we are able to reliably evaluate models in our dynamic setup, and that doing so results in model performance that is significantly lower than human performance.
To break down what this gap in performance means, we show a qualitative example in Figure 5. A user is asking for online legal advice about being stuck at work for their entire 4pm-midnight shift – with no eating allowed due to COVID-19. The top-rated Reddit advice understands this situation and then offers advice, suggesting the advice-seeker organize alongside other workers – as "New Jersey doesn't require employers to give any meal breaks."
Machine advice seems to misunderstand the issue. T5 asks if there is "a reason that you can't leave the property," even though this reason is stated in the situation. GPT3's advice is self-contradictory; it also makes something up about a "restaurant" in the building.

Figure 5: Qualitative example; more in Supp. F. Though machine-generated advice matches keywords from the situation, it is frequently not helpful or even self-contradictory. The issues are due to critical errors in natural language understanding, such as reading comprehension, entailment, and coreference.
  Situation (legaladvice): NJ Work will not let us eat in the building due to covid outbreak. We can't leave the property for breaks. They have a tent outside for us to eat in but it's getting cold out. I am a manager at my work part of my responsibilities is to stay on the property during my entire shift we are given a 1 hour paid lunch break that we do not have to clock out for. However due to recent covid outbreak the new policy is no eating in the building. They had set up a tent outside for us to eat in but it is starting to get cold out as well there are not heaters in there. My question is can they legally do this to us? Also I work 4:00 p.m. to midnight shift.
  Top Reddit advice: So by "clock out for" you are not a salaried exempt employee? Do the workers under you have a union? Aside from the cold, New Jersey doesn't require employers to give *any* meal breaks for employees 18 or older, so they could take away the breaks and the tent if they wanted, and just say "no eating" (though water is required). What is your relationship with the other managers, and room to figure this out, ask for heaters, etc.?
  Rule-Based: If I were you, I'd get a lawyer immediately. [Definitely less helpful; Slightly helpful; Meaning problem]
  TF-IDF Retrieval: You were blackout drunk, which means that you were not capable of consent. . . . If anyone should feel violated, it's you. [Definitely less helpful; Not helpful; Possibly helpful]
  Grover-Mega: If you are required to be on the premises, they are required to provide you with a place to eat. https://www.nj.gov/labor/es/faq.html#13 [Definitely less helpful; Not helpful; Possibly helpful]
  T5-11B: If your shift is from 4pm to midnight, is there a reason that you can't leave the property? [Definitely less helpful; Not helpful; Never helpful]
  GPT3-175B: A place of employment is considered the "premises". You can't just assume that because you can't walk out the door that you can eat wherever you want. This is work hours. You're not in your home. The restaurant, or at least the food, is all provided for you for your benefit. But if there are other people there, then I could see how it could be a safety issue. [Definitely less helpful; Not helpful; Never helpful]

5.1 Problems with machine-written advice
As part of our evaluation, we wish to quantitatively measure problems with machine-written advice. Recall that in our crowdsourcing setup (Section 3.1.3), three workers select which advice is better, and then annotate problems with the worse piece of advice. We found workers had high agreement during the diagnostic annotation.⁵
    ⁵ For classifying machine-written advice as 'helpful' versus 'not helpful' or 'dangerous' (combining the two latter categories into one), we have κ = 0.689. For breaking down helpful advice into a 'meaning problem' versus a 'writing problem', we have Cohen's κ = 0.613; for rating unhelpful advice as 'possibly helpful' versus 'never helpful,' we have κ = 0.602.
In Figure 6, we show the distribution of the ratings for model-written, versus human-written advice. Machine-written advice that was
not preferred over human-written advice can have the following ratings. It can be rated as Slightly helpful (but, was rated as worse mainly due to a Meaning problem or Writing problem), as Not helpful, or Dangerous.

Figure 6: Distribution of ratings for three models: TF-IDF retrieval, GPT3, and T5, along with ratings for the second-best rated Reddit advice. Though deep generators like GPT3 and T5 are often preferred over the retrieval baseline, they also often write advice that would never be helpful (33% GPT3, 13% T5), and that is racist, sexist, or otherwise dangerous (10% GPT3, 3% T5).

The diagnostics show several patterns. First, all models frequently commit natural language understanding errors, such as internal contradiction. Because of this, we find that TF-IDF bag-of-words retrieval is competitive with large generators. While retrieved advice is often irrelevant (66% of the time), it is almost never complete gibberish, as it comes from top-scoring advice. Only 10% of workers rated this advice as Not helpful for any situation, less than T5.
Second, they suggest that models struggle even more without finetuning. A GPT3 model with careful prompting generates language that is Dangerous 10% of the time. These qualitative and quantitative results confirm a pattern observed by many others, that large language models like GPT3 often generate explicitly racist and sexist language out-of-the-box (Sheng et al., 2019; Gehman et al., 2020; Bender et al., 2021, among others). We explore this further in Supplemental F. This is perhaps worrying, since GPT3 is presently being commercialized.

5.2 A Leaderboard for Advice Evaluation
So far, we have shown results from one evaluation round; a second is in Supplemental D. We propose a dynamic leaderboard to keep that evaluation ongoing, at rowanzellers.com/advice.
Users submit a model API to be dynamically evaluated. Each new model, along with the highest rated previously-evaluated model, will be evaluated for an additional round using the same approach. The cost of each evaluation is reasonable (Section 3.1.4), which we authors will pay in the short term. An alternative strategy requires submitters to pay the Mechanical Turk fees themselves; this model was used for the HYPE leaderboard in computer vision (Zhou et al., 2019).

5.3 Relation to existing NLP tasks
Shared "core" tasks such as reading comprehension and natural language inference are of considerable interest to the NLP community. Many datasets have been proposed for these tasks, and progress on them is often measured through auto-gradeable correctness metrics. However, large models have started to outperform humans on these datasets, raising doubt that further progress on them brings us closer to human-level language understanding.
We argue two things: first, that many NLP tasks are necessary components of giving advice, and second, that because giving advice remains far from solved, these tasks are also far from solved. In Appendix F, we study problems with advice from T5-11B from the point of view of existing NLP tasks. For instance, machine advice often contradicts itself, suggesting that today's systems struggle with the general task of natural language inference. We have made these diagnostics publicly available to enable progress on automatically spotting these mistakes.

6 Conclusion; Ethical Considerations

We introduced new methodology for evaluating language tasks, reducing the gap between benchmarks and the real world. We also introduced a new challenge for the community, TuringAdvice, with an accompanying dataset and dynamic leaderboard.
Yet, if our field is to progress towards NLP models that 'understand natural language,' we should be cognizant of the impact that such technology
might have on society. In this paper, we presented a sketch of NLP models helping people who need advice on sensitive topics, which could be a measurable goal for the field.
At the same time, we do not claim that our approach is a panacea. There are almost certainly better non-technical solutions to ensure mentorship and legal advice for all (Green, 2019). Moreover, there are significant dual-use risks with models that understand language (Hovy and Spruit, 2016; Green and Viljoen, 2020). Our evaluation measures some risks of generative models – such as the tendency to generate toxic language – but more work in this area is needed.

Acknowledgements

Thanks to the Reddit users who participate in its advice subreddits – from asking for help, to writing (and voting on) helpful advice. Thanks to the Mechanical Turk workers who performed the annotation for our experiments. Thanks also to the three anonymous reviewers, along with Katharina Reinecke, Oren Etzioni, Hannah Rashkin, Maarten Sap, Maxwell Forbes, Jesse Thomason, Daniel Khashabi, Gabriel Ilharco, Swabha Swayamdipta, and Yonatan Bisk, for feedback. This research was supported in part by NSF (IIS-1524371, IIS-1714566), DARPA under the CwC program through the ARO (W911NF-15-1-0543), DARPA under the MCS program through NIWC Pacific (N66001-19-2-4031), and the NSF-GRFP No. DGE-1256082.

References

Daniel Adiwardana, Minh-Thang Luong, David R So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, et al. 2020. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977.

Ashton Anderson, Daniel P. Huttenlocher, Jon M. Kleinberg, and Jure Leskovec. 2012. Effects of user similarity in social media. In WSDM '12.

Yoav Artzi, Nicholas FitzGerald, and Luke S Zettlemoyer. 2013. Semantic parsing with combinatory categorial grammars. ACL (Tutorial Abstracts), 3.

Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 1–61, Florence, Italy. Association for Computational Linguistics.

Emily M Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 610–623.

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. 2020. Piqa: Reasoning about physical commonsense in natural language. In Thirty-Fourth AAAI Conference on Artificial Intelligence.

Silvia Bonaccio and Reeshad S. Dalal. 2006. Advice taking and decision-making: An integrative literature review, and implications for the organizational sciences. Organizational Behavior and Human Decision Processes, 101(2):127–151.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 632–642.

Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.

Ping Chen, Fei Wu, Tong Wang, and Wei Ding. 2018. A semantic qa-based approach for text summarization evaluation. In Thirty-Second AAAI Conference on Artificial Intelligence.

Council of Europe. 2001. Common European Framework of Reference for Languages: learning, teaching, assessment. Cambridge University Press.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The pascal recognising textual entailment challenge. In Machine learning challenges. evaluating predictive uncertainty, visual object classification, and recognising tectual entailment, pages 177–190. Springer.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
Matan Eyal, Tal Baumel, and Michael Elhadad. 2019. Question answering as an automatic evaluation metric for news article summarization. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3938–3948.

Michael C. Frank and Noah D. Goodman. 2012. Predicting pragmatic reasoning in language games. Science, 336(6084):998–998.

Matt Gardner, Jonathan Berant, Hannaneh Hajishirzi, Alon Talmor, and Sewon Min. 2019. On making reading comprehension more comprehensive. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pages 105–112, Hong Kong, China. Association for Computational Linguistics.

Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. 2020. RealToxicityPrompts: Evaluating neural toxic degeneration in language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 3356–3369.

Dave Golland, Percy Liang, and Dan Klein. 2010. A game-theoretic approach to generating spatial descriptions. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 410–419. Association for Computational Linguistics.

Venkata Subrahmanyan Govindarajan, Benjamin Chen, Rebecca Warholic, Katrin Erk, and Junyi Jessy Li. 2020. Help! Need advice on identifying advice. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5295–5306.

Ben Green. 2019. “Good” isn't good enough. In Proceedings of the AI for Social Good workshop at NeurIPS.

Ben Green and Salomé Viljoen. 2020. Algorithmic realism: Expanding the boundaries of algorithmic thought. In Proceedings of the ACM Conference on Fairness, Accountability, and Transparency (FAT*).

Danna Gurari, Qing Li, Abigale J. Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P. Bigham. 2018. VizWiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3608–3617.

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In Proc. of NAACL.

Tatsunori Hashimoto, Hugh Zhang, and Percy Liang. 2019. Unifying human and statistical evaluation for natural language generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1689–1701.

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration. In ICLR.

Dirk Hovy and Shannon L. Spruit. 2016. The social impact of natural language processing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 591–598.

Aaron Jaech, Victoria Zayats, Hao Fang, Mari Ostendorf, and Hannaneh Hajishirzi. 2015. Talking to the crowd: What do people react to in online discussions? In EMNLP.

Satwik Kottur, José Moura, Stefan Lee, and Dhruv Batra. 2017. Natural language does not emerge ‘naturally’ in multi-agent dialog. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2962–2967, Copenhagen, Denmark. Association for Computational Linguistics.

Wojciech Kryscinski, Nitish Shirish Keskar, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. Neural text summarization: A critical evaluation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 540–551, Hong Kong, China. Association for Computational Linguistics.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics.

Himabindu Lakkaraju, Julian J. McAuley, and Jure Leskovec. 2013. What's in a name? Understanding the interplay between titles, content, and communities in social media. In ICWSM.

Angeliki Lazaridou, Alexander Peysakhovich, and Marco Baroni. 2017. Multi-agent cooperation and the emergence of (natural) language. In ICLR.

Ronan Le Bras, Swabha Swayamdipta, Chandra Bhagavatula, Rowan Zellers, Matthew E. Peters, Ashish Sabharwal, and Yejin Choi. 2020. Adversarial filters of dataset biases. ArXiv, abs/2002.04108.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
James L. McClelland, Felix Hill, Maja Rudolph, Jason Baldridge, and Hinrich Schütze. 2019. Extending machine language models toward human-level language understanding. arXiv preprint arXiv:1912.05877.

Lev Muchnik, Sinan Aral, and Sean J. Taylor. 2013. Social influence bias: A randomized experiment. Science, 341(6146):647–651.

Fernando Pereira. 2000. Formal grammar and information theory: together again? Philosophical Transactions of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences, 358(1769):1239–1253.

Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang. 2012. CoNLL-2012 shared task: Modeling multilingual unrestricted coreference in OntoNotes. In Joint Conference on EMNLP and CoNLL-Shared Task, pages 1–40. Association for Computational Linguistics.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv e-prints.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392.

Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. 2019. Social IQa: Commonsense reasoning about social interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4453–4463.

Thomas Scialom, Sylvain Lamprier, Benjamin Piwowarski, and Jacopo Staiano. 2019. Answers unite! Unsupervised metrics for reinforced summarization models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3246–3256, Hong Kong, China. Association for Computational Linguistics.

Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 464–468.

Noam Shazeer and Mitchell Stern. 2018. Adafactor: Adaptive learning rates with sublinear memory cost. In International Conference on Machine Learning, pages 4603–4611.

Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. 2019. The woman worked as a babysitter: On biases in language generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3407–3412, Hong Kong, China. Association for Computational Linguistics.

Naftali Tishby, Fernando C. Pereira, and William Bialek. 1999. The information bottleneck method. In Proc. of the 37th Annual Allerton Conference on Communication, Control and Computing, pages 368–377.

Antonio Torralba and Alexei A. Efros. 2011. Unbiased look at dataset bias. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1521–1528. IEEE.

Alan M. Turing. 1950. Computing Machinery and Intelligence. Mind, LIX(236):433–460.

Oleg Vasilyev, Vedant Dharnidharka, and John Bohannon. 2020. Fill in the BLANC: Human-free quality estimation of document summaries. arXiv preprint arXiv:2002.09836.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 6000–6010. Curran Associates Inc.

Anu Venkatesh, Chandra Khatri, Ashwin Ram, Fenfei Guo, Raefer Gabriel, Ashish Nagar, Rohit Prasad, Ming Cheng, Behnam Hedayatnia, Angeliki Metallinou, et al. 2018. On evaluating and comparing open domain dialog systems. arXiv preprint arXiv:1801.03625.

Alex Wang and Kyunghyun Cho. 2019. BERT has a mouth, and it must speak: BERT as a Markov random field language model. In Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation, pages 30–36.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019a. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 3261–3275. Curran Associates, Inc.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019b. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of ICLR.
Sida I. Wang, Percy Liang, and Christopher D. Manning. 2016. Learning language games through interaction. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2368–2378.

Ludwig Wittgenstein. 1953. Philosophical Investigations. Wiley-Blackwell.

Dani Yogatama, Cyprien de Masson d'Autume, Jerome Connor, Tomas Kocisky, Mike Chrzanowski, Lingpeng Kong, Angeliki Lazaridou, Wang Ling, Lei Yu, Chris Dyer, et al. 2019. Learning and evaluating general linguistic intelligence. arXiv preprint arXiv:1901.11373.

Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 93–104, Brussels, Belgium. Association for Computational Linguistics.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019a. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, Florence, Italy. Association for Computational Linguistics.

Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. 2019b. Defending against neural fake news. In Advances in Neural Information Processing Systems 32.

Sharon Zhou, Mitchell Gordon, Ranjay Krishna, Austin Narcomey, Li Fei-Fei, and Michael Bernstein. 2019. HYPE: A benchmark for human eye perceptual evaluation of generative models. In Advances in Neural Information Processing Systems, pages 3444–3456.

Appendix

We provide the following items in the appendix:

• Dataset filtering criteria (Section A)
• Baseline model details (Section B)
• Computing statistical significance (Section C)
• Results from a different round of dynamic evaluation (Section D)
• Miscellaneous analysis (Section E)
• Additional qualitative examples (Section F)

For more up-to-date information, visit the project page and dynamic leaderboard at rowanzellers.com/advice.

[Figure 7: two log-scale histograms of length in spaCy tokens (x-axis: Length (spaCy tokens); y-axis: Frequency), comparing RedditAdvice situations against HellaSwag, GLUE, and SuperGLUE on the left, and showing RedditAdvice advice on the right.]

Figure 7: Length distribution of RedditAdvice, compared with other common NLU benchmarks (HellaSwag, Zellers et al., 2019a; GLUE, Wang et al., 2019b; SuperGLUE, Wang et al., 2019a). The examples in RedditAdvice are significantly longer, representing highly complex situations.
A Dataset Filtering Criteria

We discuss the criteria by which we extract situations and advice, both for our dynamic dataset RedditAdvice and for our static training dataset RedditAdvice2019.

A.1 Dynamic Filtering Criteria for RedditAdvice

We use the following selection criteria for retrieving situations, along with the top-scoring advice, from Reddit. Using the Reddit API, we loop through Reddit posts, which might contain valid situations. We perform several checks on each post to ensure that we can reliably extract a situation from it, as well as a top-scoring piece of advice from its comments.

We do the following to retrieve situations:
a. We iterate through the top-scoring posts that were posted between 36 hours and two weeks ago on the following advice subreddits: Relationships, Advice, NeedAdvice, Dating_Advice, Dating, Love, Marriage, InternetParents, TechSupport, and LegalAdvice.
b. We skip ‘update’ posts, in which a user refers to an older situation that they posted, and ‘meta’ posts, in which subreddit rules are discussed.
c. We skip any post that has an HTML link, since today's models (presumably) would not be able to visit such a link.
d. We skip any post with a score of less than 20.
e. We do our best to clean the text of the post. Many posts include valid situations, but are then edited to include updates that took place afterwards, in response to advice that was given. These are typically delimited by dashed lines and the word EDIT or UPDATE.
f. Posts in some of the subreddits (Dating_Advice, Dating, Love, Marriage) are often in the form of tips and general suggestions, rather than situations. We skip any posts from these subreddits that do not include a question mark.
g. We filter out posts that contain sensitive topics, such as assault, suicide, and abuse.
h. Last, we skip any post that is shorter than 128 spaCy tokens or longer than 1280 spaCy tokens in total.

For a retrieved situation, we do the following to extract valid advice:
a. Given a post that contains a valid situation, we order the comments from highest to lowest scoring. We perform the following checks to determine whether we can extract valid advice; once we find valid advice, we stop iterating.
b. We skip any comment that was posted by a moderator or by the Reddit user who posted the original situation, or that was edited.
c. We skip any comment with a score of less than 20.
d. We skip any comment that contains fewer than 32 spaCy tokens.
e. One corner case is highly-scoring advice comments that refer implicitly to others. For instance, a comment might say ‘You should listen to the other commenters and...’ These references make sense inside a Reddit post, but they are somewhat nonsensical when we pull the comment out of context. We thus skip any comment that seems to refer to other comments.

Once we retrieve a situation that has at least one piece of valid advice, we are done and move on to the next situation. We loop over the 1000 top-scoring posts in total, and randomly select 200 valid situations from this pool.
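Below is a minimal, illustrative sketch of these filtering rules in Python. It is not the exact code behind RedditAdvice: the Post and Comment containers, the sensitive-topic keyword list, the EDIT/UPDATE marker pattern, and the check for comments that reference other commenters are all simplifying assumptions, and criterion (a) for situations (retrieving the top-scoring posts from the listed subreddits within the 36-hour-to-two-week window) is assumed to happen when top_posts is collected via the Reddit API. Only the thresholds (post and comment scores of at least 20, situations of 128–1280 spaCy tokens, advice of at least 32 spaCy tokens, 1000 candidate posts, 200 sampled situations) come directly from the criteria above.

```python
# Hypothetical sketch of the Section A.1 filtering criteria; field names,
# keyword lists, and regexes are illustrative assumptions, not the released code.
import random
import re
from dataclasses import dataclass, field
from typing import List, Optional

import spacy

nlp = spacy.blank("en")  # tokenizer-only pipeline; we only need token counts

TIP_STYLE_SUBREDDITS = {"Dating_Advice", "Dating", "Love", "Marriage"}
SENSITIVE_KEYWORDS = {"assault", "suicide", "abuse"}  # assumed keyword filter (criterion g)
EDIT_MARKER = re.compile(r"^\s*-{3,}|^\s*(EDIT|UPDATE)\b", re.MULTILINE)
LINK_PATTERN = re.compile(r"https?://")


@dataclass
class Comment:
    text: str
    score: int
    edited: bool
    author_is_moderator: bool
    author_is_op: bool


@dataclass
class Post:
    subreddit: str
    title: str
    text: str
    score: int
    is_update_or_meta: bool
    comments: List[Comment] = field(default_factory=list)


def n_tokens(text: str) -> int:
    return len(nlp(text))


def clean_situation(text: str) -> str:
    """Drop trailing EDIT/UPDATE blocks appended after advice was given (criterion e)."""
    match = EDIT_MARKER.search(text)
    return text[: match.start()].strip() if match else text.strip()


def extract_situation(post: Post) -> Optional[str]:
    """Situation criteria b-h; criterion a is assumed handled when fetching posts."""
    if post.is_update_or_meta:                                       # b
        return None
    if LINK_PATTERN.search(post.text):                               # c
        return None
    if post.score < 20:                                              # d
        return None
    text = clean_situation(post.text)                                # e
    if post.subreddit in TIP_STYLE_SUBREDDITS and "?" not in text:   # f
        return None
    lowered = (post.title + " " + text).lower()
    if any(word in lowered for word in SENSITIVE_KEYWORDS):          # g
        return None
    if not 128 <= n_tokens(text) <= 1280:                            # h
        return None
    return text


def extract_advice(post: Post) -> Optional[str]:
    """Advice criteria a-e: return the highest-scoring comment passing all checks."""
    for comment in sorted(post.comments, key=lambda c: c.score, reverse=True):
        if comment.author_is_moderator or comment.author_is_op or comment.edited:  # b
            continue
        if comment.score < 20:                                       # c
            continue
        if n_tokens(comment.text) < 32:                              # d
            continue
        if "other commenter" in comment.text.lower():                # e (crude reference check)
            continue
        return comment.text
    return None


def build_evaluation_pool(top_posts: List[Post], k: int = 200) -> List[dict]:
    """Loop over the 1000 top-scoring posts and sample 200 valid situations."""
    valid = []
    for post in top_posts[:1000]:
        situation = extract_situation(post)
        if situation is None:
            continue
        advice = extract_advice(post)
        if advice is None:
            continue
        valid.append({"situation": situation, "advice": advice})
    return random.sample(valid, min(k, len(valid)))
```

Using the same spaCy tokenizer for the 128/1280-token situation bounds and the 32-token advice minimum keeps these checks consistent with the lengths reported in Figure 7, which are also measured in spaCy tokens.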