TuringAdvice: A Generative and Dynamic Evaluation of Language Use


Rowan Zellers♠  Ari Holtzman♠  Elizabeth Clark♠  Lianhui Qin♠  Ali Farhadi♠  Yejin Choi♠♥
♠Paul G. Allen School of Computer Science & Engineering, University of Washington
♥Allen Institute for Artificial Intelligence
rowanzellers.com/advice

Abstract

We propose TuringAdvice, a new challenge task and dataset for language understanding models. Given a written situation that a real person is currently facing, a model must generate helpful advice in natural language. Our evaluation framework tests a fundamental aspect of human language understanding: our ability to use language to resolve open-ended situations by communicating with each other.
Empirical results show that today's models struggle at TuringAdvice, even multibillion parameter models finetuned on 600k in-domain training examples. The best model, a finetuned T5, writes advice that is at least as helpful as human-written advice in only 14% of cases; a much larger non-finetunable GPT3 model does even worse at 4%. This low performance reveals language understanding errors that are hard to spot outside of a generative setting, showing much room for progress.

Figure 1: TuringAdvice. Humans are natural experts at using language to successfully address situations that arise, such as giving advice. We introduce a new framework, dataset, and leaderboard to generatively evaluate real-world language use. Today's most powerful models – which obtain near-human or superhuman performance on core NLP benchmarks for reading comprehension, natural language inference, and commonsense reasoning – struggle with all of these capabilities when generating advice, as highlighted in red.
  Situation: "I have to do a dissection for my high school class, but I'm distressed by dead animals. Last time we dissected an animal in class, I had a panic attack. I asked my teacher for another assignment, but she refused. I don't want to play a 'victim' card, but I don't know what to do. Help!"
  Helpful advice: "I'd send a short email to the next higher-up authority figure, ideally a counselor. Be forthright; it's the best approach when self-advocating as a student."
  T5's advice (not helpful): "Go to your teacher and say "I'm asking you to do a project that requires me to see dead animals. This is a dealbreaker." If she doesn't concede, tell your principal about your trauma."

1 Introduction

Language models today are getting ever-larger, and are being trained on ever-increasing quantities of text. For an immense compute cost, these models like T5 (Raffel et al., 2019) and GPT3 (Brown et al., 2020) show gains on a variety of standard NLP benchmarks – often even outperforming humans.
Yet, when a giant model like T5 generates language, we observe clear gaps between machine-level and human-level language understanding – even after it has been finetuned for the task at hand. Consider Figure 1, in which a woman asks for advice. She is assigned to dissect an animal for her class project, but has extreme anxiety about dead animals – and her teacher refused to give her another assignment. Humans can respond with helpful advice, reflecting our unique ability of real-world language use: to communicate and tackle open-ended issues. The helpful advice in this example - but not the only one possible - suggests that she send a short email to her guidance counselor.
On the other hand, not only is T5's advice unhelpful, it also reveals key misunderstandings of the situation. It seems to believe that the student is asking the teacher to do a class project involving dead animals. This reading comprehension error is particularly strange, as T5 outperforms humans on a variety of reading comprehension benchmarks. Others in the community have observed similar issues, raising concerns about what today's benchmark datasets measure (Yogatama et al., 2019; Kryscinski et al., 2019; McClelland
et al., 2019; Gardner et al., 2019).
We argue that there is a deep underlying issue: a gap between how humans use language in the real world, and what benchmarks today can measure. Today's dominant paradigm is to study static datasets, and to grade machines by the similarity of their output with predefined correct answers. For example, we score multiple choice exams by how often the correct answers are chosen, and evaluate generative tasks like machine translation by similarity with respect to correct translations. However, when we use language in the real world to communicate with each other – such as when we give advice, or teach a concept to someone – there is rarely a universal correct answer to compare with, just a loose goal we want to achieve.
We introduce a framework to narrow this gap between benchmarks and real-world language use. We propose to evaluate machines by their success in using language to (1) communicate with humans in (2) tackling complex, open-ended, real-world situations. Our goal is a machine that, like a human, can generate language that is useful and helpful. Doing so necessarily requires a deep understanding of language and the world, as per a line of thought that the complete meaning representation is one that suffices to complete a task (Artzi et al., 2013).
As a case-study of our framework, we introduce TuringAdvice as a new grand challenge for AI systems. A machine reads a situation written by a person seeking advice, like Figure 1, and must then write advice that is helpful to the advice-seeker. Like a Turing Test (Turing, 1950), we establish a simple condition required for a model to 'pass': model-generated advice must be at least as helpful to the advice-seeker as human-written advice.
We make our challenge concrete by introducing a new dataset, RedditAdvice, and accompanying leaderboard. We tie our dataset to the Reddit community, which resolves two additional sources of bias. First, Reddit users are intrinsically motivated, seeking advice about highly complex real issues – which past work suggests differ from hypothetical issues that crowd workers might come up with (e.g. Kwiatkowski et al., 2019; Gurari et al., 2018). Second, we make our dataset dynamic, not static – models are evaluated over Reddit situations posted over the previous two weeks at the time of submission. Models therefore, like humans, must generalize to new situations and patterns of language.
Experimental results show that TuringAdvice is incredibly challenging for NLP models. Today's largest finetunable model, T5 with 11 billion parameters, produces advice that is preferable to human-written advice 14.5% of the time – after being finetuned on 600k examples. GPT3, an even larger model with 175 billion parameters that was not released for finetuning, does even worse at 4%. Even more concerning, our evaluation finds that it often generates hateful and toxic language.
We also study our task from the perspective of today's standard 'core' NLP tasks. Broadly, we find that machines frequently confuse who is who, are self-contradictory, or seem to miss important world knowledge. However, these mistakes tend not to fall into the neat categories defined by standard task definitions. We address this by introducing diagnostic questions, which systematically measure these language understanding errors.
In summary, our paper makes three contributions. First, we introduce a new framework for measuring language understanding through directly tackling real-world language problems. Second, we introduce TuringAdvice as a new challenge for AI systems, along with a dynamic dataset and leaderboard. Third, we connect our task to existing atomic NLP tasks, introducing a new setting that reveals where progress is still needed.

2 Real World Language Use

We propose to evaluate machines by their success at real-world language use: using language to communicate with a human, in response to a naturally occurring situation, in order to achieve a desired outcome. This is how educators often measure (human) language understanding of a second language – by how well the learner can use the language (Council of Europe, 2001). Our approach is also inspired by Wittgenstein's notion of semantics, that "meaning is use:" language is grounded in our desire to make sense of one another and cooperate to meet our needs (Wittgenstein, 1953).
As machines do not have humanlike needs or desires, we propose to evaluate machines' success at a task by how well it serves a human who is interested in the outcome. For example, if a machine orders food on my behalf, then I can evaluate it based on whether I enjoy the dish it ordered. Though this requires careful task selection in order to make things feasible for current models, as we will show in Section 3, it results in a powerful and reliable human evaluation.
2.1 Related work

2.1.1 Pragmatics in NLP
Our evaluation relates to pragmatics in NLP, where communication is modeled also through listeners and speakers (Golland et al., 2010; Frank and Goodman, 2012). One approach is to introduce a communication game, with an explicit objective. For example, Wang et al. (2016) study a blocks world where humans give commands to a block-placing machine. The machine is then graded on accuracy. Our proposed evaluation instead covers complex everyday scenarios faced by a human, where the objective is to help them as much as possible.
Pragmatics can also be studied through machine-machine communication; e.g., through emergent language (Lazaridou et al., 2017). Recent work uses pretrained question-answering models to evaluate summarization models (Chen et al., 2018; Scialom et al., 2019; Eyal et al., 2019; Vasilyev et al., 2020). However, ensuring that machines communicate in standard English is difficult, as there is usually a more efficient machine-language coding scheme for the task (Kottur et al., 2017).

2.1.2 Two major approaches for evaluation
Today, we see two major approaches for NLP evaluation, which we discuss below.
Quality of generations. The first approach studies generative tasks like chit-chat dialogue or story-writing, and measures the inherent quality of generations, often through attributes such as "sensibleness" and "specificity" (e.g., Venkatesh et al., 2018; Hashimoto et al., 2019; Adiwardana et al., 2020). This approach is orthogonal to ours: though these attributes might be desirable, they are often insufficient to guarantee success at a task.
Correctness. The second (and perhaps more common) approach is to evaluate models through correctness over static datasets. For example, machines can be graded by the similarity of their generated translation to correct translations,¹ or, by how often they choose the correct answer on a multiple choice exam. Many goal-oriented dialogue and semantics tasks are also evaluated in this way, as a model is evaluated by whether it makes the correct API call, or produces a correct parse.
    ¹ Models submitted to the 2019 Conference on Machine Translation were evaluated (by humans) on how well the model's translations agreed with either (1) human-written translations, or, (2) original source text (Barrault et al., 2019).
Since many language tasks cannot be evaluated through correctness, researchers often introduce proxy tasks that are easy to evaluate, while (hopefully) correlating with the underlying true task. For example, SWAG (Zellers et al., 2018) is a multiple-choice proxy task and dataset introduced to study the true task of commonsense reasoning.
However, there are gaps between datasets for proxy tasks (e.g. multiple choice), and the core tasks they seek to represent (e.g. commonsense reasoning), which we discuss in the next sections.

2.2 Can language use really be measured through correctness over proxy tasks?
When we reduce a complex language task to a simplified setup, with a small label space (like multiple-choice classification), we run the risk of introducing artifacts and biases: patterns that can be exploited in the simplified setup, but that are not representative of the true task (Gururangan et al., 2018; Zellers et al., 2019a). Artifacts can enable machines to even outperform humans at the final benchmark, without solving the underlying task.
While the problem of artifacts has recently taken the spotlight in the NLP community, partially because large Transformers (Vaswani et al., 2017) excel at picking up on artifacts, there is a deeper underlying issue. One way to view simplified tasks is that in order to correctly map inputs X to labels Y, a machine must learn a set of attributes A that are representative of the 'true' task. We can upper-bound the information contained by A through the information bottleneck principle of Tishby et al. (1999). An efficient model minimizes the following, for some β > 0:

    min_{p(a|x)}  I(X; A) − β I(A; Y),                          (1)

where I is mutual information. In other words, the model will learn attributes A that maximally compress the inputs X (minimizing I(X; A)), while also remaining good predictors of the labels Y (maximizing I(A; Y)). However, the label prediction term is bounded by the information (or entropy, H) of the label space:

    I(A; Y) = H(Y) − H(Y|A) ≤ H(Y).                             (2)

Thus, for a task with a small label space, there is no guarantee that a model will learn high-information content attributes. Models are in fact encouraged to overfit to dataset artifacts, and to unlearn linguistically useful information that is not directly relevant to predicting Y (Pereira, 2000).
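To make the bound in Equation (2) concrete, here is a minimal, purely illustrative sketch (not part of the paper's experiments; the toy joint distribution and all names are made up) that computes H(Y) and I(A; Y) for a discrete attribute A and a two-way label space, confirming that no learned attribute can carry more than H(Y) bits about the labels:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in bits) of a discrete distribution p."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def mutual_information(joint):
    """I(A; Y) in bits, given a joint probability table joint[a, y]."""
    pa = joint.sum(axis=1, keepdims=True)   # marginal p(a)
    py = joint.sum(axis=0, keepdims=True)   # marginal p(y)
    mask = joint > 0
    return float((joint[mask] * np.log2(joint[mask] / (pa @ py)[mask])).sum())

# Toy setup: 8 possible attribute values A, 2 labels Y (a binary-choice task).
rng = np.random.default_rng(0)
joint = rng.random((8, 2))
joint /= joint.sum()                        # normalize into a joint p(a, y)

h_y = entropy(joint.sum(axis=0))            # H(Y): at most 1 bit for 2 labels
i_ay = mutual_information(joint)            # I(A; Y) = H(Y) - H(Y|A)
print(f"H(Y)    = {h_y:.3f} bits")
print(f"I(A; Y) = {i_ay:.3f} bits  <=  H(Y)")
```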
An alternate approach is to make datasets harder adversarially, so as to have fewer artifacts (Zellers et al., 2018, 2019a; Le Bras et al., 2020). However, it might be impossible to make a dataset with no artifacts, or to know if one has been created.
Our proposal, to evaluate models by their real-world language use, addresses the information bottleneck issue in two ways. First, when we use language in the real world, the mapping between possible inputs and outputs is often highly complex. For example, the space of possible advice is vast, and many pieces of advice might be equally helpful given a situation. Second, we directly tackle language problems, without introducing a correctness-based proxy that machines might overfit to.

2.3 Static datasets in a dynamic world
To evaluate performance on a real-world task by means of a dataset, we (implicitly) assume that the dataset is a good representation of the world (Torralba and Efros, 2011). This might be questionable when it comes to real-world language use, as static datasets necessarily capture historic patterns of language. For instance, syntactic understanding is often evaluated using the Penn Treebank, with news articles from 1989 (Marcus et al., 1993). However, the world is constantly evolving, along with the language that we use.
To bridge this gap, we propose to evaluate machines by their interactions with humans in the present. Models therefore must learn to perform the underlying language task, even for novel situations, rather than fitting to the historic distribution of a fixed test set. We make this notion concrete in the next section, where we introduce a dynamic dataset and leaderboard for evaluating advice.

3 TuringAdvice: a New Challenge for Natural Language Understanding

As a case study of our framework, we introduce TuringAdvice, a new challenge task for AI systems to test language understanding. The format is simple: given a situation expressed in natural language, a machine must respond with helpful advice. To pass the challenge, machine-written advice must be at least as helpful to the advice-seeker as human-written advice, in aggregate.
We focus on advice for a few reasons. First, advice-giving is both an important and an everyday task. People ask for and give advice in settings as diverse as relationship advice and tech support (Bonaccio and Dalal, 2006). Thus, we as humans have inherent familiarity with the task, and what it means for advice to be helpful – making it easy to evaluate, as we later show empirically. Moreover, because there are many internet communities devoted to advice-giving, training data is plentiful.
Second, the framework of advice-giving allows us to study subtasks such as reading comprehension and natural language inference (Section 5.3); we argue both of these are needed to consistently give good advice. Learning to recognize advice has recently been studied as an NLP task on its own (Govindarajan et al., 2020), though we are not aware of past work in learning to generate advice.

3.1 RedditAdvice: A dynamic dataset for evaluating advice
We propose to evaluate models dynamically, through new situations and advice that are posted to Reddit. We call our dynamic dataset RedditAdvice. Many of Reddit's subcommunities (or 'subreddits') are devoted to asking for and giving advice, with subreddits for legal, relationship, and general life advice.² During evaluation time, we will retrieve new situations from Reddit as a new test set for models. Workers on Mechanical Turk then grade the model-written advice versus the Reddit-endorsed human-written advice.
    ² We use advice from the following subreddits: Love, Relationships, Advice, NeedAdvice, Dating_Advice, Dating, Marriage, InternetParents, TechSupport, and LegalAdvice.

3.1.1 How advice-giving works on Reddit
Suppose a Reddit user faces an issue that they are seeking advice about. First, they write up their situation and post it to an advice-oriented subreddit. Users then reply to the situation, offering advice.
Importantly, any user can 'upvote' or 'downvote' the advice as well as the situation itself - changing its score slightly. Top-scoring advice is deemed by the wisdom of the crowd as being the most helpful.³
    ³ This is somewhat of a simplification, as other factors also influence what gets upvoted (Anderson et al., 2012; Lakkaraju et al., 2013; Muchnik et al., 2013; Jaech et al., 2015).

3.1.2 The ideal evaluation - through Reddit?
In a sense, human advice-givers are 'evaluated' on Reddit by the score of their advice – representing how well their advice has been received by the community. Similarly, the ideal model evaluation might be to post advice on Reddit directly. If the model writes helpful advice, it should be upvoted.
However, there is a significant ethical problem with this approach. The users who post advice questions are real people, with real problems. A user might read advice that was originally written by a machine, think it was human-endorsed, and do something harmful as a result. For this reason, we take an alternate crowdsourcing approach.

3.1.3 A crowdsourced, hybrid evaluation – through Mechanical Turk
We propose a hybrid approach for dynamic evaluation of models. While the situations and reference advice come from Reddit, we hire workers on Mechanical Turk to rate the relative helpfulness of machine-written advice. Not only is this format more ethical, it also lets us collect diagnostic ratings, allowing us to quantitatively track the natural language understanding errors made by machines. We made our crowdsourcing task as fulfilling as possible - using popular situations from Reddit, and pitching the work in terms of helping people. We received feedback from many workers that our tasks were entertaining and fun, suggesting that our workers are to some degree intrinsically motivated.

3.1.4 Mechanical Turk annotation setup
In a single round of evaluation, we retrieve 200 popular Reddit situations that were posted in the last two weeks. For each situation, we retrieve the top-rated advice from Reddit, and generate one piece of advice per model. Workers on Mechanical Turk then compare the helpfulness of the model-generated advice with human-written advice, and provide diagnostic ratings.
We show an overview of our Mechanical Turk task in Figure 2. A worker is given a situation and two pieces of advice. One is the top-scoring advice from Reddit, and the other is model-generated advice; the worker is not told which is which.

Figure 2: Crowdsourcing workflow. Mechanical Turk Workers are given a situation, and two pieces of advice. First, they choose which is more helpful (here, B). Second, they rate the helpfulness of the worse advice (A); last, they answer a diagnostic question.
  Given: Situation, Advice A, Advice B
  1. Which piece of advice is more helpful? [Definitely A | Slightly A | Slightly B | Definitely B]
  2. How helpful is the worse advice (A) to the question-asker? [Slightly helpful | Not helpful | Dangerous]
  3. (if Slightly helpful) Is Advice A worse mainly due to its meaning, or its writing? [Meaning | Writing]
  3. (otherwise) Could Advice A be applicable to (and helpful in) a different situation? [Possibly helpful | Never helpful]

The worker first chooses the more helpful piece of advice, then provides diagnostic information for the less helpful advice – rating it Slightly helpful, Not helpful, or Dangerous. If the worse piece of advice was Slightly helpful, they choose whether it is worse due to a Meaning problem or a Writing problem. Otherwise, they choose if the worse advice could be Possibly helpful in some other situation, or Never helpful in any situation.
Three workers rate each model-situation pair, and ratings are combined using a majority vote. We follow best practices on Mechanical Turk, using a qualification exam, paying workers at least $15 per hour, and giving feedback to workers. Still, evaluation is highly economical at $1.86 per example-model pair, or roughly $400 per model evaluated.

3.2 A large static dataset for training
We present RedditAdvice2019, a large static dataset for training advice-giving models. Because today's models have extreme reliance on data for finetuning, we collect data that is in the exact same format as RedditAdvice, yet we expand our selection criteria, optimizing for recall rather than precision (Supp A.2). In total, we extract 616k pieces of advice, over 188k situations.
To mirror the dynamic nature of the evaluation, in which models are evaluated on situations posted in 2020 and beyond, we split our dataset into static training and validation sets by date.⁴
    ⁴ Our training set contains 600k pieces of advice from July 2009 to June 14, 2019; validation contains 8k from June 14 to July 9th 2019.

4 Experimental Results on RedditAdvice

In this section, we report results from one round of dynamic evaluation on RedditAdvice. We evaluate the following strong NLP models and baselines:
a. Rule-based: a templated system to give legal, relationship, or life advice. The system first randomly chooses an empathetic sentence from ten choices, for example "I'm sorry you're facing this." It then chooses a random piece of advice that is loosely related to the situation's topic; we infer this from the subreddit the situation was posted on. For example, for
LegalAdvice the model might write "I'd suggest getting a lawyer immediately."
b. TF-IDF retrieval: for a new situation, we compute its TF-IDF bag-of-word vector and use it to retrieve the most similar situation from the training set. We then reply with the top-scoring advice for that situation.
c. Grover-Mega (Zellers et al., 2019b): a left-to-right transformer model with 1.5 billion parameters. Grover was pretrained on news articles with multiple fields, perhaps making it a good fit for our task, with multiple fields of context (like the subreddit, date, and title). Our situation-advice pairs are often quite long, so we adapt Grover for length; pretraining it on sequences of up to 1536 characters.
d. T5 (Raffel et al., 2019): a sequence-to-sequence model with a bidirectional encoder and a left-to-right generator, with 11 billion parameters. T5 was trained on a large dataset of cleaned web text. At the time of writing, T5 is the top-scoring model on the GLUE and SuperGLUE benchmarks (Wang et al., 2019b,a), scoring above human performance on GLUE and near human-performance on SuperGLUE.
e. GPT3 (Brown et al., 2020): a left-to-right transformer model with 175 billion parameters. GPT3 must be "prompted" to generate advice since it has not been released for finetuning. We cannot provide few-shot examples in the prompt due to the length of situation-advice pairs; we instead mimic the formatting of a website quoting from Reddit (Appendix B.5).
Last, to quantify the measurement error of our evaluation, we additionally evaluate:
f. the second-highest rated Reddit advice for each situation. We send this advice through the same pipeline as machine-written advice.
We finetune all models (except GPT3) and generate using Nucleus Sampling (Holtzman et al., 2020); more details in Appendix B.
In our study, we exclude purely bidirectional models, such as BERT (Devlin et al., 2019). While these models can be made to generate text, these generations are usually worse than those of left-to-right models (Wang and Cho, 2019). T5 also tends to outperform them, even on discriminative tasks.

4.1 Quantitative results
In Figure 3, we show overall results for one evaluation trial, which featured 200 situations posted on Reddit from October 28 to November 7, 2020. As a key metric for measuring the relative usefulness of model-written advice, we evaluate the frequency by which workers prefer the Reddit-written reference advice over the model-written advice. If a model's advice was just as helpful as human advice in aggregate, then that model would score 50%.

Figure 3: Helpfulness of models relative to top-scoring Reddit advice. We show results over 200 shared situations; we also show bootstrapped 95% confidence intervals. Advice from the best-scoring model, T5-11B, is preferred 14.5% over top-scoring Reddit advice. We also compare the second-top scoring piece of Reddit advice, which scores 41% – worse than the best advice (50% by definition), but better than any model.

Figure 4: Improvement (in absolute percentage %) between pairs of models, along with statistical significance from a paired t-test. The improvement of T5-11B over smaller models like Grover-Mega is highly statistically significant (10% gap, p < .01), while being far worse than human performance. Our evaluation thus meaningfully grades varying levels of performance.

Model performance is quite low. The best model, T5-11B, scores 14.5%, outperforming a smaller Grover-Mega (4.5%); GPT3 does worse at 4.0%.
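To illustrate how this headline preference number and its bootstrapped confidence interval can be computed (an assumed sketch, not the authors' evaluation code; the vote format is hypothetical), each situation's three worker judgments are first combined by majority vote:

```python
import numpy as np

def preference_rate(votes, n_boot=10_000, seed=0):
    """
    votes: per-situation worker judgments, e.g. [[1, 0, 1], ...], where 1 means
           the worker preferred the model's advice over the top-scoring Reddit
           advice. Three workers rate each model-situation pair.
    Returns the majority-vote preference rate and a bootstrapped 95% CI.
    """
    wins = np.array([int(sum(v) >= 2) for v in votes])   # majority of 3 workers
    rate = wins.mean()

    rng = np.random.default_rng(seed)
    resampled = rng.choice(wins, size=(n_boot, len(wins)), replace=True).mean(axis=1)
    lo, hi = np.percentile(resampled, [2.5, 97.5])
    return rate, (lo, hi)

# Toy example with 200 situations (random placeholder data, not real ratings).
rng = np.random.default_rng(1)
fake_votes = rng.integers(0, 2, size=(200, 3)).tolist()
rate, (lo, hi) = preference_rate(fake_votes)
print(f"model preferred {rate:.1%} of the time (95% CI {lo:.1%}-{hi:.1%})")
```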
The rule-based and TF-IDF baselines are competitive at 2.5% and 4.0% accuracy respectively.
As additional comparison to the 50% upper bound, the second-highest scoring Reddit advice scores 41%. This suggests that our workers often prefer the same advice as Reddit users.

4.1.1 Measurement error
To investigate the measurement error of our evaluation, in Figure 4 we report the statistical significance between pairs of models; details about how this is computed are in Appendix C. We observe a large gap in performance between T5 and the other baselines. For example, its improvement over Grover-Mega is 10%, which is highly statistically significant. On the other hand, the differences in performance between other models are more minor – GPT3 does not outperform TF-IDF, and though it outperforms the rule-based system by 1.5%, it is only somewhat statistically significant.
Overall, the statistical significance results suggest that our evaluation can stably rank model performance. This, along with the finding that model performance is low on our task, suggests that there is ample room for growth on RedditAdvice.

5 Analysis and Discussion

So far, we have shown that we are able to reliably evaluate models in our dynamic setup, and that doing so results in model performance that is significantly lower than human performance.
To break down what this gap in performance means, we show a qualitative example in Figure 5. A user is asking for online legal advice about being stuck at work for their entire 4pm-midnight shift – with no eating allowed due to COVID-19. The top-rated Reddit advice understands this situation and then offers advice, suggesting the advice-seeker organize alongside other workers – as "New Jersey doesn't require employers to give any meal breaks."
Machine advice seems to misunderstand the issue. T5 asks if there is "a reason that you can't leave the property," even though this reason is stated in the situation. GPT3's advice is self-contradictory; it also makes something up about a "restaurant" in the building.

Figure 5: Qualitative example; more in Supp. F. Though machine-generated advice matches keywords from the situation, it is frequently not helpful or even self-contradictory. The issues are due to critical errors in natural language understanding, such as reading comprehension, entailment, and coreference.
  Situation (legaladvice): NJ Work will not let us eat in the building due to covid outbreak. We can't leave the property for breaks. They have a tent outside for us to eat in but it's getting cold out. I am a manager at my work part of my responsibilities is to stay on the property during my entire shift we are given a 1 hour paid lunch break that we do not have to clock out for. However due to recent covid outbreak the new policy is no eating in the building. They had set up a tent outside for us to eat in but it is starting to get cold out as well there are not heaters in there. My question is can they legally do this to us? Also I work 4:00 p.m. to midnight shift.
  Top Reddit advice: So by "clock out for" you are not a salaried exempt employee? Do the workers under you have a union? Aside from the cold, New Jersey doesn't require employers to give *any* meal breaks for employees 18 or older, so they could take away the breaks and the tent if they wanted, and just say "no eating" (though water is required). What is your relationship with the other managers, and room to figure this out, ask for heaters, etc.?
  Rule-Based: If I were you, I'd get a lawyer immediately. [Definitely less helpful; Slightly helpful; Meaning problem]
  TF-IDF Retrieval: You were blackout drunk, which means that you were not capable of consent. . . . If anyone should feel violated, it's you. [Definitely less helpful; Not helpful; Possibly helpful]
  Grover-Mega: If you are required to be on the premises, they are required to provide you with a place to eat. https://www.nj.gov/labor/es/faq.html#13 [Definitely less helpful; Not helpful; Possibly helpful]
  T5-11B: If your shift is from 4pm to midnight, is there a reason that you can't leave the property? [Definitely less helpful; Not helpful; Never helpful]
  GPT3-175B: A place of employment is considered the "premises". You can't just assume that because you can't walk out the door that you can eat wherever you want. This is work hours. You're not in your home. The restaurant, or at least the food, is all provided for you for your benefit. But if there are other people there, then I could see how it could be a safety issue. [Definitely less helpful; Not helpful; Never helpful]

5.1 Problems with machine-written advice
As part of our evaluation, we wish to quantitatively measure problems with machine-written advice. Recall that in our crowdsourcing setup (Section 3.1.3), three workers select which advice is better, and then annotate problems with the worse piece of advice. We found workers had high agreement during the diagnostic annotation.⁵
    ⁵ For classifying machine-written advice as 'helpful' versus 'not helpful' or 'dangerous' (combining the two latter categories into one), we have κ = 0.689. For breaking down helpful advice into a 'meaning problem' versus a 'writing problem', we have Cohen's κ = 0.613; for rating unhelpful advice as 'possibly helpful' versus 'never helpful,' we have κ = 0.602.
In Figure 6, we show the distribution of the ratings for model-written, versus human-written advice. Machine-written advice that was
not preferred over human-written advice can have the following ratings. It can be rated as Slightly helpful (but, was rated as worse mainly due to a Meaning problem or Writing problem), as Not helpful, or Dangerous.

Figure 6: Distribution of ratings for three models: TF-IDF retrieval, GPT3, and T5, along with ratings for the second-best rated Reddit advice. Though deep generators like GPT3 and T5 are often preferred over the retrieval baseline, they also often write advice that would never be helpful (33% GPT3, 13% T5), and that is racist, sexist, or otherwise dangerous (10% GPT3, 3% T5).

The diagnostics show several patterns. First, all models frequently commit natural language understanding errors, such as internal contradiction. Because of this, we find that TF-IDF bag-of-words retrieval is competitive with large generators. While retrieved advice is often irrelevant (66% of the time), it is almost never complete gibberish, as it comes from top-scoring advice. Only 10% of workers rated this advice as Not helpful for any situation, less than T5.
Second, they suggest that models struggle even more without finetuning. A GPT3 model with careful prompting generates language that is Dangerous 10% of the time. These qualitative and quantitative results confirm a pattern observed by many others, that large language models like GPT3 often generate explicitly racist and sexist language out-of-the-box (Sheng et al., 2019; Gehman et al., 2020; Bender et al., 2021, among others). We explore this further in Supplemental F. This is perhaps worrying, since GPT3 is presently being commercialized.

5.2 A Leaderboard for Advice Evaluation
So far, we have shown results from one evaluation round; a second is in Supplemental D. We propose a dynamic leaderboard to keep that evaluation ongoing, at rowanzellers.com/advice.
Users submit a model API to be dynamically evaluated. Each new model, along with the highest rated previously-evaluated model, will be evaluated for an additional round using the same approach. The cost of each evaluation is reasonable (Section 3.1.4), which we authors will pay in the short term. An alternative strategy requires submitters to pay the Mechanical Turk fees themselves; this model was used for the HYPE leaderboard in computer vision (Zhou et al., 2019).

5.3 Relation to existing NLP tasks
Shared "core" tasks such as reading comprehension and natural language inference are of considerable interest to the NLP community. Many datasets have been proposed for these tasks, and progress on them is often measured through auto-gradeable correctness metrics. However, large models have started to outperform humans on these datasets, raising doubt that further progress on them brings us closer to human-level language understanding.
We argue two things: first, that many NLP tasks are necessary components of giving advice, and second, that because giving advice remains far from solved, these tasks are also far from solved. In Appendix F, we study problems with advice from T5-11B from the point of view of existing NLP tasks. For instance, machine advice often contradicts itself, suggesting that today's systems struggle with the general task of natural language inference. We have made these diagnostics publicly available to enable progress on automatically spotting these mistakes.

6 Conclusion; Ethical Considerations

We introduced new methodology for evaluating language tasks, reducing the gap between benchmarks and the real world. We also introduced a new challenge for the community, TuringAdvice, with an accompanying dataset and dynamic leaderboard.
Yet, if our field is to progress towards NLP models that 'understand natural language,' we should be cognizant of the impact that such technology
might have on society. In this paper, we presented a sketch of NLP models helping people who need advice on sensitive topics, which could be a measurable goal for the field.
At the same time, we do not claim that our approach is a panacea. There are almost certainly better non-technical solutions to ensure mentorship and legal advice for all (Green, 2019). Moreover, there are significant dual-use risks with models that understand language (Hovy and Spruit, 2016; Green and Viljoen, 2020). Our evaluation measures some risks of generative models – such as the tendency to generate toxic language – but more work in this area is needed.

Acknowledgements

Thanks to the Reddit users who participate in its advice subreddits – from asking for help, to writing (and voting on) helpful advice. Thanks to the Mechanical Turk workers who performed the annotation for our experiments. Thanks also to the three anonymous reviewers, along with Katharina Reinecke, Oren Etzioni, Hannah Rashkin, Maarten Sap, Maxwell Forbes, Jesse Thomason, Daniel Khashabi, Gabriel Ilharco, Swabha Swayamdipta, and Yonatan Bisk, for feedback. This research was supported in part by NSF (IIS-1524371, IIS-1714566), DARPA under the CwC program through the ARO (W911NF-15-1-0543), DARPA under the MCS program through NIWC Pacific (N66001-19-2-4031), and the NSF-GRFP No. DGE-1256082.

References

Daniel Adiwardana, Minh-Thang Luong, David R So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, et al. 2020. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977.

Ashton Anderson, Daniel P. Huttenlocher, Jon M. Kleinberg, and Jure Leskovec. 2012. Effects of user similarity in social media. In WSDM '12.

Yoav Artzi, Nicholas FitzGerald, and Luke S Zettlemoyer. 2013. Semantic parsing with combinatory categorial grammars. ACL (Tutorial Abstracts), 3.

Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 1–61, Florence, Italy. Association for Computational Linguistics.

Emily M Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 610–623.

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. 2020. Piqa: Reasoning about physical commonsense in natural language. In Thirty-Fourth AAAI Conference on Artificial Intelligence.

Silvia Bonaccio and Reeshad S. Dalal. 2006. Advice taking and decision-making: An integrative literature review, and implications for the organizational sciences. Organizational Behavior and Human Decision Processes, 101(2):127–151.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 632–642.

Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.

Ping Chen, Fei Wu, Tong Wang, and Wei Ding. 2018. A semantic qa-based approach for text summarization evaluation. In Thirty-Second AAAI Conference on Artificial Intelligence.

Council of Europe. 2001. Common European Framework of Reference for Languages: learning, teaching, assessment. Cambridge University Press.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The pascal recognising textual entailment challenge. In Machine learning challenges. evaluating predictive uncertainty, visual object classification, and recognising tectual entailment, pages 177–190. Springer.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
Matan Eyal, Tal Baumel, and Michael Elhadad. 2019. Question answering as an automatic evaluation metric for news article summarization. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3938–3948.

Michael C. Frank and Noah D. Goodman. 2012. Predicting pragmatic reasoning in language games. Science, 336(6084):998–998.

Matt Gardner, Jonathan Berant, Hannaneh Hajishirzi, Alon Talmor, and Sewon Min. 2019. On making reading comprehension more comprehensive. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pages 105–112, Hong Kong, China. Association for Computational Linguistics.

Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. 2020. RealToxicityPrompts: Evaluating neural toxic degeneration in language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 3356–3369.

Dave Golland, Percy Liang, and Dan Klein. 2010. A game-theoretic approach to generating spatial descriptions. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 410–419. Association for Computational Linguistics.

Venkata Subrahmanyan Govindarajan, Benjamin Chen, Rebecca Warholic, Katrin Erk, and Junyi Jessy Li. 2020. Help! Need advice on identifying advice. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5295–5306.

Ben Green. 2019. “Good” isn't good enough. In Proceedings of the AI for Social Good workshop at NeurIPS.

Ben Green and Salomé Viljoen. 2020. Algorithmic realism: Expanding the boundaries of algorithmic thought. In Proceedings of the ACM Conference on Fairness, Accountability, and Transparency (FAT*).

Danna Gurari, Qing Li, Abigale J. Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P. Bigham. 2018. VizWiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3608–3617.

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In Proc. of NAACL.

Tatsunori Hashimoto, Hugh Zhang, and Percy Liang. 2019. Unifying human and statistical evaluation for natural language generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1689–1701.

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration. In ICLR.

Dirk Hovy and Shannon L. Spruit. 2016. The social impact of natural language processing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 591–598.

Aaron Jaech, Victoria Zayats, Hao Fang, Mari Ostendorf, and Hannaneh Hajishirzi. 2015. Talking to the crowd: What do people react to in online discussions? In EMNLP.

Satwik Kottur, José Moura, Stefan Lee, and Dhruv Batra. 2017. Natural language does not emerge ‘naturally’ in multi-agent dialog. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2962–2967, Copenhagen, Denmark. Association for Computational Linguistics.

Wojciech Kryscinski, Nitish Shirish Keskar, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. Neural text summarization: A critical evaluation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 540–551, Hong Kong, China. Association for Computational Linguistics.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics.

Himabindu Lakkaraju, Julian J. McAuley, and Jure Leskovec. 2013. What's in a name? Understanding the interplay between titles, content, and communities in social media. In ICWSM.

Angeliki Lazaridou, Alexander Peysakhovich, and Marco Baroni. 2017. Multi-agent cooperation and the emergence of (natural) language. In ICLR.

Ronan Le Bras, Swabha Swayamdipta, Chandra Bhagavatula, Rowan Zellers, Matthew E. Peters, Ashish Sabharwal, and Yejin Choi. 2020. Adversarial filters of dataset biases. ArXiv, abs/2002.04108.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
James L. McClelland, Felix Hill, Maja Rudolph, Jason Baldridge, and Hinrich Schütze. 2019. Extending machine language models toward human-level language understanding. arXiv preprint arXiv:1912.05877.

Lev Muchnik, Sinan Aral, and Sean J. Taylor. 2013. Social influence bias: A randomized experiment. Science, 341(6146):647–651.

Fernando Pereira. 2000. Formal grammar and information theory: together again? Philosophical Transactions of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences, 358(1769):1239–1253.

Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang. 2012. CoNLL-2012 shared task: Modeling multilingual unrestricted coreference in OntoNotes. In Joint Conference on EMNLP and CoNLL-Shared Task, pages 1–40. Association for Computational Linguistics.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv e-prints.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392.

Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. 2019. Social IQa: Commonsense reasoning about social interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4453–4463.

Thomas Scialom, Sylvain Lamprier, Benjamin Piwowarski, and Jacopo Staiano. 2019. Answers unite! Unsupervised metrics for reinforced summarization models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3246–3256, Hong Kong, China. Association for Computational Linguistics.

Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 464–468.

Noam Shazeer and Mitchell Stern. 2018. Adafactor: Adaptive learning rates with sublinear memory cost. In International Conference on Machine Learning, pages 4603–4611.

Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. 2019. The woman worked as a babysitter: On biases in language generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3407–3412, Hong Kong, China. Association for Computational Linguistics.

Naftali Tishby, Fernando C. Pereira, and William Bialek. 1999. The information bottleneck method. In Proc. of the 37th Annual Allerton Conference on Communication, Control and Computing, pages 368–377.

Antonio Torralba and Alexei A. Efros. 2011. Unbiased look at dataset bias. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1521–1528. IEEE.

Alan M. Turing. 1950. Computing Machinery and Intelligence. Mind, LIX(236):433–460.

Oleg Vasilyev, Vedant Dharnidharka, and John Bohannon. 2020. Fill in the BLANC: Human-free quality estimation of document summaries. arXiv preprint arXiv:2002.09836.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 6000–6010. Curran Associates Inc.

Anu Venkatesh, Chandra Khatri, Ashwin Ram, Fenfei Guo, Raefer Gabriel, Ashish Nagar, Rohit Prasad, Ming Cheng, Behnam Hedayatnia, Angeliki Metallinou, et al. 2018. On evaluating and comparing open domain dialog systems. arXiv preprint arXiv:1801.03625.

Alex Wang and Kyunghyun Cho. 2019. BERT has a mouth, and it must speak: BERT as a Markov random field language model. In Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation, pages 30–36.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019a. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 3261–3275. Curran Associates, Inc.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019b. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of ICLR.
Sida I. Wang, Percy Liang, and Christopher D. Manning. 2016. Learning language games through interaction. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2368–2378.

Ludwig Wittgenstein. 1953. Philosophical Investigations. Wiley-Blackwell.

Dani Yogatama, Cyprien de Masson d'Autume, Jerome Connor, Tomas Kocisky, Mike Chrzanowski, Lingpeng Kong, Angeliki Lazaridou, Wang Ling, Lei Yu, Chris Dyer, et al. 2019. Learning and evaluating general linguistic intelligence. arXiv preprint arXiv:1901.11373.

Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 93–104, Brussels, Belgium. Association for Computational Linguistics.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019a. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, Florence, Italy. Association for Computational Linguistics.

Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. 2019b. Defending against neural fake news. In Advances in Neural Information Processing Systems 32.

Sharon Zhou, Mitchell Gordon, Ranjay Krishna, Austin Narcomey, Li Fei-Fei, and Michael Bernstein. 2019. HYPE: A benchmark for human eye perceptual evaluation of generative models. In Advances in Neural Information Processing Systems, pages 3444–3456.

Appendix

We provide the following items in the appendix:

• Dataset filtering criteria (Section A)
• Baseline model details (Section B)
• Computing statistical significance (Section C)
• Results from a different round of dynamic evaluation (Section D)
• Miscellaneous analysis (Section E)
• Additional qualitative examples (Section F)

For more up-to-date information, visit the project page and dynamic leaderboard at rowanzellers.com/advice.

[Figure 7: two log-scale histograms of length in spaCy tokens (x-axis: Length (spaCy tokens); y-axis: Frequency), comparing RedditAdvice situations against HellaSwag, GLUE, and SuperGLUE on the left, and showing RedditAdvice advice on the right.]

Figure 7: Length distribution of RedditAdvice, compared with other common NLU benchmarks (HellaSwag, Zellers et al., 2019a; GLUE, Wang et al., 2019b; SuperGLUE, Wang et al., 2019a). The examples in RedditAdvice are significantly longer, representing highly complex situations.
A Dataset Filtering Criteria

We discuss the criteria by which we extract situations and advice, both for our dynamic dataset RedditAdvice and for our static training dataset RedditAdvice2019.

A.1 Dynamic Filtering Criteria for RedditAdvice

We use the following selection criteria for retrieving situations, along with the top-scoring advice, from Reddit. Using the Reddit API, we loop through Reddit posts, which might contain valid situations. We perform several checks on each post to ensure that we can reliably extract a situation from it, as well as a top-scoring piece of advice from its comments.

We do the following to retrieve situations:
a. We iterate through the top-scoring posts that were posted between 36 hours and two weeks ago on the following advice subreddits: Relationships, Advice, NeedAdvice, Dating_Advice, Dating, Love, Marriage, InternetParents, TechSupport, and LegalAdvice.
b. We skip ‘update’ posts, in which a user refers to an older situation that they posted, and ‘meta’ posts, in which subreddit rules are discussed.
c. We skip any post that has an HTML link, since today's models (presumably) would not be able to visit such a link.
d. We skip any post with a score of less than 20.
e. We do our best to clean the text of the post. Many posts include valid situations, but are then edited to include updates that took place afterwards, in response to advice that was given. These are typically delimited by dashed lines and the word EDIT or UPDATE.
f. Posts in some of the subreddits (Dating_Advice, Dating, Love, Marriage) are often in the form of tips and general suggestions, rather than situations. We skip any posts from these subreddits that do not include a question mark.
g. We filter out posts that contain sensitive topics, such as assault, suicide, and abuse.
h. Last, we skip any post that is shorter than 128 spaCy tokens or longer than 1280 spaCy tokens in total.

For a retrieved situation, we do the following to extract valid advice:
a. Given a post that contains a valid situation, we order the comments from highest to lowest scoring. We perform the following checks to determine whether we can extract valid advice; once we find valid advice, we stop iterating.
b. We skip any comment that was posted by a moderator or by the Reddit user who posted the original situation, or that was edited.
c. We skip any comment with a score of less than 20.
d. We skip any comment that contains fewer than 32 spaCy tokens.
e. One corner case is highly-scoring advice comments that refer implicitly to others. For instance, a comment might say ‘You should listen to the other commenters and...’ These references make sense inside a Reddit post, but they are somewhat nonsensical when we pull the comment out of context. We thus skip any comment that seems to refer to other comments.

Once we retrieve a situation that has at least one piece of valid advice, we are done and move on to the next situation. We loop over the 1000 top-scoring posts in total, and randomly select 200 valid situations from this pool.
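Below is a minimal, illustrative sketch of these filtering rules in Python. It is not the exact code behind RedditAdvice: the Post and Comment containers, the sensitive-topic keyword list, the EDIT/UPDATE marker pattern, and the check for comments that reference other commenters are all simplifying assumptions, and criterion (a) for situations (retrieving the top-scoring posts from the listed subreddits within the 36-hour-to-two-week window) is assumed to happen when top_posts is collected via the Reddit API. Only the thresholds (post and comment scores of at least 20, situations of 128–1280 spaCy tokens, advice of at least 32 spaCy tokens, 1000 candidate posts, 200 sampled situations) come directly from the criteria above.

```python
# Hypothetical sketch of the Section A.1 filtering criteria; field names,
# keyword lists, and regexes are illustrative assumptions, not the released code.
import random
import re
from dataclasses import dataclass, field
from typing import List, Optional

import spacy

nlp = spacy.blank("en")  # tokenizer-only pipeline; we only need token counts

TIP_STYLE_SUBREDDITS = {"Dating_Advice", "Dating", "Love", "Marriage"}
SENSITIVE_KEYWORDS = {"assault", "suicide", "abuse"}  # assumed keyword filter (criterion g)
EDIT_MARKER = re.compile(r"^\s*-{3,}|^\s*(EDIT|UPDATE)\b", re.MULTILINE)
LINK_PATTERN = re.compile(r"https?://")


@dataclass
class Comment:
    text: str
    score: int
    edited: bool
    author_is_moderator: bool
    author_is_op: bool


@dataclass
class Post:
    subreddit: str
    title: str
    text: str
    score: int
    is_update_or_meta: bool
    comments: List[Comment] = field(default_factory=list)


def n_tokens(text: str) -> int:
    return len(nlp(text))


def clean_situation(text: str) -> str:
    """Drop trailing EDIT/UPDATE blocks appended after advice was given (criterion e)."""
    match = EDIT_MARKER.search(text)
    return text[: match.start()].strip() if match else text.strip()


def extract_situation(post: Post) -> Optional[str]:
    """Situation criteria b-h; criterion a is assumed handled when fetching posts."""
    if post.is_update_or_meta:                                       # b
        return None
    if LINK_PATTERN.search(post.text):                               # c
        return None
    if post.score < 20:                                              # d
        return None
    text = clean_situation(post.text)                                # e
    if post.subreddit in TIP_STYLE_SUBREDDITS and "?" not in text:   # f
        return None
    lowered = (post.title + " " + text).lower()
    if any(word in lowered for word in SENSITIVE_KEYWORDS):          # g
        return None
    if not 128 <= n_tokens(text) <= 1280:                            # h
        return None
    return text


def extract_advice(post: Post) -> Optional[str]:
    """Advice criteria a-e: return the highest-scoring comment passing all checks."""
    for comment in sorted(post.comments, key=lambda c: c.score, reverse=True):
        if comment.author_is_moderator or comment.author_is_op or comment.edited:  # b
            continue
        if comment.score < 20:                                       # c
            continue
        if n_tokens(comment.text) < 32:                              # d
            continue
        if "other commenter" in comment.text.lower():                # e (crude reference check)
            continue
        return comment.text
    return None


def build_evaluation_pool(top_posts: List[Post], k: int = 200) -> List[dict]:
    """Loop over the 1000 top-scoring posts and sample 200 valid situations."""
    valid = []
    for post in top_posts[:1000]:
        situation = extract_situation(post)
        if situation is None:
            continue
        advice = extract_advice(post)
        if advice is None:
            continue
        valid.append({"situation": situation, "advice": advice})
    return random.sample(valid, min(k, len(valid)))
```

Using the same spaCy tokenizer for the 128/1280-token situation bounds and the 32-token advice minimum keeps these checks consistent with the lengths reported in Figure 7, which are also measured in spaCy tokens.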