What Question Answering can Learn from Trivia Nerds


Jordan Boyd-Graber†
iSchool, CS, UMIACS, LSC
University of Maryland
jbg@umiacs.umd.edu

Benjamin Börschinger†
†Google Research, Zürich
{jbg, bboerschinger}@google.com

Abstract

In addition to machines answering questions, question answering (QA) research creates interesting, challenging questions that reveal the best systems. We argue that creating a QA dataset—and its ubiquitous leaderboard—closely resembles running a trivia tournament: you write questions, have agents—humans or machines—answer questions, and declare a winner. However, the research community has ignored the lessons from decades of the trivia community creating vibrant, fair, and effective QA competitions. After detailing problems with existing QA datasets, we outline several lessons that transfer to QA research: removing ambiguity, discriminating skill, and adjudicating disputes.

1   Introduction

This paper takes an unconventional analysis to answer “where we’ve been and where we’re going” in question answering (QA). Instead of approaching the question only as ACL researchers, we apply the best practices of trivia tournaments to QA datasets.

The QA community is obsessed with evaluation. Schools, companies, and newspapers hail new SOTAs and topping leaderboards, giving rise to claims that an “AI model tops humans” (Najberg, 2018) because it ‘won’ some leaderboard, putting “millions of jobs at risk” (Cuthbertson, 2018). But what is a leaderboard? A leaderboard is a statistic about QA accuracy that induces a ranking over participants.

Newsflash: this is the same as a trivia tournament. The trivia community has been doing this for decades (Jennings, 2006); Section 2 details this overlap between the qualities of a first-class QA dataset (and its requisite leaderboard). The experts running these tournaments are imperfect, but they’ve learned from their past mistakes (see Appendix A for a brief historical perspective) and created a community that reliably identifies those best at question answering. Beyond the format of the competition, important safeguards ensure individual questions are clear, unambiguous, and reward knowledge (Section 3).

We are not saying that academic QA should surrender to trivia questions or the community—far from it! The trivia community does not understand the real world information seeking needs of users or what questions challenge computers. However, they know how, given a bunch of questions, to declare that someone is better at answering questions than another. This collection of tradecraft and principles can help the QA community.

Beyond these general concepts that QA can learn from, Section 4 reviews how the “gold standard” of trivia formats, Quizbowl, can improve traditional QA. We then briefly discuss how research that uses fun, fair, and good trivia questions can benefit from the expertise, pedantry, and passion of the trivia community (Section 5).

2   Surprise, this is a Trivia Tournament!

“My research isn’t a silly trivia tournament,” you say. That may be, but let us first tell you a little about what running a tournament is like, and perhaps you might see similarities.

First, the questions. Either you write them yourself or you pay someone to write them (sometimes people on the Internet). There is a fixed number of questions you need to hit by a particular date.

Then, you advertise. You talk about your questions: who is writing them, what subjects are covered, and why people should try to answer them.

Next, you have the tournament. You keep your questions secure until test time, collect answers from all participants, and declare a winner. Afterward, people use the questions to train for future tournaments.
These have natural analogs to crowdsourcing questions, writing the paper, advertising, and running a leaderboard. The biases of academia put much more emphasis on the paper, but there are components where trivia tournament best practices could help. In particular, we focus on fun, well-calibrated, and discriminative tournaments.

2.1   Are we Having Fun?

Many datasets use crowdworkers to establish human accuracy (Rajpurkar et al., 2016; Choi et al., 2018). However, these are not the only humans who should answer a dataset’s questions. So should the datasets’ creators.

In the trivia world, this is called a play test: get in the shoes of someone answering the questions yourself. If you find them boring, repetitive, or uninteresting, so will crowdworkers; and if you, as a human, can find shortcuts to answer questions (Rondeau and Hazen, 2018), so will a computer.

Concretely, Weissenborn et al. (2017) catalog artifacts in SQuAD (Rajpurkar et al., 2018), arguably the most popular computer QA leaderboard; and indeed, many of the questions are not particularly fun to answer from a human perspective; they’re fairly formulaic. If you see a list like “Along with Canada and the United Kingdom, what country. . . ”, you can ignore the rest of the question and just type Ctrl+F (Russell, 2020; Yuan et al., 2019) to find the third country—Australia in this case—that appears with “Canada and the UK”. Other times, a SQuAD playtest would reveal frustrating questions that are i) answerable given the information in the paragraph but not with a direct span,1 ii) answerable only given facts beyond the given paragraph,2 iii) unintentionally embedded in a discourse, resulting in linguistically odd questions with arbitrary correct answers,3 or iv) non-questions.

1 A source paragraph says “In [Commonwealth countries]. . . the term is generally restricted to. . . Private education in North America covers the whole gamut. . . ”; thus, the question “What is the term private school restricted to in the US?” has the information needed but not as a span.
2 A source paragraph says “Sculptors [in the collection include] Nicholas Stone, Caius Gabriel Cibber, [...], Thomas Brock, Alfred Gilbert, [...] and Eric Gill[.]”, i.e., a list of names; thus, the question “Which British sculptor whose work includes the Queen Victoria memorial in front of Buckingham Palace is included in the V&A collection?” should be unanswerable in traditional machine reading.
3 A question “Who else did Luther use violent rhetoric towards?” has the gold answer “writings condemning the Jews and in diatribes against Turks”.
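To make the Ctrl+F shortcut above concrete, here is a minimal sketch of the kind of pattern matching a system (or a bored human) can get away with; the passage and the heuristic are illustrative and are not taken from any dataset or evaluation code:

```python
import re

# An illustrative passage in the style of the SQuAD example above.
passage = ("Along with Canada and the United Kingdom, Australia "
           "is a member of the Commonwealth of Nations.")

def ctrl_f_shortcut(passage: str) -> str:
    """Answer the question without ever reading it: grab whichever
    capitalized word is listed alongside Canada and the United Kingdom."""
    match = re.search(r"Canada and the United Kingdom,\s+([A-Z][a-z]+)", passage)
    return match.group(1) if match else ""

print(ctrl_f_shortcut(passage))  # -> Australia
```

If a one-line regular expression can reproduce the gold answer, a neural model can certainly learn the same shortcut.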
Or consider SearchQA (Dunn et al., 2017), derived from the game Jeopardy!, which asks “An article that he wrote about his riverboat days was eventually expanded into Life on the Mississippi.” The young apprentice and newspaper writer who wrote the article is named Samuel Clemens; however, the reference answer is that author’s later pen name, Mark Twain. Most QA evaluation metrics would count Samuel Clemens as incorrect. In a real game of Jeopardy!, this would not be an issue (Section 3.1).

Of course, fun is relative and any dataset is bound to contain at least some errors. However, playtesting is an easy way to find systematic problems in your dataset: unfair, unfun playtests make for ineffective leaderboards. Eating your own dog food can help diagnose artifacts, scoring issues, or other shortcomings early in the process.

Boyd-Graber et al. (2012) created an interface for play testing that was fun enough that people played for free. After two weeks the site was taken down, but it was popular enough that the trivia community forked the open source code to create a bootleg version that is still going strong almost a decade later.

The deeper issues when creating a QA task are: have you designed a task that is internally consistent, supported by a scoring metric that matches your goals (more on this in a moment), using gold annotations that correctly reward those who do the task well? Imagine someone who loves answering the questions your task poses: would they have fun on your task? If so, you may have a good dataset. Gamification (von Ahn, 2006) harnesses users’ passion, which is a better motivator than traditional paid labeling. Even if you pay crowdworkers, if your questions are particularly unfun, you may need to think carefully about your dataset and your goals.

2.2   Am I Measuring what I Care About?

Answering questions requires multiple skills: identifying where an answer is mentioned (Hermann et al., 2015), knowing the canonical name for the answer (Yih et al., 2015), realizing when to answer and when to abstain (Rajpurkar et al., 2018), or being able to justify an answer explicitly with evidence (Thorne et al., 2018). In QA, the emphasis on SOTA and leaderboards has focused attention on single automatically computable metrics—systems tend to be compared by their ‘SQuAD score’ or their ‘NQ score’, as if this were all there is to say about their relative capabilities. Like QA leaderboards, trivia tournaments need to decide on a single winner, but they explicitly recognize that there are more interesting comparisons to be made.
For example, a tournament may recognize different backgrounds/resources—high school, small school, undergrads (Henzel, 2018). Similarly, more practical leaderboards would reflect training time or resource requirements (see Dodge et al., 2019), including ‘constrained’ or ‘unconstrained’ training (Bojar et al., 2014). Tournaments also give awards for specific skills (e.g., fewest incorrect answers). Again, there are obvious leaderboard analogs that would go beyond a single number. For example, in SQuAD 2.0, abstaining contributes the same to the overall F1 as a fully correct answer, obscuring whether a system is more precise or an effective abstainer. If the task recognizes both abilities as important, reporting a single score risks implicitly prioritizing one balance of the two.

As a positive example of being explicit, the 2018 FEVER shared task (Thorne et al., 2018) favors getting the correct final answer over exhaustively justifying said answer, with a metric that requires only one piece of evidence per question. However, because the leaderboard still breaks out the full justification precision and recall scores, it is clear that the most precise system was only in fourth place on the “primary” leaderboard and that the top three “primary metric” winners compromise on evidence selection.

2.3   Do my Questions Separate the Best?

Let us assume that you have picked a metric (or a set of metrics) that captures what you care about: systems answer questions correctly, abstain when they cannot, explain why they answered the way they did, or whatever facet of QA is most important for your dataset. Now, this leaderboard can rack up citations as people chase the top spot. But your leaderboard is only useful if it is discriminative: does it separate the best from the rest?

There are many ways questions might not be discriminative. If every system gets a question right (e.g., abstains on “asdf” or correctly answers “What is the capital of Poland?”), it does not separate participants. Similarly, if every system flubs “what is the oldest north-facing kosher restaurant”, it also does not discriminate systems. Sugawara et al. (2018) call these questions “easy” and “hard”; we instead argue for a three-way distinction.

In between easy questions (a system answers correctly with probability 1.0) and hard ones (probability 0.0), questions with probabilities nearer to 0.5 are more interesting. Taking a cue from Vygotsky’s proximal development theory of human learning (Chaiklin, 2003), these discriminative questions—rather than the easy or the hard ones—should help improve QA systems the most. These Goldilocks4 questions are also most important for deciding who will sit atop the leaderboard; ideally they (and not random noise) will decide. Unfortunately, many existing datasets seem to have many easy questions: Sugawara et al. (2020) find that ablations like shuffling word order (Feng and Boyd-Graber, 2019), shuffling sentences, or only offering the most similar sentence do not impair systems, and Rondeau and Hazen (2018) hypothesize that most QA systems do little more than pattern matching, impressive performance numbers notwithstanding.

4 In a British folktale first recorded by Robert Southey, an interloper, “Goldilocks”, finds three bowls of porridge: one too hot, one too cold, and one “just right”. Goldilocks questions’ difficulty is likewise “just right”.
2.4   Why so few Goldilocks Questions?

This is a common problem in trivia tournaments, particularly pub quizzes (Diamond, 2009), where too-difficult questions can scare off patrons. Many quiz masters prefer popularity over discrimination and thus prefer easier questions.

Sometimes there are fewer Goldilocks questions not by choice, but by chance: a dataset becomes less discriminative through annotation error. All datasets have some annotation error; if this annotation error is concentrated on the Goldilocks questions, the dataset will be less useful. As we write this in 2019, humans and computers sometimes struggle on the same questions. Thus, annotation error is likely to be correlated with which questions will determine who will sit atop a leaderboard.

Figure 1: Two datasets with 0.16 annotation error, but the top better discriminates QA ability. In the good dataset (top), most questions are challenging but not impossible. In the bad dataset (bottom), there are more trivial or impossible questions and annotation error is concentrated on the challenging, discriminative questions. Thus, a smaller fraction of questions decide who sits atop the leaderboard, requiring a larger test set.

Figure 1 shows two datasets with the same annotation error and the same number of overall questions. However, they have different difficulty distributions and different correlations between annotation error and difficulty. The dataset that has more discriminative questions and consistent annotator error has fewer questions that are effectively useless for determining the winner of the leaderboard. We call this the effective dataset proportion ρ. A dataset with no annotation error and only discriminative questions has ρ = 1.0, while the bad dataset in Figure 1 has ρ = 0.16. Figure 2 shows the test set size required to reliably discriminate systems for different values of ρ, based on a simulation described in Appendix B.

At this point, you might be despairing about how big you need your dataset to be.5 The same terror transfixes trivia tournament organizers. We discuss a technique they use for making individual questions more discriminative, a property called pyramidality, in Section 4.

5 Indeed, using a more sophisticated simulation approach, Voorhees (2003) found that the TREC 2002 QA test set could not discriminate systems with less than a seven-point absolute score difference.

3   The Craft of Question Writing

One thing that trivia enthusiasts agree on is that questions need to be well written. Research shows that asking “good questions” requires sophisticated pragmatic reasoning (Hawkins et al., 2015), and pedagogy explicitly acknowledges the complexity of writing effective questions for assessing student performance (for a book-length treatment of writing good multiple-choice questions, see ?). There is no question that writing good questions is a craft in its own right.

QA datasets, however, are often collected from the wild or written by untrained crowdworkers. Even assuming they are confident users of English (the primary language of QA datasets), crowdworkers lack experience in crafting questions and may introduce idiosyncrasies that shortcut machine learning (Geva et al., 2019). Similarly, data collected from the wild, such as Natural Questions (Kwiatkowski et al., 2019), by design have vast variations in quality. In the previous section, we focused on how datasets as a whole should be structured. Now, we focus on how specific questions should be structured to make the dataset as valuable as possible.

3.1   Avoiding Ambiguity and Assumptions

Ambiguity in questions not only frustrates answerers who resolved the ambiguity ‘incorrectly’; it also undermines the usefulness of questions for precisely assessing knowledge at all. For this reason, the US Department of Transportation explicitly bans ambiguous questions from exams for flight instructors (?), and the trivia community has developed rules that prevent ambiguity from arising. While this is true in many contexts, examples are rife in a format called Quizbowl (Boyd-Graber et al., 2012), whose very long questions6 showcase trivia writers’ tactics. For example, Zhu Ying (2005 PARFAIT) warns [emphasis added]:

   He’s not Sherlock Holmes, but his address is 221B. He’s not the Janitor on Scrubs, but his father is played by R. Lee Ermy. [. . . ] For ten points, name this misanthropic, crippled, Vicodin-dependent central character of a FOX medical drama.
   ANSWER: Gregory House, MD

to head off these wrong answers.

6 Like Jeopardy!, they are not syntactically questions but still are designed to elicit knowledge-based responses; for consistency, we will still call them questions.

In contrast, QA datasets often contain ambiguous and under-specified questions. While this sometimes reflects real-world complexities, such as actual under-specified or ill-formed search queries (Faruqui and Das, 2018; Kwiatkowski et al., 2019), simply ignoring this ambiguity is problematic. As a concrete example, consider Natural Questions (Kwiatkowski et al., 2019), where the gold answer to “what year did the us hockey team won [sic] the olympics” has the answers 1960 and 1980, ignoring the US women’s team, which won in 1998 and 2018, and further assuming the query is about ice rather than field hockey (also an Olympic sport). This ambiguity is a post hoc interpretation, as the associated Wikipedia pages are, indeed, about the United States men’s national ice hockey team. We contend the ambiguity persists in the original question, and the information retrieval arbitrarily provides one of many interpretations. True to its name, these under-specified queries make up a substantial (if not almost the entire) fraction of natural questions in search contexts.
[Figure 2: three panels, one per value of ρ (0.25, 0.85, 1.00); x axis: accuracy difference between the two systems; y axis: test set size needed (100 to 10000+, log scale); shading: average accuracy of the systems (50 to 90).]

Figure 2: How much test data do you need to discriminate two systems with 95% confidence? This depends on both the difference in accuracy between the systems (x axis) and the average accuracy of the systems (closer to 50% is harder). Test set creators do not have much control over those. They do have control, however, over how many questions are discriminative. If all questions are discriminative (right), you only need 2500 questions, but if three quarters of your questions are too easy, too hard, or have annotation errors (left), you’ll need 15000.
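To give a feel for where curves like these come from, here is a minimal simulation in the same spirit; the modeling choices (ineffective questions give both systems the same outcome, a coarse grid of test sizes) are ours for illustration and are not the exact procedure of Appendix B:

```python
import numpy as np

rng = np.random.default_rng(0)

def discrimination_rate(n_questions: int, acc_a: float, acc_b: float,
                        rho: float, trials: int = 5000) -> float:
    """Fraction of simulated test sets on which the truly better system A
    scores strictly higher than system B. Questions that are too easy, too
    hard, or mis-annotated give both systems the same outcome, so only the
    rho fraction of effective questions can separate them."""
    n_effective = int(rho * n_questions)
    score_a = rng.binomial(n_effective, acc_a, size=trials)
    score_b = rng.binomial(n_effective, acc_b, size=trials)
    return float(np.mean(score_a > score_b))

def needed_test_size(acc_a: float, acc_b: float, rho: float,
                     confidence: float = 0.95):
    """Smallest test set size on a coarse grid that separates the systems."""
    for n in (100, 250, 500, 1000, 2500, 5000, 10000, 15000):
        if discrimination_rate(n, acc_a, acc_b, rho) >= confidence:
            return n
    return None  # not separable on this grid

# A five-point accuracy gap is much cheaper to detect when rho is high.
print(needed_test_size(0.75, 0.70, rho=1.00))  # a few hundred questions
print(needed_test_size(0.75, 0.70, rho=0.25))  # several times more
```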

The problem is neither that such questions exist nor that MRQA considers questions relative to the associated context. The problem is that tasks do not explicitly acknowledge the original ambiguity and instead gloss over the implicit assumptions used in creating the data. This introduces potential noise and bias (i.e., giving a bonus to systems that make the same assumptions as the dataset) in leaderboard rankings. At best, these will become part of the measurement error of datasets (no dataset is perfect). At worst, they will recapitulate the biases that went into the creation of the datasets. Then, the community will implicitly equate the biases with correctness: you get high scores if you adopt this set of assumptions. Then, these enter into real-world systems, further perpetuating the bias. Playtesting can reveal these issues (Section 2.1), as implicit assumptions can rob a player of correctly answered questions. If you wanted to answer 2014 to “when did Michigan last win the championship”—when the Michigan State Spartans won the Women’s Cross Country championship—and you cannot because you chose the wrong school, the wrong sport, and the wrong gender, you would complain as a player; researchers instead discover latent assumptions that crept into the data.7

7 Where to draw the line is a matter of judgment; computers—who lack common sense—might find questions ambiguous where humans would not.

It is worth emphasizing that this is not a purely hypothetical problem. For example, Open Domain Retrieval Question Answering (Lee et al., 2019) deliberately avoids providing a reference context for the question in its framing but, in re-purposing data such as Natural Questions, opaquely relies on it for the gold answers.

3.2   Avoiding Superficial Evaluations

A related issue is that, in the words of Voorhees and Tice (2000), “there is no such thing as a question with an obvious answer”. As a consequence, trivia question authors take care to delineate acceptable and unacceptable answers.

For example, Robert Chu (Harvard Fall XI) uses a mental model of an answerer to explicitly delineate the range of acceptable correct answers:

   In Newtonian gravity, this quantity satisfies Poisson’s equation. [. . . ] For a dipole, this quantity is given by negative the dipole moment dotted with the electric field. [. . . ] For 10 points, name this form of energy contrasted with kinetic.
   ANSWER: potential energy (prompt on energy; accept specific types like electrical potential energy or gravitational potential energy; do not accept or prompt on just “potential”)

Likewise, the style guides for writing questions stipulate that you must give the answer type clearly and early on. These mentions specify whether you want a book, a collection, a movement, etc. They also signal the level of specificity requested. For example, a question about a date must state “day and month required” (September 11), “month and year required” (April 1968), or “day, month, and year required” (September 1, 1939). This is true for other answers as well: city and team, party and country, or more generally “two answers required”. Despite all of these conventions, no pre-defined set of answers is perfect, and a process for adjudicating answers is an integral part of trivia competitions.
In major high school and college national competitions and game shows, if low-level staff cannot resolve the issue by either throwing out a single question or accepting minor variations (America instead of USA), the low-level staff contacts the tournament director. The tournament director—with their deeper knowledge of rules and questions—often decides the issue. If not, the protest goes through an adjudication process designed to minimize bias:8 write the summary of the dispute, get all parties to agree to the summary, and then hand the decision off to mutually agreed experts from the tournament’s phone tree. The substance of the disagreement is communicated (without identities), and the experts apply the rules and decide.

8 https://www.naqt.com/rules/#protest

For example, a particularly inept Jeopardy! contestant9 answered endoscope to “Your surgeon could choose to take a look inside you with this type of fiber-optic instrument”. Since the van Doren scandal (Freedman, 1997), every television trivia contestant has an advocate assigned from an auditing company. In this case, the advocate initiated a process that went to a panel of judges, who then ruled that endoscope (a more general term) was also correct.

9 http://www.j-archive.com/showgame.php?game_id=6112

The need for a similar process seems to have been well recognized in the earliest days of QA system bake-offs such as TREC-QA, and Voorhees (2008) notes that

   [d]ifferent QA runs very seldom return exactly the same [answer], and it is quite difficult to determine automatically whether the difference [. . . ] is significant.

In stark contrast to this, QA datasets typically only provide a single string or, if one is lucky, several strings. A correct answer means exactly matching these strings or at least having a high token overlap F1, and failure to agree with the pre-recorded admissible answers will put you at an uncontestable disadvantage on the leaderboard (Section 2.2).
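For concreteness, this is roughly the token-overlap computation that SQuAD-style evaluations use (a sketch, not the official evaluation script): lowercase, strip punctuation and articles, then measure bag-of-tokens overlap against each gold string. Under it, Samuel Clemens earns nothing against a gold answer of Mark Twain:

```python
import re
import string
from collections import Counter
from typing import List

def normalize(text: str) -> List[str]:
    """Lowercase, drop punctuation and articles, split on whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def token_f1(prediction: str, gold: str) -> float:
    pred_tokens, gold_tokens = normalize(prediction), normalize(gold)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def best_f1(prediction: str, golds: List[str]) -> float:
    """Score against the most favorable of the pre-recorded gold strings."""
    return max(token_f1(prediction, g) for g in golds)

# An answer any human judge would accept gets zero credit.
print(best_f1("Samuel Clemens", ["Mark Twain", "Twain"]))  # 0.0
print(best_f1("Mark Twain", ["Mark Twain", "Twain"]))      # 1.0
```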
To illustrate how current evaluations fall short of meaningful discrimination, we qualitatively analyze two near-SOTA systems on SQuAD V1.1 on CodaLab: the original XLNet (Yang et al., 2019) and a subsequent iteration called XLNet-123.10

10 We could not find a paper describing XLNet-123 in detail; the submission is by http://tia.today.

Despite XLNet-123’s margin of almost four absolute F1 (94 vs 98) on development data, a manual inspection of a sample of 100 of XLNet-123’s wins indicates that around two-thirds are ‘spurious’: 56% are likely to be considered not only equally good but essentially identical; 7% are cases where the answer set omits a correct alternative; and 5% of cases are ‘bad’ questions.11

11 Examples in Appendix C.

Our goal is not to dwell on the exact proportions, to minimize the achievements of these strong systems, or to minimize the usefulness of quantitative evaluations. We merely want to raise the limitation of ‘blind automation’ for distinguishing between systems on a leaderboard.

Taking our cue from the trivia community, here is an alternative for MRQA. Test sets are only created for a specific time; all systems are submitted simultaneously. Then, all questions and answers are revealed. System authors can protest correctness rulings on questions, directly addressing the issues above. After agreement is reached, quantitative metrics are computed for comparison purposes—despite their inherent limitations, they at least can be trusted. Adopting this for MRQA would require creating a new, smaller test set every year. However, this would gradually refine the annotations and process.

This suggestion is not novel: Voorhees and Tice (2000) conclude that automatic evaluations are sufficient “for experiments internal to an organization where the benefits of a reusable test collection are most significant (and the limitations are likely to be understood)” (our emphasis) but that “satisfactory techniques for [automatically] evaluating new runs” have not been found yet. We are not aware of any change on this front—if anything, we seem to have become more insensitive as a community to just how limited our current evaluations are.

3.3   Focus on the Bubble

While every question should be perfect, time and resources are limited. Thus, authors of tournaments have a policy of “focusing on the bubble”, where the “bubble” is the set of questions most likely to discriminate between top teams.

For humans, authors and editors focus on the questions and clues that they predict will decide the tournament. These questions are thoroughly playtested, vetted, and edited. Only after these questions have been perfected will the other questions undergo the same level of polish.
For computers, the same logic applies. Authors should ensure that these questions are correct, free of ambiguity, and unimpeachable. However, as far as we can tell, the authors of QA datasets do not give any special attention to these questions.

Unlike a human trivia tournament, however—with the finite patience of the participants—this does not mean that you should necessarily remove all of the easy or hard questions from your dataset; rather, spend more of your time/effort/resources on the bubble. You would not want to introduce a sampling bias that leads to inadvertently forgetting how to answer questions like “who is buried in Grant’s tomb?” (Dwan, 2000, Chapter 7).

4   Why Quizbowl is the Gold Standard

We now narrow our thus far wide-ranging QA discussion to a specific format: Quizbowl, which has many of the desirable properties outlined above. We have no delusion that mainstream QA will universally adopt this format. However, given the community’s emphasis on fair evaluation, computer QA can borrow aspects from the gold standard of human QA. We discuss what this may look like in Section 5, but first we describe the gold standard of human QA.

We have shown several examples of Quizbowl questions, but we have not yet explained in detail how the format works; see Rodriguez et al. (2019) for a more comprehensive description. You might be scared off by how long the questions are. However, in real Quizbowl trivia tournaments, they are not finished, because the questions are designed to be interrupted.

Interruptable   A moderator reads a question. Once someone knows the answer, they use a signaling device to “buzz in”. If the player who buzzed is right, they get points. Otherwise, they lose points and the question continues for the other team.

Not all trivia games with buzzers have this property, however. For example, take Jeopardy!, the subject of Watson’s tour de force (Ferrucci et al., 2010). While Jeopardy! also uses signaling devices, these only work at the end of the question; Ken Jennings, one of the top Jeopardy! players (and also a Quizbowler), explains it in a Planet Money interview (Malone, 2019):

   Jennings: The buzzer is not live until Alex finishes reading the question. And if you buzz in before your buzzer goes live, you actually lock yourself out for a fraction of a second. So the big mistake on the show is people who are all adrenalized and are buzzing too quickly, too eagerly.
   Malone: OK. To some degree, Jeopardy! is kind of a video game, and a crappy video game where it’s, like, light goes on, press button—that’s it.
   Jennings: (Laughter) Yeah.

Thus, Jeopardy!’s buzzers are a gimmick to ensure good television; however, Quizbowl buzzers discriminate knowledge (Section 2.3). Similarly, while TriviaQA (Joshi et al., 2017) is written by knowledgeable writers, the questions are not pyramidal.

Pyramidal   Recall that an effective dataset (tournament) discriminates the best from the rest—the higher the proportion of effective questions ρ, the better. Quizbowl’s ρ is nearly 1.0 because discrimination happens within a question: after every word, an answerer must decide whether they have enough information to answer the question. Quizbowl questions are arranged to be maximally pyramidal: they begin with hard clues—ones that require deep understanding—and move to more accessible clues that are well known.
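That word-by-word decision is easy to state as code. Below is a minimal sketch of a buzzing policy; the guesser, its confidence threshold, and the toy question are placeholders for illustration, not a system from this paper:

```python
from typing import Callable, List, Optional, Tuple

def play_question(words: List[str],
                  guesser: Callable[[str], Tuple[str, float]],
                  threshold: float = 0.8) -> Optional[Tuple[int, str]]:
    """Read a pyramidal question one word at a time and buzz as soon as the
    guesser is confident enough. Returns (words read, answer), or None if
    the guesser never buzzes."""
    for i in range(1, len(words) + 1):
        prefix = " ".join(words[:i])
        guess, confidence = guesser(prefix)  # hypothetical model call
        if confidence >= threshold:
            return i, guess                  # earlier buzzes beat later ones
    return None

# A toy guesser that only becomes confident once a giveaway clue appears.
def toy_guesser(prefix: str) -> Tuple[str, float]:
    if "Vicodin" in prefix:
        return "Gregory House", 0.95
    return "Gregory House", 0.30

question = ("He's not Sherlock Holmes, but his address is 221B. For ten "
            "points, name this misanthropic, Vicodin-dependent doctor.")
print(play_question(question.split(), toy_guesser))
```

Because a correct early buzz requires more knowledge than a correct late one, every extra word a system needs is a measurement of how much it knows.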
Well-Edited   Quizbowl questions are created in phases. First, the author selects the answer and assembles (pyramidal) clues. A subject editor then removes ambiguity, adjusts acceptable answers, and tweaks clues to optimize discrimination. Finally, a packetizer ensures the overall set is diverse, has uniform difficulty, and is without repeats.

Unnatural   Trivia questions are fake: the asker already knows the answer. But they’re no more fake than a course’s final exam, which, like a leaderboard, is designed to test knowledge.

Experts know when questions are ambiguous; while “what play has a character whose father is dead” could be Hamlet, Antigone, or Proof, a good writer’s knowledge avoids the ambiguity. When authors omit these cues, the question is derided as a “hose” (Eltinge, 2013), which robs the tournament of fun (Section 2.1).

One of the benefits of contrived formats is a focus on specific phenomena. Dua et al. (2019) exclude questions an existing MRQA system could answer to focus on challenging quantitative reasoning. One of the trivia experts consulted in Wallace et al. (2019) crafted a question that tripped up neural QA systems with “this author opens Crime and Punishment”; the top system confidently answers Fyodor Dostoyevski. However, that phrase was embedded in a longer question, “The narrator in Cogwheels by this author opens Crime and Punishment to find it has become The Brothers Karamazov”. Again, this shows the inventiveness and linguistic dexterity of the trivia community.

A counterargument is that when real humans ask questions—e.g., on Yahoo! Questions (Szpektor and Dror, 2013), Quora (Iyer et al., 2017), or web search (Kwiatkowski et al., 2019)—they ignore the craft of question writing. Real humans tend to react to unclear questions with confusion or divergent answers, often making explicit in their answers how they interpreted the original question (“I assume you meant. . . ”).

Given that real-world applications will have to deal with the inherent noise and ambiguity of unclear questions, our systems must cope with it. However, dealing with it is not the same as glossing over their complexities.

Complicated   Quizbowl is more complex than other datasets. Unlike other datasets, where you just need to decide what to answer, you also need to choose when to answer the question. While this improves the dataset’s discrimination, it can hurt popularity because you cannot copy/paste code from other QA tasks. The complicated pyramidal structure also makes some questions—e.g., what is log base four of sixty-four—difficult12 to ask. However, the underlying mechanisms (e.g., reinforcement learning) share properties with other tasks, such as simultaneous translation (Grissom II et al., 2014; Ma et al., 2019), human incremental processing (Levy et al., 2008; Levy, 2011), and opponent modeling (He et al., 2016).

12 But not always impossible, as IHSSBCA shows:

   This is the smallest counting number which is the radius of a sphere whose volume is an integer multiple of π. It is also the number of distinct real solutions to the equation x^7 − 19x^5 = 0. This number also gives the ratio between the volumes of a cylinder and a cone with the same heights and radii. Give this number, equal to the log base four of sixty-four.

5   A Call to Action

You may disagree with the superiority of Quizbowl as a QA framework (even among trivia nerds, not all agree. . . de gustibus non est disputandum). In this final section, we hope to distill our advice into a call to action regardless of your question format of choice. Here are our recommendations if you want to have an effective leaderboard.

Talk to Trivia Nerds   You should talk to trivia nerds because they have useful information (not just about the election of 1876). Trivia is not just the accumulation of information but also connecting disparate facts (Jennings, 2006). These skills are exactly those that we want computers to develop.

Trivia nerds are writing questions anyway; we can save money and time if we pool resources. Computer scientists benefit if the trivia community writes questions that aren’t trivial for computers to solve (e.g., avoiding quotes and named entities). The trivia community benefits from tools that make their job easier: show related questions, link to Wikipedia, or predict where humans will answer.

Likewise, the broader public has unique knowledge and skills. In contrast to low-paid crowdworkers, public platforms for question answering and citizen science (Bowser et al., 2013) are brimming with free expertise if you can engage the relevant communities. For example, the Quora query “Is there a nuclear control room on nuclear aircraft carriers?” is purportedly answered by someone who worked in such a room (Humphries, 2017). As machine learning algorithms improve, the “good enough” crowdsourcing that got us this far may simply not be enough for continued progress.

Many question answering datasets benefit from the efforts of the trivia community. Ethically using the data requires acknowledging their contributions and using their input to create datasets (Jo and Gebru, 2020, Consent and Inclusivity).

Eat Your Own Dog Food   As you develop new question answering tasks, you should feel comfortable playing the task as a human. Importantly, this is not just to replicate what crowdworkers are doing (also important) but to remove hidden assumptions, institute fair metrics, and define the task well. For this to feel real, you will need to keep score; have all of your coauthors participate and compare their scores.

Again, we emphasize that human and computer skills are not identical, but this is a benefit: humans’ natural aversion to unfairness will help you create a better task, while computers will blindly optimize a broken objective function (Bostrom, 2003; ?). As you go through the process of playing on your question–answer dataset, you can see where you might have fallen short on the goals we outline in Section 3.
Won’t Somebody Look at the Data?   After QA datasets are released, there should also be deeper, more frequent discussion of actual questions within the NLP community. Part of every post-mortem of trivia tournaments is a detailed discussion of the questions, where good questions are praised and bad questions are excoriated. This is not meant to shame the writers but rather to help build and reinforce cultural norms: questions should be well written, precise, and fulfill the creator’s goals. Just like trivia tournaments, QA datasets resemble a product for sale. Creators want people to invest time and sometimes money (e.g., GPU hours) in using their data and submitting to their leaderboards. It is “good business” to build a reputation for quality questions and for discussing individual questions.

Similarly, discussing and comparing the actual predictions made by the competing systems should be part of any competition culture—without it, it is hard to tell what a couple of points on some leaderboard mean. To make this possible, we recommend that leaderboards include an easy way for anyone to download a system’s development predictions for qualitative analyses.

Make Questions Discriminative   We argue that questions should be discriminative (Section 2.3), and while Quizbowl is one solution (Section 4), not everyone is crazy enough to adopt this (beautiful) format. For more traditional QA tasks, you can maximize the usefulness of your dataset by ensuring as many questions as possible are challenging (but not impossible) for today’s QA systems.

But you can use some Quizbowl intuitions to improve discrimination. In visual QA, you can offer increasing resolutions of the image. For other settings, create pyramidality by adding metadata: coreference, disambiguation, or alignment to a knowledge base. In short, consider multiple versions/views of your data that progress from difficult to easy. This not only makes more of your dataset discriminative but also reveals what makes a question answerable.

Embrace Multiple Answers or Specify Specificity   As QA moves to more complicated formats and answer candidates, what constitutes a correct answer becomes more complicated. Fully automatic evaluations are valuable for both training and quick-turnaround evaluation. In cases where annotators disagree, the question should explicitly state what level of specificity is required (e.g., September 1, 1939 vs. 1939, or Leninism vs. socialism). Or, if not all questions have a single answer, link answers to a knowledge base with multiple surface forms or explicitly enumerate which answers are acceptable.

Appreciate Ambiguity   If your intended QA application has to handle ambiguous questions, do justice to the ambiguity by making it part of your task—for example, recognize the original ambiguity and resolve it (“did you mean. . . ”) instead of giving credit for happening to ‘fit the data’.

To ensure that our datasets properly “isolate the property that motivated it in the first place” (?), we need to explicitly appreciate the unavoidable ambiguity instead of silently glossing over it.13

13 Not surprisingly, ‘inherent’ ambiguity is not limited to QA; Pavlick and Kwiatkowski (2019) show natural language inference data have ‘inherent disagreements’ between humans and advocate for recovering the full range of accepted inferences.

This is already an active area of research, with conversational QA being a new setting actively explored by several datasets (Reddy et al., 2018; Choi et al., 2018), and with other work explicitly focusing on identifying useful clarification questions (Rao and Daumé III, 2018), thematically linked questions (Elgohary et al., 2018), or resolving ambiguities that arise from coreference or pragmatic constraints by rewriting underspecified question strings in context (Elgohary et al., 2019).

Revel in Spectacle   However, with more complicated systems and evaluations, a return to the yearly evaluations of TREC QA may be the best option. This improves not only the quality of evaluation (we can have real-time human judging) but also lets the test set reflect the build-it/break-it cycle (Ruef et al., 2016), as attempted by the 2019 iteration of FEVER. Moreover, another lesson the QA community could learn from trivia games is to turn evaluation into a spectacle: exciting games with a telegenic host. This has a benefit to the public, who see how QA systems fail on difficult questions, and to QA researchers, who have a spoonful of fun sugar to inspect their systems’ output and their competitors’.

In between are automatic metrics that mimic the flexibility of human raters, inspired by machine translation evaluations (Papineni et al., 2002; Specia and Farzindar, 2010) or summarization (Lin, 2004). However, we should not forget that these metrics were introduced as ‘understudies’—good enough when quick evaluations are needed for system building but no substitute for a proper evaluation.
metrics were introduced as ‘understudies’—good                Christof Monz, Pavel Pecina, Matt Post, Herve
enough when quick evaluations are needed for sys-             Saint-Amand, Radu Soricut, Lucia Specia, and Aleš
                                                              Tamchyna. 2014. Findings of the 2014 workshop on
tem building but no substitute for a proper evalu-
                                                              statistical machine translation. In Proceedings of the
ation. In machine translation, Laubli et al. (2020)           Ninth Workshop on Statistical Machine Translation,
reveal that crowdworkers cannot spot the errors               pages 12–58, Baltimore, Maryland, USA. Associa-
that neural MT systems make—fortunately, trivia               tion for Computational Linguistics.
nerds are cheaper than professional translators.            Nick Bostrom. 2003. Ethical issues in advanced arti-
Be Honest in Crowning QA Champions  While—particularly for leaderboards—it is tempting to turn everything into a single number, recognize that there are often different sub-tasks and types of players who deserve recognition. A simple model that requires less training data or runs in under ten milliseconds may be objectively more useful than a bloated, brittle monster of a system that has a slightly higher F1 (Dodge et al., 2019). While you may only rank by a single metric (this is what trivia tournaments do too), you may want to recognize the highest-scoring model that was built by undergrads, took no more than one second per example, was trained only on Wikipedia, etc.
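One lightweight way to honor multiple kinds of winners is to report a champion per track rather than a single number. The sketch below is illustrative only: the track definitions and submission fields (latency_s, training_data, team_type, f1) are hypothetical, not an existing leaderboard's schema.

from typing import Callable, Dict, List

TRACKS: Dict[str, Callable[[Dict], bool]] = {
    "overall": lambda s: True,
    "under one second per example": lambda s: s["latency_s"] <= 1.0,
    "trained only on Wikipedia": lambda s: s["training_data"] == "wikipedia",
    "undergraduate team": lambda s: s["team_type"] == "undergrad",
}

def champions(submissions: List[Dict]) -> Dict[str, Dict]:
    """Highest-F1 submission that satisfies each track's constraint."""
    winners = {}
    for track, eligible in TRACKS.items():
        pool = [s for s in submissions if eligible(s)]
        if pool:
            winners[track] = max(pool, key=lambda s: s["f1"])
    return winners

You still rank by one metric within each track, as trivia tournaments do, but no single bloated system monopolizes the recognition.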
   Finally, if you want to make human–computer comparisons, pick the right humans. Paraphrasing a participant of the 2019 MRQA workshop (Fisch et al., 2019), being better than the average human at brain surgery does not imply superhuman performance in brain surgery. Likewise, beating a distracted crowdworker on QA is not QA's endgame. If your task is realistic, fun, and challenging, you will find experts to play against your computer. Not only will this give you human baselines worth reporting—they can also tell you how to fix your QA dataset... after all, they've been at it longer than you have.
Acknowledgements  Many thanks to Massimiliano Ciaramita, Jon Clark, Christian Buck, Emily Pitler, and Michael Collins for insightful discussions that helped frame these ideas. Thanks to Kevin Kwok for permission to use the Protobowl screenshot and information.
References

Luis von Ahn. 2006. Games with a purpose. Computer, 39:92–94.
David Baber. 2015. Television Game Show Hosts: Biographies of 32 Stars. McFarland.
Ondřej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, Radu Soricut, Lucia Specia, and Aleš Tamchyna. 2014. Findings of the 2014 workshop on statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 12–58, Baltimore, Maryland, USA. Association for Computational Linguistics.
Nick Bostrom. 2003. Ethical issues in advanced artificial intelligence. Institute of Advanced Studies in Systems Research and Cybernetics, 2:12–17.
Anne Bowser, Derek Hansen, Yurong He, Carol Boston, Matthew Reid, Logan Gunnell, and Jennifer Preece. 2013. Using gamification to inspire new citizen science volunteers. In Proceedings of the First International Conference on Gameful Design, Research, and Applications, Gamification '13, pages 18–25, New York, NY, USA. ACM.
Jordan Boyd-Graber, Brianna Satinoff, He He, and Hal Daumé III. 2012. Besting the quiz master: Crowdsourcing incremental classification games. In Proceedings of Empirical Methods in Natural Language Processing.
Seth Chaiklin. 2003. The Zone of Proximal Development in Vygotsky's Analysis of Learning and Instruction. Learning in Doing: Social, Cognitive and Computational Perspectives, pages 39–64. Cambridge University Press.
Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. 2018. QuAC: Question answering in context. In Proceedings of Empirical Methods in Natural Language Processing.
Anthony Cuthbertson. 2018. Robots can now read better than humans, putting millions of jobs at risk.
Paul Diamond. 2009. How To Make 100 Pounds A Night (Or More) As A Pub Quizmaster. DP Quiz.
Jesse Dodge, Suchin Gururangan, Dallas Card, Roy Schwartz, and Noah A. Smith. 2019. Show your work: Improved reporting of experimental results. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2185–2194, Hong Kong, China. Association for Computational Linguistics.
Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Conference of the North American Chapter of the Association for Computational Linguistics, pages 2368–2378, Minneapolis, Minnesota.
Matthew Dunn, Levent Sagun, Mike Higgins, V. Ugur Güney, Volkan Cirik, and Kyunghyun Cho. 2017. SearchQA: A new Q&A dataset augmented with context from a search engine. CoRR, abs/1704.05179.
R. Dwan. 2000. As Long as They're Laughing: Groucho Marx and You Bet Your Life. Midnight Marquee Press.
Ahmed Elgohary, Denis Peskov, and Jordan Boyd-Graber. 2019. Can you unpack that? Learning to rewrite questions-in-context. In Empirical Methods in Natural Language Processing.
Ahmed Elgohary, Chen Zhao, and Jordan Boyd-Graber. 2018. Dataset and baselines for sequential open-domain question answering. In Empirical Methods in Natural Language Processing.
Stephen Eltinge. 2013. Quizbowl lexicon.
Manaal Faruqui and Dipanjan Das. 2018. Identifying well-formed natural language questions. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 798–803, Brussels, Belgium. Association for Computational Linguistics.
Shi Feng and Jordan Boyd-Graber. 2019. What AI can do for me: Evaluating machine learning interpretations in cooperative play. In International Conference on Intelligent User Interfaces.
David Ferrucci, Eric Brown, Jennifer Chu-Carroll, James Fan, David Gondek, Aditya A. Kalyanpur, Adam Lally, J. William Murdock, Eric Nyberg, John Prager, Nico Schlaefer, and Chris Welty. 2010. Building Watson: An overview of the DeepQA project. AI Magazine, 31(3).
Adam Fisch, Alon Talmor, Robin Jia, Minjoon Seo, Eunsol Choi, and Danqi Chen, editors. 2019. Proceedings of the 2nd Workshop on Machine Reading for Question Answering. Association for Computational Linguistics, Hong Kong, China.
Morris Freedman. 1997. The fall of Charlie Van Doren. The Virginia Quarterly Review, 73(1):157–165.
Mor Geva, Yoav Goldberg, and Jonathan Berant. 2019. Are we modeling the task or the annotator? An investigation of annotator bias in natural language understanding datasets. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1161–1166, Hong Kong, China. Association for Computational Linguistics.
Alvin Grissom II, He He, Jordan Boyd-Graber, and John Morgan. 2014. Don't until the final verb wait: Reinforcement learning for simultaneous machine translation. In Proceedings of Empirical Methods in Natural Language Processing.
Robert X. D. Hawkins, Andreas Stuhlmüller, Judith Degen, and Noah D. Goodman. 2015. Why do you ask? Good questions provoke informative answers. In Proceedings of the 37th Annual Meeting of the Cognitive Science Society.
He He, Jordan Boyd-Graber, Kevin Kwok, and Hal Daumé III. 2016. Opponent modeling in deep reinforcement learning. In Proceedings of the International Conference of Machine Learning.
R. Robert Henzel. 2018. NAQT eligibility overview.
Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Proceedings of Advances in Neural Information Processing Systems.
Bryan Humphries. 2017. Is there a nuclear control room on nuclear aircraft carriers?
Shankar Iyer, Nikhil Dandekar, and Kornél Csernai. 2017. Quora question pairs.
Ken Jennings. 2006. Brainiac: Adventures in the curious, competitive, compulsive world of trivia buffs. Villard.
Eun Seo Jo and Timnit Gebru. 2020. Lessons from archives: Strategies for collecting sociocultural data in machine learning. In Proceedings of the Conference on Fairness, Accountability, and Transparency.
Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611.
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics.
Samuel Laubli, Sheila Castilho, Graham Neubig, Rico Sennrich, Qinlan Shen, and Antonio Toral. 2020. A set of recommendations for assessing human–machine parity in language translation. Journal of Artificial Intelligence Research, 67.
Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2019. Latent retrieval for weakly supervised open domain question answering. In Proceedings of the Association for Computational Linguistics.
Roger Levy. 2011. Integrating surprisal and uncertain-input models in online sentence comprehension: Formal techniques and empirical results. In Proceedings of the Association for Computational Linguistics, pages 1055–1065.
Roger P. Levy, Florencia Reali, and Thomas L. Griffiths. 2008. Modeling the effects of memory on human online sentence processing with particle filters. In Proceedings of Advances in Neural Information Processing Systems, pages 937–944.
Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004).
Mingbo Ma, Liang Huang, Hao Xiong, Renjie Zheng, Kaibo Liu, Baigong Zheng, Chuanqiang Zhang, Zhongjun He, Hairong Liu, Xing Li, Hua Wu, and Haifeng Wang. 2019. STACL: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3025–3036, Florence, Italy. Association for Computational Linguistics.
Kenny Malone. 2019. How uncle Jamie broke Jeopardy.
Adam Najberg. 2018. Alibaba AI model tops humans in reading comprehension.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the Association for Computational Linguistics.
Ellie Pavlick and Tom Kwiatkowski. 2019. Inherent disagreements in human textual inferences. Transactions of the Association for Computational Linguistics, 7:677–694.
Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of the Association for Computational Linguistics.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of Empirical Methods in Natural Language Processing.
Sudha Rao and Hal Daumé III. 2018. Learning to ask good questions: Ranking clarification questions using neural expected value of perfect information. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2737–2746, Melbourne, Australia. Association for Computational Linguistics.
Siva Reddy, Danqi Chen, and Christopher D. Manning. 2018. CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249–266.
Pedro Rodriguez, Shi Feng, Mohit Iyyer, He He, and Jordan L. Boyd-Graber. 2019. Quizbowl: The case for incremental question answering. CoRR, abs/1904.04792.
Marc-Antoine Rondeau and T. J. Hazen. 2018. Systematic error analysis of the Stanford question answering dataset. In Proceedings of the Workshop on Machine Reading for Question Answering, pages 12–20, Melbourne, Australia. Association for Computational Linguistics.
Andrew Ruef, Michael Hicks, James Parker, Dave Levin, Michelle L. Mazurek, and Piotr Mardziel. 2016. Build it, break it, fix it: Contesting secure development. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, CCS '16, pages 690–703, New York, NY, USA. ACM.
Dan Russell. 2020. Why control-F is the single most important thing you can teach someone about search. SearchReSearch.
Lucia Specia and Atefeh Farzindar. 2010. Estimating machine translation post-editing effort with HTER. In AMTA 2010 Workshop: Bringing MT to the User: MT Research and the Translation Industry, the 9th Conference of the Association for Machine Translation in the Americas.
Saku Sugawara, Kentaro Inui, Satoshi Sekine, and Akiko Aizawa. 2018. What makes reading comprehension questions easier? In Proceedings of Empirical Methods in Natural Language Processing.
Saku Sugawara, Pontus Stenetorp, Kentaro Inui, and Akiko Aizawa. 2020. Assessing the benchmarking capacity of machine reading comprehension datasets. In Association for the Advancement of Artificial Intelligence.
Idan Szpektor and Gideon Dror. 2013. From query to question in one click: Suggesting synthetic questions to searchers. In Proceedings of WWW 2013.
David Taylor, Colin McNulty, and Jo Meek. 2012. Your starter for ten: 50 years of University Challenge.
James Thorne, Andreas Vlachos, Oana Cocarascu, Christos Christodoulopoulos, and Arpit Mittal, editors. 2018. Proceedings of the First Workshop on Fact Extraction and VERification (FEVER). Association for Computational Linguistics, Brussels, Belgium.
Ellen M. Voorhees. 2003. Evaluating the evaluation: A case study using the TREC 2002 question answering track. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 260–267.
Ellen M. Voorhees. 2008. Evaluating Question Answering System Performance, pages 409–430. Springer Netherlands, Dordrecht.
Ellen M. Voorhees and Dawn M. Tice. 2000. Building a question answering test collection. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 200–207. ACM.
Eric Wallace, Pedro Rodriguez, Shi Feng, Ikuya Yamada, and Jordan Boyd-Graber. 2019. Trick me if you can: Human-in-the-loop generation of adversarial question answering examples. Transactions of the Association for Computational Linguistics, 10.
Dirk Weissenborn, Georg Wiese, and Laura Seiffe. 2017. Making neural QA as simple as possible but not simpler. In Conference on Computational Natural Language Learning.
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. CoRR, abs/1906.08237.
Wen-tau Yih, Ming-Wei Chang, Xiaodong He, and Jianfeng Gao. 2015. Semantic parsing via staged query graph generation: Question answering with knowledge base. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (ACL 2015), pages 1321–1331. ACL.
Xingdi Yuan, Jie Fu, Marc-Alexandre Cote, Yi Tay, Christopher Pal, and Adam Trischler. 2019. Interactive machine comprehension with information seeking agents.