PAQ: 65 Million Probably-Asked Questions and What You Can Do With Them


Patrick Lewis†‡   Yuxiang Wu‡   Linqing Liu‡   Pasquale Minervini‡   Heinrich Küttler†
Aleksandra Piktus†   Pontus Stenetorp‡   Sebastian Riedel†‡
†Facebook AI Research   ‡University College London
plewis@fb.com
Abstract

Open-domain Question Answering models which directly leverage question-answer (QA) pairs, such as closed-book QA (CBQA) models and QA-pair retrievers, show promise in terms of speed and memory compared to conventional models which retrieve and read from text corpora. QA-pair retrievers also offer interpretable answers, a high degree of control, and are trivial to update at test time with new knowledge. However, these models lack the accuracy of retrieve-and-read systems, as substantially less knowledge is covered by the available QA-pairs relative to text corpora like Wikipedia. To facilitate improved QA-pair models, we introduce Probably Asked Questions (PAQ), a very large resource of 65M automatically-generated QA-pairs. We introduce a new QA-pair retriever, RePAQ, to complement PAQ. We find that PAQ preempts and caches test questions, enabling RePAQ to match the accuracy of recent retrieve-and-read models, whilst being significantly faster. Using PAQ, we train CBQA models which outperform comparable baselines by 5%, but trail RePAQ by over 15%, indicating the effectiveness of explicit retrieval. RePAQ can be configured for size (under 500MB) or speed (over 1K questions per second) whilst retaining high accuracy. Lastly, we demonstrate RePAQ's strength at selective QA, abstaining from answering when it is likely to be incorrect. This enables RePAQ to "back-off" to a more expensive state-of-the-art model, leading to a combined system which is both more accurate and 2x faster than the state-of-the-art model alone.

1   Introduction

Open-domain QA (ODQA) systems usually have access to a background corpus that can be used to answer questions. Models which explicitly exploit this corpus are commonly referred to as Open-book models (Roberts et al., 2020). They typically index the whole corpus, and then retrieve-and-read documents in order to answer questions on-the-fly (Chen et al., 2017; Lee et al., 2019a, inter alia).

A second class of models, closed-book question answering (CBQA) models, have recently been proposed. They learn to directly map questions to answers from training question-answer (QA) pairs without access to a background corpus (Roberts et al., 2020; Ye et al., 2021). These models usually take the form of pretrained seq2seq models such as T5 (Raffel et al., 2020) or BART (Lewis et al., 2019a), fine-tuned on QA-pairs. It has recently been shown that current closed-book models mostly memorise training QA-pairs, and can struggle to answer questions that do not overlap with training data (Lewis et al., 2020b).

Models which explicitly retrieve (training) QA-pairs, rather than memorizing them in parameters, have been shown to perform competitively with CBQA models (Lewis et al., 2020b; Xiao et al., 2020). These models have a number of useful properties, such as fast inference, interpretable outputs (by inspecting retrieved QA-pairs), and the ability to update the model's knowledge at test time by adding or removing QA-pairs.

However, CBQA and QA-pair retriever models are currently not competitive with retrieve-and-read systems in terms of accuracy, largely because the training QA-pairs they operate on cover substantially less knowledge than background corpora like Wikipedia. In this paper, we explore whether massively expanding the coverage of QA-pairs enables CBQA and QA-pair retriever models which are competitive with retrieve-and-read models.

We present Probably Asked Questions (PAQ), a semi-structured Knowledge Base (KB) of 65M natural language QA-pairs, which models can memorise and/or learn to retrieve from. PAQ differs from traditional KBs in that questions and answers are stored in natural language, and that questions
are generated such that they are likely to appear in ODQA datasets. PAQ is automatically constructed using a question generation model and Wikipedia. To ensure generated questions are not only answerable given the passage they are generated from, we employ a global filtering post-processing step employing a state-of-the-art ODQA system. This greatly reduces the amount of wrong and ambiguous questions compared to other approaches (Fang et al., 2020; Alberti et al., 2019), and is critical for high-accuracy, downstream QA models.

To complement PAQ we develop RePAQ, a question answering model based on question retrieval/matching models, using dense Maximum Inner Product Search-based retrieval, and optionally, re-ranking. We show that PAQ and RePAQ provide accurate ODQA predictions, at the level of relatively recent large-scale retrieve-and-read systems such as RAG (Lewis et al., 2020a) on NaturalQuestions (Kwiatkowski et al., 2019a) and TriviaQA (Joshi et al., 2017). PAQ instances are annotated with scores that reflect how likely we expect questions to appear, which can be used to control the memory footprint of RePAQ by filtering the KB accordingly. As a result, RePAQ is extremely flexible, allowing us to configure QA systems with near state-of-the-art results, very small memory size, or inference speeds of over 1,000 questions per second. Memory-optimised configurations of RePAQ won two of the four tracks of the 2020 EfficientQA NeurIPS competition (Min et al., 2020a), with system sizes of 336MB and 29MB, respectively.

We also show that PAQ is a useful source of training data for CBQA models. BART models trained on PAQ outperform baselines trained on standard data by 5%. However, these models struggle to effectively memorise all the knowledge in PAQ, lagging behind RePAQ by 15%. This demonstrates the effectiveness of RePAQ at leveraging PAQ.

Finally, we show that since RePAQ's question matching score correlates well with QA accuracy, it effectively "knows when it doesn't know", allowing for selective question answering (Rodriguez et al., 2019) where QA systems may abstain from answering if confidence is too low. Whilst answer abstaining is important in its own right, it also enables an elegant "back-off" approach where we can defer to a more accurate but expensive QA system when answer confidence is low. This enables us to make use of the best of both speed and accuracy.

In summary, we make the following contributions: i) we introduce PAQ, 65M QA-pairs automatically generated from Wikipedia, and demonstrate the importance of global filtering for high quality; ii) we introduce RePAQ, a QA system designed to utilize PAQ, and demonstrate how it can be optimised for memory, speed or accuracy; iii) we investigate the utility of PAQ for CBQA models, improving by 5%, but note significant headroom to RePAQ; iv) we demonstrate RePAQ's strength on selective QA, enabling us to combine RePAQ with a state-of-the-art QA model, making it both more accurate and 2x faster.[1]

[1] The PAQ data, models and code will be made available at https://github.com/facebookresearch/PAQ

2   Open-Domain Question Answering

ODQA is the task of answering natural language factoid questions from an open set of domains. A typical question might be "when was the last year astronauts landed on the moon?", with a target answer "1972". The goal of ODQA is to develop an answer function m : Q ↦ A, where Q and A respectively are the sets of all possible questions and answers. We assume there is a distribution P(q, a) of QA-pairs, defined over Q × A. A good answer function will minimise the expected error over P(q, a) with respect to some loss function, such as answer string match. In practice, we do not have access to P(q, a), and instead rely on an empirical sample of QA-pairs K drawn from P, and measure the empirical loss of answer functions on K. Our goal in this work is to implicitly model P(q, a) so that we can draw a large sample of QA-pairs, PAQ, which we can train on and/or retrieve from. Drawing a sufficiently large sample will overlap with K, essentially pre-empting and caching questions that humans may ask at test-time. This allows us to shift computation from test-time to train-time compared to retrieve-and-read methods.
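As a concrete, simplified illustration of the empirical loss above, the sketch below scores an answer function on a sample K with an exact-match loss; the normalisation mirrors the usual SQuAD-style answer normalisation, and all function names are ours rather than from the PAQ codebase.

```python
import re
import string

def normalize(ans: str) -> str:
    """Lower-case, strip punctuation and articles, collapse whitespace (SQuAD-style)."""
    ans = ans.lower()
    ans = "".join(ch for ch in ans if ch not in set(string.punctuation))
    ans = re.sub(r"\b(a|an|the)\b", " ", ans)
    return " ".join(ans.split())

def exact_match_accuracy(answer_fn, qa_sample):
    """Empirical accuracy of an answer function m: Q -> A over a sample K of (q, a) pairs."""
    correct = sum(normalize(answer_fn(q)) == normalize(a) for q, a in qa_sample)
    return correct / len(qa_sample)

# Usage sketch:
# exact_match_accuracy(my_model.answer,
#                      [("when was the last year astronauts landed on the moon", "1972")])
```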
3   Generating Question-Answer Pairs

In this section, we describe the process for generating PAQ. Given a large background corpus C, our QA-pair generation process consists of the following components:

   1. A passage selection model ps(c), to identify passages which humans are likely to ask questions about.
   2. An answer extraction model pa(a | c), for identifying spans in a passage that are more likely to be answers to a question.
   3. A question generator model pq(q | a, c) that, given a passage and an answer, generates a question.
   4. A filtering QA model pf(a | q, C) that generates an answer for a given question. If an answer generated by pf does not match the answer a question was generated from, the question is discarded. This ensures generated questions are consistent (Alberti et al., 2019).

Figure 1: Top Left: Generation pipeline for QA-pairs in PAQ. Top Right: PAQ used as training data for CBQA models. Bottom Left: RePAQ retrieves similar QA-pairs to input questions from PAQ. Bottom right: RePAQ's confidence is predictive of accuracy. If confidence is low, we can defer to slower, more accurate systems, like FiD.

As shown in Fig. 1, these models are applied sequentially to generate QA-pairs for PAQ, in a similar manner to related work in contextual QA generation (Alberti et al., 2019; Lewis et al., 2019b). First a passage c is selected with a high probability according to ps. Next, candidate answers a are extracted from c using pa, and questions q are generated for each answer in the passage using pq. Lastly, pf generates a new answer a′ for the question. If the source answer a matches a′, then (q, a) is deemed consistent and is added to PAQ. In the following, we describe each component in detail.
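The four components compose into a simple generate-then-filter loop. A minimal sketch of that loop is given below; the model objects and their methods (select_passages, extract_answers, generate_question, answer) are hypothetical stand-ins for ps, pa, pq and pf, not the released PAQ implementation.

```python
def generate_paq(corpus, p_s, p_a, p_q, p_f, num_passages):
    """Sketch of the PAQ pipeline: select passages, extract answers,
    generate questions, keep only globally consistent QA-pairs."""
    paq = []
    # 1) rank passages by how likely humans are to ask about them
    for passage in p_s.select_passages(corpus, top_k=num_passages):
        # 2) extract candidate answer spans from the passage
        for answer in p_a.extract_answers(passage):
            # 3) generate a question conditioned on (answer, passage)
            question = p_q.generate_question(answer, passage)
            # 4) global filtering: an ODQA model must recover the same
            #    answer from the question alone (no source passage)
            predicted = p_f.answer(question)
            if predicted == answer:  # consistency check (string match, simplified)
                paq.append((question, answer))
    return paq
```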
3.1   Passage Selection, ps

The passage selection model ps is used to find passages which are likely to contain information that humans may ask about, and would thus be good candidates to generate questions from. We learn ps using a similar method to Karpukhin et al. (2020b). Concretely, we assume access to a set of positive passages C+ ⊂ C, which we obtain from answer-containing passages from an ODQA training set. Since we do not have a set of labelled negative passages, we sample negatives from the corpus, either randomly or using heuristics. We then train a model to minimise negative log-likelihood of positive passages relative to negatives.

We implement ps with RoBERTa (Liu et al., 2019a) and obtain positive passages from Natural Questions (NQ, Kwiatkowski et al., 2019b). We sample easy negatives at random from Wikipedia, and hard negatives from the same Wikipedia article as the positive passage. Easy negatives help the model to learn topics of interest, and hard negatives help to differentiate between interesting and non-interesting passages from the same article. We evaluate by measuring how highly positive validation passages are ranked amongst negatives.
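A minimal sketch of how such a passage selector could be trained is shown below: a RoBERTa encoder scores each passage, and we minimise the negative log-likelihood of the positive passage relative to its sampled easy and hard negatives. Model choice, helper names and hyperparameters are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
encoder = AutoModel.from_pretrained("roberta-base")
scorer = torch.nn.Linear(encoder.config.hidden_size, 1)  # passage "interestingness" logit

def passage_scores(passages):
    batch = tokenizer(passages, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state[:, 0]  # [CLS]-position representation
    return scorer(hidden).squeeze(-1)                  # one score per passage

def selection_loss(positive, easy_negatives, hard_negatives):
    """NLL of the positive passage relative to its negatives."""
    scores = passage_scores([positive] + easy_negatives + hard_negatives)
    return F.cross_entropy(scores.unsqueeze(0), torch.tensor([0]))  # positive is index 0
```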
3.2   Answer Extraction, pa

Given a passage, this component identifies spans that are likely to be answers to questions. We consider two alternatives: an off-the-shelf Named Entity Recogniser (NER) or training a BERT (Devlin et al., 2019) answer extraction model on NQ.

The NER answer extractor simply extracts all named entities from a passage.[2] The majority of questions in ODQA datasets consist of entity mentions (Kwiatkowski et al., 2019a; Joshi et al., 2017), so this approach can achieve high answer coverage. However, as we extract all entity mentions in a passage, we may extract unsuitable mentions, or miss answers that do not conform to the NER system's annotation schema. The trained answer span extractor aims to address these issues.

[2] We use the spaCy (Honnibal and Montani, 2017) NER system, trained on OntoNotes (Hovy et al., 2006).
BERT answer span extraction is typically performed by modelling answer start and end independently, obtaining answer probabilities via pa(a | c) = p(a_start | c) × p(a_end | c) (Devlin et al., 2019). We found this approach to be sub-optimal for modelling multiple span occurrences in a passage. We instead use an approach that breaks the conditional independence of answer spans by directly predicting pa(a | c) = p([a_start, a_end] | c). This model first feeds a passage through BERT, then concatenates the start and end token representations of all possible spans of up to length 30, and feeds them into an MLP to compute pa(a | c). At generation time, the answer extraction component extracts a constant number of spans from each passage, ranked by their extraction probabilities.
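The sketch below illustrates the span-scoring idea: start and end token representations of every candidate span (up to a maximum length) are concatenated and scored jointly by an MLP, rather than scoring starts and ends independently. Shapes and names are ours; this is not the released extractor.

```python
import torch
import torch.nn as nn

class SpanExtractor(nn.Module):
    """Scores p([a_start, a_end] | c) directly for all spans up to max_len tokens."""
    def __init__(self, hidden_size=768, max_len=30):
        super().__init__()
        self.max_len = max_len
        self.mlp = nn.Sequential(nn.Linear(2 * hidden_size, hidden_size),
                                 nn.ReLU(),
                                 nn.Linear(hidden_size, 1))

    def forward(self, token_reps):                     # token_reps: [seq_len, hidden]
        seq_len = token_reps.size(0)
        spans, feats = [], []
        for start in range(seq_len):
            for end in range(start, min(start + self.max_len, seq_len)):
                spans.append((start, end))
                # joint span feature: concatenated start and end representations
                feats.append(torch.cat([token_reps[start], token_reps[end]]))
        logits = self.mlp(torch.stack(feats)).squeeze(-1)
        probs = torch.softmax(logits, dim=0)           # one distribution over all candidate spans
        return spans, probs
```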
3.3   Question Generation, pq

Given a passage and an answer, this component generates likely questions with that answer. To indicate the answer and its occurrence in the passage, we prepend the answer to the passage and label the answer span with surrounding special tokens. We construct the dataset from NQ, TriviaQA, and SQuAD, and perform standard fine-tuning of BART-base (Lewis et al., 2019a) to obtain pq.
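A sketch of that input construction is below; the special-token strings are placeholders of our own choosing, not the ones used in the released models.

```python
def build_generator_input(passage: str, answer: str,
                          ans_sep="[ANSWER]", span_l="[SPAN]", span_r="[/SPAN]"):
    """Prepend the answer and mark its occurrence in the passage with special tokens."""
    marked = passage.replace(answer, f"{span_l} {answer} {span_r}", 1)
    return f"{answer} {ans_sep} {marked}"

# Example:
# build_generator_input("Apollo 17 astronauts left the Moon in December 1972.", "December 1972")
# The resulting string is the encoder input to a fine-tuned BART-base, which decodes a question.
```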
3.4   Filtering, pf

The filtering model pf improves the quality of generated questions by ensuring that they are consistent: that the answer they were generated from is likely to be a valid answer to the question. Previous work (Alberti et al., 2019; Fang et al., 2020) has employed a machine reading comprehension (MRC) QA model for this purpose, pf(a | q, c), which produces an answer when supplied with a question and the passage it was generated from. We refer to this as local filtering. However, local filtering will not remove questions which are ambiguous (Min et al., 2020b), and can only be answered correctly with access to the source passage. Thus, we use an ODQA model for filtering, pf(a | q, C), supplied with only the generated question, and not the source passage. We refer to this as global filtering, and later show that it is vital for strong downstream results. We use FiD-base with 50 retrieved passages, trained on NQ (Izacard and Grave, 2020).
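Global filtering amounts to asking a full ODQA model the generated question, without the source passage, and keeping the pair only if the answers match. A minimal sketch, assuming a generic odqa_model.answer(question) interface rather than the actual FiD code:

```python
def globally_consistent(question, source_answer, odqa_model, normalize):
    """Keep a generated QA-pair only if an open-domain QA model,
    given just the question, reproduces the source answer."""
    predicted = odqa_model.answer(question)  # answers from the whole corpus C, no source passage
    return normalize(predicted) == normalize(source_answer)

# Usage sketch (candidates is a list of generated (question, answer) pairs):
# filtered = [(q, a) for (q, a) in candidates
#             if globally_consistent(q, a, fid_base, normalize)]
```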
4   Question Answering using PAQ

We consider two uses of PAQ for building QA models. The first is to use PAQ as a source of training QA-pairs for CBQA models. The second treats PAQ as a KB, which models learn to directly retrieve from. These use-cases are related, as CBQA models have been shown to memorise the train data in their parameters, latently retrieving from them at test time (Lewis et al., 2020b; Domingos, 2020).

4.1   PAQ for Closed-Book QA

We fine-tune a BART-large (Lewis et al., 2019a) with QA-pairs from the concatenation of the training data and PAQ, using a similar training procedure to Roberts et al. (2020). We use early stopping on the validation set and a batch size of 512, and note learning is slow, requiring 70 epochs on PAQ. Following recent best practices (Alberti et al., 2019; Yang et al., 2019), we then fine-tune on the training QA-pairs only, using validation Exact Match score for early stopping (Rajpurkar et al., 2018).

We note that an effective CBQA model must be able to understand the semantics of questions and how to generate answers, in addition to being able to store a large number of facts in its parameters. This model thus represents a kind of combined parametric knowledgebase and retrieval system (Petroni et al., 2020a). The model proposed in the next section, RePAQ, represents an explicit non-parametric instantiation of this idea.

4.2   RePAQ

RePAQ is a retrieval model which operates on KBs of QA-pairs, such as PAQ. RePAQ extends recently proposed nearest neighbour QA-pair retriever models (Lewis et al., 2020b; Xiao et al., 2020). These models assume access to a KB of N QA-pairs K = {(q1, a1), ..., (qN, aN)}, and provide an answer to a test question q by finding the most relevant QA-pair (q′, a′) in K using a scalable relevance function, then returning a′ as the answer to q. Such a function could be implemented using standard information retrieval techniques (e.g. TF-IDF) or learnt from training data. RePAQ is learnt from ODQA data and consists of a neural dense retriever, optionally followed by a neural reranker.

4.2.1   RePAQ Retriever

Our retriever adopts the dense Maximum Inner Product Search (MIPS) retriever paradigm, which has recently been shown to obtain state-of-the-art results in a number of settings (Karpukhin et al., 2020b; Lee et al., 2021, inter alia). Our goal is to embed queries q and indexed items d into a representation space via embedding functions gq and gd,
so that the inner product gq(q)⊤gd(d) is maximised for items relevant to q. In our case, queries are questions and indexed items are QA-pairs (q′, a′). We make our retriever symmetric by embedding q′ rather than (q′, a′), meaning that only one embedding function gq is required, which maps questions to embeddings. This applies a useful inductive bias, and we find that it aids stability during training.

Learning the embedding function gq is complicated by the lack of labelled question pair paraphrases in ODQA datasets. We propose a latent variable approach similar to retrieval-augmented generation (RAG, Lewis et al., 2020b),[3] where we index training QA-pairs rather than documents. For an input question q, the top K QA-pairs (q′, a′) are retrieved by a retriever pret, where pret(q′ | q) ∝ exp(gq(q)⊤gq(q′)). These are then fed into a seq2seq model pgen which generates an answer for each retrieved QA-pair, before a final answer is produced by marginalising:

    p(a | q) = Σ_{(q′, a′) ∈ top-k(pret(· | q))} pgen(a | q, q′, a′) · pret(q′ | q)

As pgen generates answers token-by-token, credit can be given for retrieving helpful QA-pairs which do not exactly match the target answer. For example, for the question "when was the last time anyone was on the moon" and target answer "December 1972", retrieving "when was the last year astronauts landed on the moon" with answer "1972" will help to generate the target answer, despite the answers having different granularity. After training, we discard pret,[4] retaining only the question embedder gq. We implement pret with ALBERT (Lan et al., 2020) with an output dimension of 768, and pgen with BART-large (Lewis et al., 2019a). We train using the top 100 retrieved QA-pairs, and refresh the embedding index every 5 training steps.

[3] Other methods, such as heuristically constructing paraphrase pairs assuming that questions with the same answer are paraphrases, and training with sampled negatives, would also be valid, but were not competitive in our early experiments.
[4] We could use pgen as a reranker/aggregator for QA, but in practice find it both slower and less accurate than the reranker described in Section 4.2.2.
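A compact sketch of the marginalised objective above: retrieved QA-pairs are scored by the question embedder, the seq2seq model scores the gold answer conditioned on each retrieved pair, and the two are combined with a log-sum-exp so that credit flows to helpful retrievals. Function names and tensor plumbing are illustrative only, not the released training code.

```python
import torch

def marginal_nll(query_emb, retrieved_embs, gen_answer_logprobs):
    """-log p(a|q), with p(a|q) = sum_k p_gen(a | q, q'_k, a'_k) * p_ret(q'_k | q).

    query_emb:            [dim]     embedding g_q(q) of the input question
    retrieved_embs:       [K, dim]  embeddings g_q(q'_k) of the top-K retrieved questions
    gen_answer_logprobs:  [K]       log p_gen(a | q, q'_k, a'_k) from the seq2seq model
    """
    retrieval_logits = retrieved_embs @ query_emb                      # inner products
    retrieval_logprobs = torch.log_softmax(retrieval_logits, dim=0)    # log p_ret(q'_k | q)
    # log-sum-exp over retrieved pairs marginalises the latent retrieval choice
    return -torch.logsumexp(gen_answer_logprobs + retrieval_logprobs, dim=0)
```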
Once the embedder gq is trained, we build a test-time QA system by embedding and indexing a QA KB such as PAQ. Answering is achieved by retrieving the most similar stored question, and returning its answer. The matched QA-pair can be displayed to the user, providing a mechanism for more interpretable answers than CBQA models and many retrieve-and-read generators which consume thousands of tokens to generate an answer. Efficient MIPS libraries such as FAISS (Johnson et al., 2017) enable RePAQ's retriever to answer 100s to 1000s of questions per second (see Section 5.2.3). We use a KB for RePAQ consisting of training set QA-pairs concatenated with QA-pairs from PAQ.
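At test time, answering therefore reduces to a nearest-neighbour lookup over question embeddings. The sketch below, assuming a generic embed(questions) function that returns a float32 numpy array, builds a flat inner-product FAISS index over stored questions and returns the answer of the closest one; it is a simplification of the actual system.

```python
import numpy as np
import faiss

def build_index(question_embs: np.ndarray) -> faiss.Index:
    """Exact MIPS index over QA-pair question embeddings (float32, shape [N, dim])."""
    index = faiss.IndexFlatIP(question_embs.shape[1])
    index.add(question_embs)
    return index

def answer(query: str, embed, index, answers, k=1):
    """Embed the query, retrieve the nearest stored question, return its answer."""
    q = embed([query]).astype(np.float32)        # shape [1, dim]
    scores, ids = index.search(q, k)
    top = int(ids[0, 0])
    return answers[top], float(scores[0, 0])     # answer plus a confidence-style score
```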
4.2.2   RePAQ Reranker

The accuracy of RePAQ can be improved using a reranker on the top-K QA-pairs from the retriever. The reranker uses cross-encoding (Humeau et al., 2020), and includes the retrieved answer in the scoring function for richer featurisation. We concatenate the input question q, the retrieved question q′ and its answer a′ with separator tokens, and feed it through ALBERT. We obtain training data in the following manner: for a training QA-pair, we first retrieve the top 2K QA-pairs from PAQ using RePAQ's retriever. If one of the retrieved QA-pairs has the correct answer, we treat that QA-pair as a positive, and randomly sample K−1 of the incorrect retrieved questions as negatives. We train by minimising negative log likelihood of positives relative to 10 negatives, and rerank 50 retrieved pairs at test time. The reranker improves accuracy at the expense of some speed. However, as QA-pairs consist of fewer tokens than passages, the reranker is still faster than retrieve-and-read models, even when using architectures such as ALBERT-xxlarge.
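The reranker is a cross-encoder: the input question, a retrieved question and its answer are concatenated with separators and scored jointly. A schematic version using a generic ALBERT sequence classifier from HuggingFace is shown below; the separator choice and the single-logit scoring head are assumptions, not the exact released configuration.

```python
import torch
from transformers import AlbertTokenizer, AlbertForSequenceClassification

tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
reranker = AlbertForSequenceClassification.from_pretrained("albert-base-v2", num_labels=1)

def rerank(question, retrieved_pairs):
    """Score each retrieved (q', a') against the input question; return them best-first."""
    texts = [f"{question} [SEP] {q_r} [SEP] {a_r}" for q_r, a_r in retrieved_pairs]
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        scores = reranker(**batch).logits.squeeze(-1)   # one relevance score per pair
    order = torch.argsort(scores, descending=True)
    return [retrieved_pairs[i] for i in order], scores[order]
```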
5   Results

We first examine the PAQ resource in general, before exploring how both CBQA models and RePAQ perform using PAQ, comparing to recently published systems. We use the Natural Questions (NQ, Kwiatkowski et al., 2019b) and TriviaQA (Joshi et al., 2017) datasets to assess performance, evaluating with the standard Exact Match (EM) score.

5.1   Examining PAQ

We generate PAQ by applying the pipeline described in Section 3 to the Wikipedia dump from Karpukhin et al. (2020a), which splits Wikipedia into 100-word passages. We use the passage selection model ps to rank all 21M passages, and generate from the top 10M, before applying global filtering.[5] We are interested in understanding the effectiveness of different answer extractors, and whether generating more questions per answer span leads to better results. To address these questions, we create three versions of PAQ, described below. PAQ_L uses the learnt answer extractor, and a question generator trained on NQ and TriviaQA. We extract 8 answers per passage and use a beam size of 4 for question generation. In PAQ_L,1 we only use the top scoring question in the beam, whereas in PAQ_L,4 we use all four questions from the beam, allowing for several questions to be generated from one answer in a passage. PAQ_NE,1 uses the NER answer extractor, and a generator trained only on NQ. PAQ_NE,1 allows us to assess whether diversity in the form of answer extractors and question generators leads to better results. The final KB, referred to as just "PAQ", is the union of PAQ_L and PAQ_NE.

[5] Generation was stopped when downstream performance with RePAQ did not significantly improve with more questions.

Dataset     Extracted Answers   Unique Qs   Filtered QAs   Ratio   Coverage (NQ)   Coverage (TQA)
PAQ_L,1     76.4M               58.0M       14.1M          24.4%   88.3            90.2
PAQ_L,4     76.4M               225.2M      53.8M          23.9%   89.8            90.9
PAQ_NE,1    122.2M              65.4M       12.0M          18.6%   83.5            88.3
PAQ         165.7M              279.2M      64.9M          23.2%   90.2            91.1

Table 1: PAQ dataset statistics and ODQA dataset answer coverage. "Ratio" refers to the number of generated questions which pass the global consistency filter.

As shown in Table 1, PAQ consists of 65M filtered QA pairs.[6] This was obtained by extracting 165M answer spans and generating 279M unique questions before applying global filtering. Table 1 shows that the PAQ_L pipeline is more efficient than PAQ_NE, with 24.4% of QA-pairs surviving filtering, compared to 18%.

[6] Each question only has one answer due to global filtering.

PAQ Answer Coverage   To evaluate answer extractors, we calculate how many answers in the validation sets of TriviaQA and NQ also occur in PAQ's filtered QA-pairs.[7] Table 1 shows that the answer coverage of PAQ is very high – over 90% for both TriviaQA and NQ. Comparing PAQ_L with PAQ_NE shows that the learnt extractor achieves higher coverage, but the union of the two leads to the highest coverage overall. Comparing PAQ_L,1 and PAQ_L,4 indicates that using more questions from the beam also results in higher coverage.

[7] Performed using standard answer normalisation (Rajpurkar et al., 2016).

PAQ Question Generation Quality   Illustrative examples from PAQ can be seen in Table 2. Manual inspection of 50 questions from PAQ reveals that 82% of questions accurately capture information from the passage and contain sufficient details to locate the answer. 16% of questions confuse the semantics of certain answer types, either by conflating similar entities in the passage or by misinterpreting rare phrases (see examples 7 and 8 in Table 2). Finally, we find small numbers of grammar errors (such as example 5) and mismatched wh-words (5% and 2% respectively).[8]

[8] Further details and automatic metrics in Appendix A.3.

Other observations   PAQ often contains several paraphrases of the same QA-pair. This redundancy reflects how information is distributed in Wikipedia, with facts often mentioned on several different pages. Generating several questions per answer span also increases redundancy. Whilst this means that PAQ could be more information-dense if a de-duplication step was applied, we later show that RePAQ always improves with more questions in its KB (Section 5.2.1). This suggests that it is worth increasing redundancy for greater coverage.

5.2   Question Answering Results

In this section, we compare the PAQ-leveraging models proposed in Section 4 to existing approaches. We primarily compare to a state-of-the-art retrieve-and-read model, Fusion-in-Decoder (FiD, Izacard and Grave, 2020). FiD uses DPR (Karpukhin et al., 2020b) to retrieve relevant passages from Wikipedia, and feeds them into T5 (Raffel et al., 2020) to generate a final answer.

Table 3 shows the highest-accuracy configurations of our models alongside recent state-of-the-art approaches. We make the following observations: comparing rows 2 and 7 shows that a CBQA BART model trained with PAQ outperforms a comparable NQ-only model by 5%, and is only 3% behind T5-11B (row 1), which has 27x more parameters. Second, we note strong results for RePAQ on NQ (47.7%, row 9), outperforming recent retrieve-and-read systems such as RAG by over 3% (row 4).

Multi-task training RePAQ on NQ and TriviaQA improves TriviaQA results by 1-2% (comparing rows 8-9 with 10-11). RePAQ does not perform quite as strongly on TriviaQA (see Section 5.2.6), but is within 5% of RAG, and outperforms concurrent work on real-time QA, DensePhrases (row 6, Lee et al., 2021). Lastly, row 12 shows that combining RePAQ and FiD-large into a combined system is 0.9% more accurate than FiD-large (see Section 5.2.4 for more details).
#  Question                                                   Answer               Comment
1  who created the dutch comic strip panda                    Martin Toonder       ✓
2  what was the jazz group formed by john hammond in 1935     Goodman Trio         ✓
3  astrakhan is russia's main market for what commodity       fish                 ✓
4  what material were aramaic documents rendered on           leather              ✓
5  when did the giant panda chi chi died                      22 July 1972         ✓, Grammar error
6  pinewood is a village in which country                     England              ~, Also a Pinewood village in USA
7  who was the mughal emperor at the battle of lahore         Ahmad Shah Bahadur   ✗, Confuses with Ahmad Shah Abdali
8  how many jersey does mitch richmond have in the nba        2                    ✗, His Jersey No. was 2

Table 2: Representative Examples from PAQ. ✓ indicates correct, ~ ambiguous and ✗ incorrect facts respectively.

 #      Model Type                     Model                                           NaturalQuestions TriviaQA
 1      Closed-book                    T5-11B-SSM (Roberts et al., 2020)                     35.2             51.8
 2      Closed-book                    BART-large (Lewis et al., 2020b)                      26.5             26.7
 3      QA-pair retriever              Dense retriever (Lewis et al., 2020b)                 26.7             28.9
 4      Open-book, retrieve-and-read   RAG-Sequence (Lewis et al., 2020a)                    44.5             56.8
 5      Open-book, retrieve-and-read   FiD-large, 100 docs (Izacard and Grave, 2020)         51.4             67.6
 6      Open-book, phrase index        DensePhrases (Lee et al., 2021)                       40.9             50.7
 7      Closed-book                    BART-large, pre-finetuned on PAQ                      32.7             33.2
 8      QA-pair retriever              RePAQ (retriever only)                                41.2             38.8
 9      QA-pair retriever              RePAQ (with reranker)                                 47.7             50.7
 10     QA-pair retriever              RePAQ-multitask (retriever only)                      41.7             41.3
 11     QA-pair retriever              RePAQ-multitask (with reranker)                       47.6             52.1
 12     QA-pair retriever              RePAQ-multitask w/ FiD-Large Backoff                  52.3             67.3

Table 3: Exact Match score for highest accuracy RePAQ configurations in comparison to recent state-of-the-art
systems. Highest score indicated in bold, highest non-retrieve-and-read model underlined.

5.2.1   Ablating PAQ using RePAQ

Table 4 shows RePAQ's accuracy when using different PAQ variants. To establish the effect of filtering, we evaluate RePAQ with unfiltered, locally-filtered and globally-filtered QA-pairs on PAQ_L,1. Comparing rows 2, 3 and 4 shows that global filtering is crucial, leading to a 9% and 14% improvement over locally-filtered and unfiltered datasets respectively.

We also note a general trend in Table 4 that adding more globally-filtered questions improves accuracy. Rows 4 and 5 show that using four questions per answer span is better than generating one (+0.9%), and rows 5, 6 and 7 show that combining PAQ_NE and PAQ_L results in a further 1.2% improvement. Empirically we did not observe any cases where increasing the number of globally filtered QA-pairs reduced accuracy, even when there were millions of QA-pairs already.

#  KB          Filtering   Size     Exact Match (Retrieve)   Exact Match (Rerank)
1  NQ-Train    -           87.9K    27.9                     31.8
2  PAQ_L,1     None        58.0M    21.6                     30.6
3  PAQ_L,1     Local       31.7M    28.3                     34.9
4  PAQ_L,1     Global      14.1M    38.6                     44.3
5  PAQ_L,4     Global      53.8M    40.3                     45.2
6  PAQ_NE,1    Global      12.0M    37.3                     42.6
7  PAQ         Global      64.9M    41.6                     46.4

Table 4: The effect of different PAQ subsets on the NQ validation accuracy of RePAQ.

5.2.2   System Size vs Accuracy

PAQ's QA-pairs are accompanied by scores of how likely they are to be asked. These scores can be used to filter the KB and reduce the RePAQ system size. A similar procedure can be used to filter the background corpus for a retrieve-and-read model (Izacard et al., 2020). We compare the system size of a FiD-large system and RePAQ as the number of items (passages and QA-pairs respectively) in their indexes are reduced. We select which passages and QA-pairs are included using the passage selection model ps.[9] Further experimental details can be found in appendix A.4.

[9] Here, we use PAQ_L,1, which is 5x smaller than the full PAQ, but retains most of the accuracy (see Table 4).
Figure 2: System size vs. accuracy for RePAQ and FiD-large as a function of the number of items in the index.

Model       Retriever   Reranker   Exact Match   Q/sec
FiD-large   -           -          51.4          0.5
FiD-base    -           -          48.2          2
RePAQ       base        -          40.9          1400
RePAQ       large       -          41.2          1100
RePAQ       xlarge      -          41.5          800
RePAQ       base        base       45.7          55
RePAQ       large       xlarge     46.2          10
RePAQ       xlarge      xxlarge    47.6          6

Table 5: Inference speeds of various configurations of RePAQ compared to FiD models on NQ.
Fig. 2 shows that both system sizes can be reduced several-fold with only a small drop in accuracy, demonstrating the effectiveness of ps. FiD can achieve a higher accuracy, but requires larger system sizes. RePAQ can be reduced to a smaller size before a significant accuracy drop, driven primarily by the higher information density of QA-pairs relative to passages, and fewer model parameters used by RePAQ compared to FiD. Highly-optimized RePAQ models won the "500MB" and "Smallest System" tracks of the EfficientQA NeurIPS competition with disk images of 336MB and 29MB respectively. For further details on EfficientQA, the reader is referred to Min et al. (2020a).

5.2.3   Inference Speed vs Accuracy

We train a variety of differently-sized RePAQ models to explore the relationship between accuracy and inference speed. We use a fast Hierarchical Navigable Small World (HNSW) index in FAISS (Malkov and Yashunin, 2018; Johnson et al., 2017)[10] and measure the time required to evaluate the NQ test set on a system with access to one GPU.[11] Table 5 shows these results. Some retriever-only RePAQ models can answer over 1000 questions per second, and are relatively insensitive to model size, with ALBERT-base only scoring 0.5% lower than ALBERT-xlarge. They also outperform retrieve-and-read models like REALM (40.4%, Guu et al., 2020) and recent real-time QA models like DensePhrases (40.9%, Lee et al., 2021). We find that larger, slower RePAQ rerankers achieve higher accuracy. However, even the slowest RePAQ is 3x faster than FiD-base, whilst only being 0.8% less accurate, and 12x faster than FiD-large.

[10] The HNSW index has negligible (~0.1%) drop in retriever accuracy compared to a flat index.
[11] System details can be found in Appendix A.5.
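For the speed-oriented configurations, the flat index from the earlier sketch can be swapped for an approximate HNSW index in FAISS. The snippet below shows one way to set this up; the metric-taking constructor is available in recent FAISS releases, and the specific parameter values are illustrative rather than the ones used in the paper.

```python
import faiss

def build_hnsw_index(question_embs, m=32, ef_construction=128, ef_search=64):
    """Approximate MIPS over question embeddings with an HNSW graph."""
    dim = question_embs.shape[1]
    index = faiss.IndexHNSWFlat(dim, m, faiss.METRIC_INNER_PRODUCT)
    index.hnsw.efConstruction = ef_construction   # graph-build quality
    index.hnsw.efSearch = ef_search               # query-time accuracy/speed trade-off
    index.add(question_embs)                      # float32 array, shape [N, dim]
    return index
```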
Figure 3: Risk-coverage plot for FiD and RePAQ.

5.2.4   Selective Question Answering

QA systems should not just be able to answer accurately, but also "know when they don't know", and abstain from answering when they are unlikely to produce good answers. This task is challenging for current systems (Asai and Choi, 2020; Jiang et al., 2020b), and has been approached in MRC by training on unanswerable questions (Rajpurkar et al., 2018) and for trivia systems by leveraging incremental QA formats (Rodriguez et al., 2019).

We find that RePAQ's retrieval and reranking scores are well-correlated with answering correctly. This allows RePAQ to be used for selective question answering by abstaining when the score is below a certain threshold. Figure 3 shows a risk-coverage plot (Wang et al., 2018) for RePAQ and FiD, where we use FiD's answer log probability for its answer confidence.[12] The plot shows the accuracy on the top N% highest confidence answers for NQ. If we require models to answer 75% of user questions, RePAQ's accuracy on the questions it does answer is 59%, whereas FiD, which has poorer calibration, scores only 55%. This difference is even more pronounced with stricter thresholds – with coverage of 50%, RePAQ outperforms FiD by over 10%. FiD only outperforms RePAQ when we require systems to answer more than 85% of questions.

[12] We also investigate improving FiD's calibration using an auxiliary model, see Appendix A.6. We find that the most effective way to calibrate FiD is to use RePAQ's confidences.

Whilst RePAQ's selective QA is useful in its own right, it also allows us to combine the slow but accurate FiD with the fast and precise RePAQ, which we refer to as backoff. We first try to answer with RePAQ, and if the confidence is below a threshold determined on validation data, we pass the question onto FiD. For NQ, the combined system is 2.1x faster than FiD-large, with RePAQ answering 57% of the questions, and the overall accuracy is 1% higher than FiD-large (see Table 3).

If inference speed is a priority, the threshold can be decreased so that RePAQ answers 80% of the questions, which retains the same overall accuracy as FiD, with a 4.6x speedup. For TriviaQA, the combined system backs off to FiD earlier, due to the stronger relative performance of FiD. Additional details can be found in Appendix A.6.
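The back-off scheme is essentially a confidence threshold: answer with RePAQ when its score clears a validation-tuned threshold, otherwise fall back to the slower FiD model. A minimal sketch, with repaq and fid as stand-in objects exposing an answer_with_score / answer interface (our naming, not the released API):

```python
def answer_with_backoff(question, repaq, fid, threshold):
    """Fast path: RePAQ. Slow path: defer to FiD when RePAQ is not confident."""
    answer, score = repaq.answer_with_score(question)   # e.g. the reranker score
    if score >= threshold:
        return answer, "repaq"
    return fid.answer(question), "fid"

# The threshold is chosen on validation data, e.g. to maximise overall accuracy
# or to fix the fraction of questions RePAQ is allowed to answer.
```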
5.2.5   Analysing RePAQ's Predictions

Some examples of top retrieved questions are shown in Table 6. When RePAQ answers correctly, the retrieved question is a paraphrase of the test question from PAQ in 89% of cases. As such, there is high (80.8 ROUGE-L) similarity between correctly-answered test questions and the top retrieved questions. 9% of test questions even exist verbatim in PAQ, and are thus trivial to answer. The reranker primarily improves over the retriever for ambiguous cases, and cases where the top retrieved answer does not have the right granularity. In 32% of cases, RePAQ does not retrieve the correct answer in the top 50 QA-pairs, suggesting a lack of coverage may be a significant source of error. In these cases, retrieved questions are much less similar to the test question than for correctly answered questions, dropping by 20 ROUGE-L. We also observe cases where retrieved questions match the test question, but the retrieved answer does not match the desired answer. This is usually due to different answer granularity, but in a small number of cases was due to factually incorrect answers.

Input: who was the film chariots of fire about                A: Eric Liddell
  who was the main character in chariots of fire              A: Eric Liddell     ✓
  who starred in the movie chariots of fire                   A: Ian Charleson    ✗
  which part did straan rodger play in chariots of fire       A: Sandy McGrath    ✗
  who played harold in the 1981 film chariots of fire         A: Ben Cross        ✗
  who is the main character in chariots of fire               A: Eric Liddell     ✓

Input: what is the meaning of the name didymus                A: twin
  what language does the name didymus come from               A: Greek            ✗
  where does the name didymus come from in english            A: Greek            ✗
  what does the word domus mean in english                    A: home             ✗
  how long has the term domus been used                       A: 1000s of years   ✗
  what does the greek word didyma mean                        A: twin             ✓

Input: what is the name of a group of llamas                  A: herd
  what are llamas and alpacas considered to be                A: domesticated     ✗
  what are the figures of llamas in azapa valley              A: Atoca            ✗
  what are the names of the llamas in azapa valley            A: Atoca            ✗
  what is the scientific name for camels and llamas           A: Camelidae        ✗
  are llamas bigger or smaller than current forms             A: larger           ✗

Table 6: Examples of top 5 retrieved QA-pairs for NQ. Italics indicate QA-pairs chosen by reranker.

5.2.6   Does the Filtering Model Limit RePAQ's Accuracy?

As RePAQ relies on retrieving paraphrases of test questions, we may expect that the ODQA filtering model places an upper bound on its performance. For example, if a valid QA-pair is generated which overlaps with a test QA-pair, but the filter cannot answer it correctly, that QA-pair will not be added to PAQ, and RePAQ cannot use it to answer the test question. The NQ FiD-base-50-passage model used for filtering scores 46.1% and 53.1% for NQ and TriviaQA respectively. RePAQ actually outperforms the filter model on NQ by 1.6%. This is possible because generated questions may be phrased in such a way that they are easier to answer, e.g. being less ambiguous (Min et al., 2020b). RePAQ can then retrieve the paraphrased QA-pair and answer correctly, even if the filter could not answer the test question directly. The filtering model's weaker scores on TriviaQA help explain why RePAQ is not as strong on this dataset. We speculate that using a stronger filtering model for TriviaQA would in turn improve RePAQ's results.

5.3   Closed-book QA vs RePAQ

Model                           Total   Q Overlap   A-only Overlap   No Overlap
CBQA BART w/ NQ                 26.5    67.6        10.2             0.8
CBQA BART w/ NQ+PAQ             28.2    52.8        24.4             9.4
  + final NQ finetune           32.7    69.8        22.2             7.51
RePAQ (retriever only)          41.7    65.4        31.7             21.4
RePAQ (with reranker)           47.3    73.5        39.7             26.0

Table 7: NQ behavioural splits (Lewis et al., 2020b). "Q overlap" are test questions with paraphrases in training data. "A-only" are test questions where answers appear in training data, but questions do not. "No overlap" are test questions where neither the question nor the answer overlaps.

Table 7 shows results on test set splits which measure how effectively models memorise QA-pairs
from the NQ train set (“Q overlap”), and generalise      that are likely to be asked. QA-pairs have also
to novel questions (“A overlap only” and “No over-       been used in semantic role labelling, such as
lap”).13 Comparing CBQA models trained on NQ             QA-SRL (FitzGerald et al., 2018).
5.3  Closed-book QA vs RePAQ

Table 7 shows results on test set splits which measure how effectively models memorise QA-pairs from the NQ train set (“Q overlap”) and generalise to novel questions (“A overlap only” and “No overlap”).13 Comparing CBQA models trained on NQ with those trained on NQ and PAQ shows that models trained with PAQ answer more questions correctly from the “A overlap only” and “No overlap” categories, indicating they have learnt facts not present in the NQ train set. Applying additional NQ fine-tuning to the PAQ CBQA model improves scores on “Q overlap” (indicating greater memorisation of NQ), but scores on the other categories drop (indicating reduced memorisation of PAQ). However, RePAQ, which explicitly retrieves from PAQ rather than memorising it in parameters, strongly outperforms the CBQA model in all categories, demonstrating that the CBQA model struggles to memorise enough facts from PAQ. CBQA models with more parameters may be better able to memorise PAQ, but have downsides in terms of system resources. Future work should address how to better store PAQ in CBQA model parameters.

13 See Lewis et al. (2020b) for further details.
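A rough sketch of how such splits can be computed is given below, following the categorisation of Lewis et al. (2020b); the exact-match question test is a simplification of the original annotated paraphrase detection, and the helper names and data layout are our own assumptions.

```python
# Rough sketch of the test-set partition behind Table 7, following
# Lewis et al. (2020b). "Q overlap" here uses a crude exact match on
# normalised questions (the original work uses annotated paraphrases);
# helper names and data layout are our own assumptions.
from typing import Dict, List, Tuple

def _norm(text: str) -> str:
    return " ".join(text.lower().split())

def overlap_split(
    train: List[Tuple[str, str]],
    test: List[Tuple[str, str]],
) -> Dict[str, List[Tuple[str, str]]]:
    train_questions = {_norm(q) for q, _ in train}
    train_answers = {_norm(a) for _, a in train}
    splits = {"Q overlap": [], "A overlap only": [], "No overlap": []}
    for question, answer in test:
        if _norm(question) in train_questions:
            splits["Q overlap"].append((question, answer))
        elif _norm(answer) in train_answers:
            splits["A overlap only"].append((question, answer))
        else:
            splits["No overlap"].append((question, answer))
    return splits
```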
6  Related Work

ODQA has received much attention, both for its practical applications and as a benchmark for how NLP models store and access knowledge (Chen and Yih, 2020; Petroni et al., 2020b).
KBQA  A number of early approaches to ODQA focused on using structured KBs (Berant et al., 2013) such as Freebase (Bollacker et al., 2007), with recent examples from Févry et al. (2020) and Verga et al. (2020). This approach often has high precision, but suffers when the KB does not match user requirements, or where the schema limits what knowledge can be stored. We populate our knowledge base with semi-structured QA-pairs which are specifically likely to be relevant at test time, mitigating both of these drawbacks while sharing many of the benefits, such as precision and extensibility.
Open Information Extraction  Our work touches on automatic KB construction and open information extraction (OpenIE) (Angeli et al., 2015). Here, the goal is to mine facts from free text into structured or semi-structured forms, typically (subject, relation, object) triples, for use in tasks such as slot-filling (Surdeanu, 2013). We generate natural language QA-pairs rather than OpenIE triples, and we do not attempt to extract all possible facts in a corpus, focusing only on those that are likely to be asked. QA-pairs have also been used in semantic role labelling, such as QA-SRL (FitzGerald et al., 2018).

Real-time ODQA  Systems that prioritise fast runtimes over accuracy are sometimes referred to as real-time QA systems (Seo et al., 2018). DenSPI (Seo et al., 2019) and a contemporary work, DensePhrases (Lee et al., 2021), approach this by indexing all possible phrases in a background corpus, and learn mappings from questions to passage-phrase pairs. We also build an index for fast answering, but generate and index globally-answerable questions. Indexing QA-pairs can be considered as indexing summaries of important facts from the corpus, rather than indexing the corpus itself. We also generate and store multiple questions per passage-answer pair, relieving information bottlenecks from encoding a passage-answer pair into a single vector.
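The contrast with phrase indexing can be illustrated with a small sketch: one vector per generated question, several questions allowed to share an answer, and answering reduced to nearest-neighbour search (here with FAISS; the encode function and data layout are assumptions, not the paper's released code).

```python
# Sketch of indexing QA-pairs instead of passages: one vector per generated
# question, several questions may map to the same answer, and answering is a
# nearest-neighbour lookup. FAISS (Johnson et al., 2017) is one way to build
# such an index; `encode` stands in for whatever question encoder is used.
import numpy as np
import faiss

def build_question_index(questions, encode):
    """Index one embedding per generated question (not per passage)."""
    vectors = np.asarray([encode(q) for q in questions], dtype="float32")
    index = faiss.IndexFlatIP(vectors.shape[1])  # exact inner-product search
    index.add(vectors)
    return index

def answer_from_index(test_question, index, answers, encode):
    """Return the answer paired with the nearest stored question."""
    query = np.asarray([encode(test_question)], dtype="float32")
    _, neighbours = index.search(query, 1)
    return answers[neighbours[0][0]]
```

An approximate index (e.g. HNSW; Malkov and Yashunin, 2018) can be swapped in when the 65M-question scale makes exact search too slow.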
                                                         selective QA. Generating PAQ is computationally
generate natural language QA-pairs rather than
                                                         intensive due to its large scale, but should be a use-
OpenIE triples, and we do not attempt to extract
                                                         ful, re-usable resource for more accurate, smaller
all possible facts in a corpus, focusing only those
                                                         and faster QA models. Nevertheless, future work
    13
         See Lewis et al. (2020b) for further details.   should be carried out to improve the efficiency of
7  Conclusion

We have introduced PAQ, a dataset of 65M QA-pairs, and explored how it could be used to improve ODQA models. We demonstrated the effectiveness of RePAQ, a system which retrieves from PAQ, in terms of accuracy, speed, space efficiency and selective QA. Generating PAQ is computationally intensive due to its large scale, but it should be a useful, re-usable resource for more accurate, smaller and faster QA models. Nevertheless, future work should be carried out to improve the efficiency of generation in order to expand PAQ’s coverage.

We also demonstrated PAQ’s utility for improved CBQA models, but note a large accuracy gap between our CBQA models and RePAQ. Exploring the trade-offs between storing and retrieving knowledge parametrically or non-parametrically is a topic of great current interest (Lewis et al., 2020a; De Cao et al., 2020), and PAQ should be a useful testbed for probing this relationship further. We also note that PAQ could be used for general data augmentation when training any open-domain QA model or retriever. Whilst we consider such work out of scope for this paper, leveraging PAQ to improve retrieve-and-read and other systems should be explored in future work.

References
Chris Alberti, Daniel Andor, Emily Pitler, Jacob Devlin, and Michael Collins. 2019. Synthetic QA Corpora Generation with Roundtrip Consistency. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6168–6173, Florence, Italy. Association for Computational Linguistics.
Gabor Angeli, Melvin Jose Johnson Premkumar, and Christopher D. Manning. 2015. Leveraging Linguistic Structure For Open Domain Information Extraction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 344–354, Beijing, China. Association for Computational Linguistics.
Akari Asai and Eunsol Choi. 2020. Challenges in Information Seeking QA: Unanswerable Questions and Paragraph Retrieval. arXiv:2010.11915.
Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In EMNLP, pages 1533–1544. ACL.
Kurt D. Bollacker, Robert P. Cook, and Patrick Tufts. 2007. Freebase: A shared database of structured general human knowledge. In AAAI, pages 1962–1963. AAAI Press.
Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to Answer Open-Domain Questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879, Vancouver, Canada. Association for Computational Linguistics.
Danqi Chen and Wen-tau Yih. 2020. Open-domain question answering. In ACL (tutorial), pages 34–37. Association for Computational Linguistics.
Nicola De Cao, Gautier Izacard, Sebastian Riedel, and Fabio Petroni. 2020. Autoregressive Entity Retrieval. arXiv:2010.00904.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Pedro Domingos. 2020. Every Model Learned by Gradient Descent Is Approximately a Kernel Machine. arXiv:2012.00152.
Xinya Du, Junru Shao, and Claire Cardie. 2017. Learning to Ask: Neural Question Generation for Reading Comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1342–1352, Vancouver, Canada. Association for Computational Linguistics.
Yuwei Fang, Shuohang Wang, Zhe Gan, Siqi Sun, and Jingjing Liu. 2020. Accelerating Real-Time Question Answering via Question Generation. arXiv:2009.05167.
Nicholas FitzGerald, Julian Michael, Luheng He, and Luke Zettlemoyer. 2018. Large-Scale QA-SRL Parsing. arXiv:1805.05377.
Jerome H. Friedman. 2001. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5):1189–1232.
Thibault Févry, Livio Baldini Soares, Nicholas FitzGerald, Eunsol Choi, and Tom Kwiatkowski. 2020. Entities as Experts: Sparse Memory Access with Entity Supervision. arXiv:2004.07202.
Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. REALM: Retrieval-Augmented Language Model Pre-Training. arXiv:2002.08909.
Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.
Tom Hosking and Sebastian Riedel. 2019. Evaluating Rewards for Question Generation Models. arXiv:1902.11049.
Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. 2006. OntoNotes: the 90% solution. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, NAACL-Short '06, pages 57–60, USA. Association for Computational Linguistics.
Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and Jason Weston. 2020. Poly-encoders: Transformer Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring. arXiv:1905.01969.
Gautier Izacard and Edouard Grave. 2020. Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering. arXiv:2007.01282.
Gautier Izacard, Fabio Petroni, Lucas Hosseini, Nicola De Cao, Sebastian Riedel, and Edouard Grave. 2020. A Memory Efficient Baseline for Open Domain Question Answering. arXiv:2012.15156.
Zhengbao Jiang, Jun Araki, Haibo Ding, and Graham Neubig. 2020a. How Can We Know When Language Models Know? arXiv:2012.00955.
Zhengbao Jiang, Wei Xu, Jun Araki, and Graham Neubig. 2020b. Generalizing Natural Language Analysis through Span-relation Representations. arXiv:1911.03822.
Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. Billion-scale similarity search with GPUs. arXiv:1702.08734.
Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada. Association for Computational Linguistics.
Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick S. H. Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020a. Dense passage retrieval for open-domain question answering. In EMNLP (1), pages 6769–6781. Association for Computational Linguistics.
Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020b. Dense Passage Retrieval for Open-Domain Question Answering. arXiv:2004.04906.
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019a. Natural Questions: a Benchmark for Question Answering Research. Transactions of the Association for Computational Linguistics.
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur P. Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019b. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466.
Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations.
Jinhyuk Lee, Mujeen Sung, Jaewoo Kang, and Danqi Chen. 2021. Learning Dense Representations of Phrases at Scale. arXiv:2012.12624.
Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2019a. Latent Retrieval for Weakly Supervised Open Domain Question Answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6086–6096, Florence, Italy. Association for Computational Linguistics.
Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2019b. Latent retrieval for weakly supervised open domain question answering. In ACL (1), pages 6086–6096. Association for Computational Linguistics.
Mike Lewis and Angela Fan. 2018. Generative Question Answering: Learning to Answer the Whole Question. In International Conference on Learning Representations.
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019a. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. arXiv:1910.13461.
Patrick Lewis, Ludovic Denoyer, and Sebastian Riedel. 2019b. Unsupervised Question Answering by Cloze Translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4896–4910, Florence, Italy. Association for Computational Linguistics.
Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020a. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401.
Patrick Lewis, Pontus Stenetorp, and Sebastian Riedel. 2020b. Question and Answer Test-Train Overlap in Open-Domain Question Answering Datasets. arXiv:2008.02637.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019a. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692.
Yu A. Malkov and D. A. Yashunin. 2018. Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. arXiv:1603.09320.
Sewon Min, Jordan Boyd-Graber, Chris Alberti, Danqi Chen, Eunsol Choi, Michael Collins, Kelvin Guu, Hannaneh Hajishirzi, Kenton Lee, Jennimaria Palomaki, Colin Raffel, Adam Roberts, Tom Kwiatkowski, Patrick Lewis, Yuxiang Wu, Heinrich Küttler, Linqing Liu, Pasquale Minervini, Pontus Stenetorp, Sebastian Riedel, Sohee Yang, Minjoon Seo, Gautier Izacard, Fabio Petroni, Lucas Hosseini, Nicola De Cao, Edouard Grave, Ikuya Yamada, Sonse Shimaoka, Masatoshi Suzuki, Shumpei Miyawaki, Shun Sato, Ryo Takahashi, Jun Suzuki, Martin Fajcik, Martin Docekal, Karel Ondrej, Pavel Smrz, Hao Cheng, Yelong Shen, Xiaodong Liu, Pengcheng He, Weizhu Chen, Jianfeng Gao, Barlas Oguz, Xilun Chen, Vladimir Karpukhin, Stan Peshterliev, Dmytro Okhonko, Michael Schlichtkrull, Sonal Gupta, Yashar Mehdad, and Wen-tau Yih. 2020a. NeurIPS 2020 EfficientQA Competition: Systems, Analyses and Lessons Learned. arXiv:2101.00133.
Sewon Min, Danqi Chen, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2019. A discrete hard EM approach for weakly supervised question answering. In EMNLP/IJCNLP (1), pages 2851–2864. Association for Computational Linguistics.
Sewon Min, Julian Michael, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2020b. AmbigQA: Answering Ambiguous Open-domain Questions. arXiv:2004.10645.
Rodrigo Nogueira, Wei Yang, Jimmy Lin, and Kyunghyun Cho. 2019. Document Expansion by Query Prediction. arXiv:1904.08375.
Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A Fast, Extensible Toolkit for Sequence Modeling. arXiv:1904.01038.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. arXiv:1912.01703.
Fabio Petroni, Patrick Lewis, Aleksandra Piktus, Tim Rocktäschel, Yuxiang Wu, Alexander H. Miller, and Sebastian Riedel. 2020a. How Context Affects Language Models' Factual Predictions. In Automated Knowledge Base Construction.
Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vassilis Plachouras, Tim Rocktäschel, and Sebastian Riedel. 2020b. KILT: a Benchmark for Knowledge Intensive Language Tasks. arXiv:2009.02252.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research, 21(140):1–67.
Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know What You Don't Know: Unanswerable Questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, Melbourne, Australia. Association for Computational Linguistics.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.
Adam Roberts, Colin Raffel, and Noam Shazeer. 2020. How Much Knowledge Can You Pack Into the Parameters of a Language Model? arXiv:2002.08910.
Pedro Rodriguez, Shi Feng, Mohit Iyyer, He He, and Jordan Boyd-Graber. 2019. Quizbowl: The Case for Incremental Question Answering. arXiv:1904.04792.
Minjoon Seo, Tom Kwiatkowski, Ankur Parikh, Ali Farhadi, and Hannaneh Hajishirzi. 2018. Phrase-Indexed Question Answering: A New Challenge for Scalable Document Comprehension.