Answering Any-hop Open-domain Questions with Iterative Document Reranking

Yuyu Zhang∗,1, Ping Nie∗,2, Arun Ramamurthy3, Le Song1
1 Georgia Institute of Technology   2 Peking University   3 Siemens Corporate Technology
yuyu@gatech.edu, ping.nie@pku.edu.cn, arun.ramamurthy@siemens.com, lsong@cc.gatech.edu

∗ Both authors contributed equally to this work.

Abstract

Existing approaches for open-domain question answering (QA) are typically designed for questions that require either single-hop or multi-hop reasoning, making strong assumptions about the complexity of the questions to be answered. Also, multi-step document retrieval often incurs a higher number of relevant but non-supporting documents, which dampens the downstream noise-sensitive reader module for answer extraction. To address these challenges, we propose a unified QA framework to answer any-hop open-domain questions, which iteratively retrieves, reranks and filters documents, and adaptively determines when to stop the retrieval process. To improve the retrieval accuracy, we propose a graph-based reranking model that performs multi-document interaction as the core of our iterative reranking framework. Our method consistently achieves performance comparable to or better than the state of the art on both single-hop and multi-hop open-domain QA datasets, including Natural Questions Open, SQuAD Open, and HotpotQA.

1 Introduction

Open-domain question answering (QA) requires a system to answer factoid questions using a large text corpus (e.g., Wikipedia or the Web) without any pre-defined knowledge schema. Most state-of-the-art approaches for open-domain QA follow the retrieve-and-read pipeline initiated by Chen et al. (2017), using a retriever module to retrieve relevant documents and then a reader module to extract the answer from the retrieved documents. These approaches achieve prominent results on single-hop QA datasets such as SQuAD (Rajpurkar et al., 2016), whose questions can be answered using a single supporting document. However, they are inherently limited to answering simple questions and are not able to handle multi-hop questions, which require the system to retrieve and reason over evidence scattered among multiple documents. In the task of open-domain multi-hop QA (Yang et al., 2018), the documents with the answer can have little lexical overlap with the question and thus are not directly retrievable. Take the question in Figure 1 as an example: the last paragraph contains the correct answer but cannot be directly retrieved using TF-IDF. In this example, the single-hop TF-IDF retriever is not able to retrieve the last supporting paragraph since it has no lexical overlap with the question, but this paragraph contains the answer and plays a critical role in the reasoning chain.

Recent studies on multi-hop QA attempt to perform iterative retrievals to improve the answer recall of the retrieved documents. However, several challenges remain unsolved by existing multi-hop QA methods: 1) Iterative retrieval rapidly increases the total number of retrieved documents and introduces much noise to the downstream reader module for answer extraction. The reader module is typically noise-sensitive and works poorly when taking noisy documents as input or when missing critical supporting documents with the answer (Nie et al., 2019). This requires the QA system to reduce the relevant but non-supporting documents fed into the reader module. However, to answer open-domain multi-hop questions, it is necessary to iteratively retrieve documents to increase the overall recall of supporting documents. This dilemma poses a challenge for the retrieval phase of open-domain QA systems. 2) Existing multi-hop QA methods such as MUPPET (Feldman and El-Yaniv, 2019) and Multi-step Reasoner (Das et al., 2019a) perform a fixed number of retrieval steps, which makes strong assumptions about the complexity of open-domain questions. In real-world scenarios, open-domain questions may require different numbers of reasoning steps.
Figure 1: An example open-domain multi-hop question from the HotpotQA dev set ("In what year was the actress who starred in 'Streak' with Rumer Willis born?"), where the question has only partial clues to retrieve supporting documents. The first and fourth paragraphs are gold supporting documents, and the remaining two paragraphs are relevant but non-supporting documents. ABR stands for ALBERT-base reranker, which serves as a reference for the retrieval performance of existing multi-hop QA methods that independently consider the relevance of each document to the question. The ✓ and ✗ symbols mark whether the retriever correctly identifies each document as a supporting / non-supporting one. Below each symbol, we annotate the output of the corresponding retriever for the document, where "positive" ("negative") means that the retriever classifies it as a supporting (non-supporting) document.

3) The relevance of each retrieved document to the question is considered independently. In Figure 1, ABR stands for an ALBERT-base reranker, which serves as a reference for the retrieval performance of existing multi-hop QA methods that independently consider the relevance of each document to the question. Without considering multiple retrieved documents as a whole, these methods can easily be biased by the lexical overlap between each document and the question, and incorrectly classify non-supporting documents as supporting evidence (such as the middle two non-supporting paragraphs in Figure 1, which have decent lexical overlap with the question) and vice versa (such as the bottom paragraph in Figure 1, which has no lexical overlap with the question but is a critical supporting document that contains the answer).

To address the challenges above, we introduce a unified QA framework for answering any-hop open-domain questions, named Iterative Document Reranking (IDR). Our framework learns to iteratively retrieve documents with an updated question, rerank and filter documents, and adaptively determine when to stop the retrieval process. In this way, our method can significantly reduce the noise introduced by multi-round retrievals and handle open-domain questions that require different numbers of reasoning steps. To avoid the bias of lexical overlap in identifying supporting documents, our framework considers the question and retrieved documents as a whole and models the multi-document interactions to improve the accuracy of classifying supporting documents.

As illustrated in Figure 2, our method constructs a document graph linked by shared entities to propagate information using a graph attention network (GAT). By leveraging the multi-document information, our reranking model has more knowledge to differentiate supporting documents from irrelevant documents. After the initial retrieval, our method updates the question at every retrieval step with a text span extracted from the retrieved documents, and then uses the updated question as a query to retrieve complementary documents, which are added to the document graph for a new round of interaction. The reranking model is reused to score the documents again and filter out the most irrelevant ones. The maintained high-quality shortlist of remaining documents is then fed into the reader module to determine whether the answer exists in them. If so, the retrieval process ends and the QA system delivers the answer span extracted by the reader module as the predicted answer. Otherwise, the retrieval process continues to the next hop.

Our contributions are summarized as follows:

• Noise control for iterative retrieval: We propose a novel QA method to iteratively retrieve, rerank and filter documents, and adaptively determine when to stop the retrieval process. Our method maintains a high-quality shortlist of remaining documents, which significantly reduces the noise introduced to the downstream reader module for answer extraction. Thus, the downstream reader module can extract the answer span with higher accuracy.

• Unified framework for any-hop open-domain QA: We propose a unified framework that does not require pre-determining the complexity of input questions. Different from existing QA methods that are specifically designed for either single-hop or fixed-hop questions, our method can adaptively determine the termination of retrieval and answer any-hop open-domain questions.
• Multi-document interaction: We construct an entity-linked document graph and employ a graph attention network for multi-document interaction, which boosts the reranking performance. To the best of our knowledge, we are the first to propose a graph-based document reranking method for open-domain multi-hop QA.

2 Overview

2.1 Problem Definition

Given a factoid question, the task of open-domain question answering (QA) is to answer it using a large corpus, which can have millions of documents (e.g., Wikipedia) or even billions (e.g., the Web). Let the corpus C = {d_1, d_2, ..., d_{|C|}} consist of |C| documents as the basic retrieval units.[1] Each document d_i can be viewed as a sequence of tokens t_1^{(i)}, t_2^{(i)}, ..., t_{|d_i|}^{(i)}. Formally, given a question q, the task is to find a text span t_s^{(j)}, t_{s+1}^{(j)}, ..., t_e^{(j)} from one of the documents d_j that can answer the question.[2] For open-domain multi-hop QA, the final documents with the answer are typically multiple hops away from the question, i.e., the system is required to find seed documents and subsequent supporting documents, in the order of a chain or directed graph, to locate the final documents. The retrieved documents are usually connected via shared entities or semantic similarities, and the formed chain or directed graph of documents can be viewed as the reasoning process for answering the question.

[1] We use the natural paragraphs as the basic retrieval units.
[2] In this work, we focus on the extractive or span-based QA setting, but the problem definition and our proposed method can be generalized to other QA settings as well.

Note that the task of open-domain multi-hop QA described above is quite different from the few-document setting of multi-hop QA (Qi et al., 2019), where the QA system is provided with a tiny set of documents that consists of all the gold supporting documents together with several irrelevant "distractor" documents. The few-document setting is designed to test the system's capability of multi-hop reasoning given all of the gold supporting documents, but this is far from realistic. A real-world open-domain QA system has to locate the necessary supporting documents from a large corpus on its own, which is especially challenging for multi-hop questions since the indirect supporting documents are not easily retrievable given the question itself.

The nature of multi-hop questions poses a significant challenge for retrieving supporting documents, which is crucial to the downstream QA performance. To address this challenge, we argue that it is necessary to iteratively retrieve, rerank and filter documents, so that we can maintain a high-quality shortlist of documents. To this end, we propose the Iterative Document Reranking (IDR) method, which is introduced in the next section.

2.2 System Overview

We illustrate the overview of our IDRQA system in Figure 2. The IDRQA system first uses a given question as the query to retrieve the top |D| documents using TF-IDF. To construct the document graph, IDR extracts the entities from the question and the retrieved documents using an off-the-shelf Named Entity Recognition (NER) system and connects two documents if they have shared entities. The graph-based reranking model takes the document graph as input to score each document, filter out the lowest-scoring documents, and adaptively determine whether to continue the retrieval process. At every subsequent retrieval step, IDR updates the question with an extracted text span from the retrieved documents as a new query to retrieve more documents. Once the retrieval is done, the final highest-scoring documents are concatenated and fed into the downstream reader module (Devlin et al., 2019; Lan et al., 2020) for answer extraction.

To describe the pipeline of our IDRQA system more precisely, we provide a concise algorithm that summarizes the inference procedure of the Iterative Document Reranking (IDR) phase, as shown in Algorithm 1. We maintain a set of retrieved documents D. At each retrieval step, we retrieve the top N documents according to the question q and append all the newly retrieved documents to the current set of retrieved documents. Then we use the graph-based reranking model M to rerank those documents, keep the top K of them, and filter out the rest. This shortlist of retrieved documents, together with the question q, is then fed into the downstream reader module R for answer extraction. Note that the reader module has the option to predict that there is no answer to be extracted from the provided documents. If the reader module does predict no answer, we use the question updater model U to update the question with an extracted span and move to the next hop of retrieval. The while loop continues until a valid answer is extracted by the reader module or the maximum number of retrieval hops H is reached.
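To make the entity-linked document graph construction described above concrete, here is a minimal sketch under our own assumptions; it is not the authors' released code, and spaCy is used only as one example of an off-the-shelf NER system.

```python
# Illustrative sketch: build a document graph where two paragraphs are
# connected whenever they share a named entity (see Section 2.2).
import itertools
import spacy

nlp = spacy.load("en_core_web_sm")  # any off-the-shelf NER system could be substituted


def extract_entities(text):
    """Return the set of (lower-cased) entity strings found in `text`."""
    return {ent.text.lower() for ent in nlp(text).ents}


def build_document_graph(question, documents):
    """Return (graph, entity_sets, question_entities).

    graph maps a document index to the set of neighbor indices; an edge is
    added whenever two documents share at least one entity string.
    """
    entity_sets = [extract_entities(d) for d in documents]
    question_entities = extract_entities(question)
    graph = {i: set() for i in range(len(documents))}
    for i, j in itertools.combinations(range(len(documents)), 2):
        if entity_sets[i] & entity_sets[j]:  # shared entity -> undirected edge
            graph[i].add(j)
            graph[j].add(i)
    return graph, entity_sets, question_entities
```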

Figure 2: An overview of the IDRQA system, which consists of an Iterative Document Reranking (IDR) phase
and a question answering phase. Given an open-domain question, IDR iteratively retrieves, reranks and filters
documents, and adaptively determines when to stop the retrieval process. After the initial retrieval, IDR updates
the question with an extracted text span as a new query to retrieve more documents at every iteration. Once
the retrieval is done, the final highest-scoring documents are fed into the downstream reader module for answer
extraction.

3 IDRQA System

In this section, we describe the models in our IDRQA system, including the graph-based reranking model, the question updater, and the downstream reader module.

3.1 Graph-based Reranking Model

The graph-based reranking model (Figure 3) is designed to precisely identify the supporting documents in the document graph. We present the components of this reranking model as follows.

Contextual Encoding. Given a question q and |D| documents retrieved by TF-IDF, we concatenate the tokens of the question and each document and feed them into the pre-trained language model as:

I_{q,d_k} = [CLS] q_1 ... q_{|q|} [SEP] t_1^{(k)} ... t_{|d_k|}^{(k)} [SEP],

where |q| and |d_k| denote the number of tokens in the question q and the document d_k, respectively. [CLS] and [SEP] are special tokens used in pre-trained language models such as BERT (Devlin et al., 2019) and ALBERT (Lan et al., 2020). We independently encode each document d_k along with the question q to obtain the contextual representation vector v_{q,d_k} ∈ R^{L×h}, where L is the maximum length of the input tokens I and h is the embedding size. For efficient batch computation, we pad or truncate the input tokens to the length L. We then concatenate all documents' contextual representation vectors as v ∈ R^{L|D|×h}.

Graph Attention. After we obtain the question-dependent encoding of each document, we employ a Graph Attention Network (GAT; Veličković et al., 2018) to propagate information on the document graph, where two documents are connected if they have shared entities. More specifically, for each shared entity E_i in the document graph, we perform pooling over its token embeddings from v to produce the entity embedding e_i = Pooling(t_1^{(i)}, t_2^{(i)}, ..., t_{|E_i|}^{(i)}), where t_j^{(i)} is the embedding of the j-th token in E_i and |E_i| is the number of tokens in E_i. We use both mean- and max-pooling, so e_i ∈ R^{2h}. Inspired by Qiu et al. (2019), we apply a dynamic soft mask on the entities, serving as an information "gatekeeper" that assigns more weight to entities pertaining to the question. The soft mask applied to each entity E_i is computed as

m_i = σ( q V e_i / √(2h) ),    (1)
g_i = m_i e_i,    (2)

where q ∈ R^{2h} is the concatenated mean- and max-pooling of the question token embeddings, V ∈ R^{2h×2h} is a linear projection matrix, σ(·) is the sigmoid function, and g_i is the masked entity embedding. We then use GAT to disseminate information over the entity-linked document graph.
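As an illustration of Eqs. (1)–(2), the following PyTorch-style sketch computes the mean/max-pooled question vector q, the soft mask m_i, and the masked entity embeddings g_i. The module name, shapes, and the use of a bias-free linear layer for V are our assumptions for exposition, not the paper's implementation; the masked embeddings would then be fed through standard GAT layers over the entity-linked graph.

```python
# Illustrative PyTorch sketch of the entity soft mask (Eqs. 1-2); not the authors' code.
import math
import torch
import torch.nn as nn


class EntitySoftMask(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        # V in Eq. (1): a 2h x 2h projection, realized here as a bias-free linear layer.
        self.proj = nn.Linear(2 * hidden_size, 2 * hidden_size, bias=False)
        self.scale = math.sqrt(2 * hidden_size)

    @staticmethod
    def mean_max_pool(token_embeddings: torch.Tensor) -> torch.Tensor:
        # (num_tokens, h) -> concatenated mean- and max-pooling: (2h,)
        return torch.cat([token_embeddings.mean(dim=0),
                          token_embeddings.max(dim=0).values])

    def forward(self, question_tokens: torch.Tensor, entity_token_spans):
        # question_tokens: (|q|, h); entity_token_spans: list of (|E_i|, h) tensors
        q = self.mean_max_pool(question_tokens)                                   # q in R^{2h}
        e = torch.stack([self.mean_max_pool(s) for s in entity_token_spans])      # (num_entities, 2h)
        m = torch.sigmoid(self.proj(e) @ q / self.scale)                          # Eq. (1): one scalar per entity
        g = m.unsqueeze(-1) * e                                                   # Eq. (2): masked entity embeddings
        return g, m
```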
Algorithm 1: Iterative Document Reranking (inference)
Input: a textual question q; the maximum number of retrieval hops H; the number of retrieved documents at each hop N; the number of remaining documents after each reranking step K; the graph-based reranking model M; the reader module R; the question updater model U
Output: predicted answer a
1:  h ← 0
2:  D ← Ø
3:  while h < H do
4:      D ← D ∪ RetrieveTopN(q)        // retrieve the top N documents for the current question q
5:      D ← TopK(M(q, D))              // rerank with the graph-based model and keep the top K documents
6:      a ← R(q, D)                    // the reader extracts an answer span or predicts no answer
7:      if a is not no-answer then return a
8:      q ← U(q, D)                    // the question updater appends an extracted clue span to q
9:      h ← h + 1
10: return a                           // H hops exhausted: return the reader's latest prediction
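A Python rendering of Algorithm 1 may help make the control flow explicit. The callables retrieve, rerank, read and update_question below are placeholders of ours for the TF-IDF retriever, the graph-based reranking model M, the reader R and the question updater U; the default values are illustrative and only loosely echo the settings reported in Section 4 (e.g., K = 4, H = 3).

```python
# Hypothetical driver for Algorithm 1; the four callables stand in for the
# TF-IDF retriever, reranking model M, reader R, and question updater U.
def iterative_document_reranking(question, retrieve, rerank, read, update_question,
                                 max_hops=3, docs_per_hop=8, keep_top_k=4):
    docs = []
    answer = None
    for hop in range(max_hops):
        docs += retrieve(question, docs_per_hop)     # retrieve top-N documents for the current question
        docs = rerank(question, docs)[:keep_top_k]   # graph-based reranking: keep top-K, filter the rest
        answer = read(question, docs)                # reader extracts a span, or returns None (no answer)
        if answer is not None:
            break
        question = update_question(question, docs)   # append an extracted clue span to the question
    return answer
```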

Figure 3: As the core of our IDR framework, the Graph-based Reranking Model first encodes the question and each retrieved document with a pre-trained language model to generate contextual representations, and uses the shared entities to propagate information with a Graph Attention Network (GAT). After the entity-entity interaction across multiple documents, the updated entity representations and the original contextual encodings are fed into the fusion layer for further interaction. Finally, the reranking model takes pooled document representations to score each document and filter out low-scoring documents. The maintained high-quality shortlist of remaining documents is then fed into the Reader Module to determine whether the answer exists in them. If so, the retrieval process ends and the QA system delivers the answer span extracted by the Reader Module as the predicted answer. Otherwise, the retrieval process continues to the next hop.

3.2 Question Updater

Following GoldEn Retriever (Qi et al., 2019), we also leverage a QA model to extract the clue span by predicting its start and end tokens. Formally, given the question q and the reranked top K documents D_K = {d_1, d_2, ..., d_K}, the Question Updater generates the clue span C and concatenates C with the original question q as

C = U(q, D_K),    (8)
q = concatenate(q, [SEP], C),    (9)

where [SEP] is a special separator token. To train the Question Updater, we construct each training example as a triple <q, D_K, C>, where q is the question and D_K = {d_1, d_2, ..., d_K} is the set of top K retrieved documents reranked by our graph-based reranking model.
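Concretely, the update of Eqs. (8)–(9) amounts to running a span-extraction model over the reranked documents and appending the predicted clue span to the question. In the sketch below, extract_span stands for an arbitrary extractive QA model and is an assumption of ours, not the paper's API.

```python
# Illustrative question update (Eqs. 8-9): extract a clue span C from the
# reranked documents and append it to the question with a [SEP] separator.
def update_question(question: str, top_documents: list[str], extract_span) -> str:
    context = " ".join(top_documents)
    clue_span = extract_span(question, context)   # C = U(q, D_K), Eq. (8); extract_span is a placeholder
    return f"{question} [SEP] {clue_span}"        # q = concatenate(q, [SEP], C), Eq. (9)
```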

3.3 Reader Module

In this work, we mainly focus on the retrieval phase, and use a standard span-based reader module as in BERT (Devlin et al., 2019) and ALBERT (Lan et al., 2020). We concatenate the tokens of the question q and the final top K reranked documents to feed into the reader module. At inference time, the reader finds the best candidate answer span by

arg max_{i,j: i≤j} P_i^start P_j^end,    (10)

where P_i^start and P_j^end denote the probabilities that the i-th and j-th tokens of the concatenated text are the start and end positions of the answer span, respectively. During inference, there is no guarantee that the answer span exists in the reader's input text. To handle the no-answer cases, the reader predicts P_na as the probability of having no answer span, and compares P_na with max_{i,j: i≤j} P_i^start P_j^end to choose between a special no-answer prediction and the best candidate answer span.
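The span selection of Eq. (10), together with the no-answer comparison, can be sketched as follows. The tensor names, the span-length cap, and the way P_na is passed in are our assumptions, since the paper does not spell out the implementation.

```python
import torch


def select_answer(start_probs: torch.Tensor, end_probs: torch.Tensor,
                  no_answer_prob: float, max_span_len: int = 30):
    """Pick argmax over i <= j of P_i^start * P_j^end (Eq. 10) and compare with P_na.

    start_probs, end_probs: 1-D tensors over the concatenated input tokens.
    Returns (start, end) token indices, or None for the no-answer prediction.
    """
    best_score, best_span = 0.0, None
    for i in range(len(start_probs)):
        # restrict j to i <= j < i + max_span_len (the length cap is our own addition)
        for j in range(i, min(i + max_span_len, len(end_probs))):
            score = (start_probs[i] * end_probs[j]).item()
            if score > best_score:
                best_score, best_span = score, (i, j)
    return None if no_answer_prob >= best_score else best_span
```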
4 Experiments

Method | Answer EM | Answer F1 | Supporting Fact EM | Supporting Fact F1 | Paragraph EM
Cognitive Graph (Ding et al., 2019) | 37.6 | 49.4 | 23.1 | 58.5 | 57.8
DecompRC (Min et al., 2019b) | – | 43.3 | – | – | –
MUPPET (Feldman and El-Yaniv, 2019) | 31.1 | 40.4 | 17.0 | 47.7 |
GoldEn Retriever (Qi et al., 2019) | – | 49.7 | – | – | –
DrKIT (Dhingra et al., 2020) | 35.7 | 46.6 | – | – | –
Semantic Retrieval MRS (Nie et al., 2019) | 46.5 | 58.8 | 39.9 | 71.5 | 63.9
Transformer-XH (Zhao et al., 2020) | 50.2 | 62.4 | 42.2 | 71.6 | –
Recurrent Retriever (Asai et al., 2020) | 60.5 | 73.3 | 49.3 | 76.1 | 75.7
IDRQA (ours) | 62.9 | 75.9 | 51.3 | 79.1 | 79.8

Table 1: Performance on the HotpotQA full wiki dev set.

Method | Answer EM | Answer F1 | Supporting Fact EM | Supporting Fact F1 | Joint EM | Joint F1
DecompRC (Min et al., 2019b) | 30.0 | 40.6 | – | – | – | –
QFE (Nishida et al., 2019) | 28.6 | 38.0 | 14.2 | 44.3 | 8.6 | 23.1
Cognitive Graph (Ding et al., 2019) | 37.1 | 48.8 | 22.8 | 57.6 | 12.4 | 34.9
MUPPET (Feldman and El-Yaniv, 2019) | 30.6 | 40.2 | 16.6 | 47.3 | 10.8 | 27.0
GoldEn Retriever (Qi et al., 2019) | 37.9 | 48.5 | 30.6 | 64.2 | 18.0 | 39.1
DrKIT (Dhingra et al., 2020) | 42.1 | 51.7 | 37.0 | 59.8 | 24.6 | 42.8
Semantic Retrieval MRS (Nie et al., 2019) | 45.3 | 57.3 | 38.6 | 70.8 | 25.1 | 47.6
Transformer-XH (Zhao et al., 2020) | 51.6 | 64.0 | 40.9 | 71.4 | 26.1 | 51.2
Recurrent Retriever (Asai et al., 2020) | 60.0 | 73.0 | 49.0 | 76.4 | 35.3 | 61.1
Hierarchical Graph Network (Fang et al., 2020) | 59.7 | 71.4 | 51.0 | 77.4 | 37.9 | 62.2
IDRQA (ours) | 62.5 | 75.9 | 51.0 | 78.9 | 36.0 | 63.9

Table 2: Performance on the HotpotQA full wiki test set.

Experimental settings. We conduct all the experiments on a GPU-enabled (Nvidia RTX 6000) Linux machine powered by an Intel Xeon Gold 5125 CPU at 2.50 GHz with 384 GB of RAM.

For the graph-based reranking model, we set the maximum token sequence length L = 250, the number of retrieved documents |D| = 8, and the maximum number of entities |E| = 120. The graph attention module has T = 2 GAT layers. The maximum number of retrieval hops is H = 3. The top K = 4 reranked paragraphs are sent into the downstream reader module. We implement our system based on HuggingFace's Transformers (Wolf et al., 2019) using PyTorch (Paszke et al., 2019). Following the previous state-of-the-art method (Fang et al., 2020), we use ALBERT-xxlarge (Lan et al., 2020) as the pre-trained language model. We use AdamW (Wolf et al., 2019) as the optimizer and tune the initial learning rate between 1.5e−5 and 2.5e−5.

Datasets. We evaluate our method on three open-domain QA benchmark datasets: HotpotQA (Yang et al., 2018), Natural Questions Open (Lee et al., 2019), and SQuAD Open (Chen et al., 2017). On all three datasets, we focus on the full wiki open-domain QA setting, which requires the system to retrieve evidence paragraphs from the entire Wikipedia and extract the answer span from the retrieved paragraphs.

Following the train/dev/test splits of Natural Questions Open and SQuAD Open in previous works (Lee et al., 2019; Karpukhin et al., 2020), we use the original validation set as our test set and keep 10% of the training set as our dev set. Natural Questions Open and SQuAD Open consist of single-hop questions, while HotpotQA consists of 113K crowd-sourced multi-hop questions that require Wikipedia introduction paragraphs to answer. In the train and dev splits of HotpotQA, each question comes with two gold supporting paragraphs annotated by crowd workers, so we can evaluate the retrieval performance on the dev set of the HotpotQA dataset.
Method | NQ | SQuAD
DrQA (Chen et al., 2017) | – | 29.8
R³ (Wang et al., 2018a) | – | 29.1
Multi-step Reasoner (Das et al., 2019a) | – | 31.9
BERTserini (Yang et al., 2019) | – | 38.6
MUPPET (Feldman and El-Yaniv, 2019) | – | 39.3
Multi-passage BERT (Wang et al., 2019b) | – | 53.0
ORQA (Lee et al., 2019) | 33.3 | 20.2
Recurrent Retriever (Asai et al., 2020) | 31.7 | 56.5
Dense Passage Retriever (Karpukhin et al., 2020) | 41.5 | 36.7
IDRQA (ours) | 45.5 | 56.6

Table 3: Answer EM scores on the test sets of Natural Questions Open and SQuAD Open.

Evaluation metrics. We evaluate both the paragraph reranking performance and the overall QA performance. Following existing studies (Asai et al., 2020; Nie et al., 2019; Nishida et al., 2019), we evaluate the paragraph-level retrieval accuracy using the Paragraph Exact Match (EM) metric, which compares the top 2 paragraphs with the gold supporting paragraphs. For QA performance, we report the standard answer Exact Match (EM) and F1 scores, which measure the overlap between the gold answer and the extracted answer span.
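For reference, answer EM and F1 are the standard SQuAD-style metrics; a common implementation (ours, not taken from the paper) normalizes both strings and measures token overlap as follows.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    pred_tokens, gold_tokens = normalize(prediction).split(), normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```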
4.1 Overall Results

We evaluate our method on both single-hop and multi-hop open-domain QA datasets. Note that our method is hop-agnostic, i.e., we consistently use the same setting of H = 3 as the maximum number of retrieval hops for all three datasets.

For the multi-hop QA task, we report performance on both the dev and test sets of the HotpotQA dataset. In Table 1, we compare the performance of our system IDRQA with various existing methods on the HotpotQA full wiki dev set. Since the gold supporting paragraphs are only labeled on the train / dev splits of the HotpotQA dataset, we can only report the Paragraph EM metric on the dev set. Notably, IDRQA achieves over 4 points of improvement on Paragraph EM compared to the previous best retriever model (Asai et al., 2020), demonstrating the effectiveness of our graph-based iterative reranking method. In Table 2, we compare our system with various existing methods on the HotpotQA full wiki test set. IDRQA outperforms all published and previously unpublished methods on the HotpotQA dev set and the hidden test set on the official leaderboard (as of May 21, 2020). Our reranking model also facilitates the downstream reader module, which achieves new state-of-the-art QA performance in terms of Answer EM and F1 scores.

For the single-hop QA task, we evaluate our method on the Natural Questions (NQ) Open and SQuAD Open datasets. We summarize the performance in Table 3. For NQ Open, we follow the previous work DPR (Karpukhin et al., 2020) and use a dense retriever instead of TF-IDF for document retrieval. Our method again achieves state-of-the-art QA performance, which shows the robustness of our method across different open-domain QA datasets.

4.2 Detailed Analysis

Ablation study. To investigate the effectiveness of each module in IDRQA, we compare the performance of several variants of our system on the HotpotQA full wiki dev set. As shown in Table 4, once we disable the Iterative Reranking, Graph-based Reranking or Question Updater module, both the paragraph reranking and the QA performance drop significantly. Notably, there is an 11-point drop in Paragraph EM and a 12-point drop in Answer F1 when Graph-based Reranking is removed from IDRQA, which shows the importance of the graph-based reranking model in our system. To further study the impact of these modules, we decompose the QA performance into the question categories Bridge and Comparison. We find that the QA performance on the bridge questions drops much more significantly than on the comparison questions. This is because the comparison questions require comparing two entities mentioned in the question (Yang et al., 2018), so iterative retrieval and multi-hop reasoning may not be necessary. In contrast, to answer the bridge questions, which often have missing entities, our iterative reranking method is of crucial importance.
Figure 4: Case study of example questions with supporting paragraphs from the HotpotQA dev set. Bridge question "In what year was the actress who starred in 'Streak' with Rumer Willis born?": IDRQA answers 1986 (correct) while ABR answers 2009 (incorrect). Bridge question "What screenwriter with credits for 'Evolution' co-wrote a film starring Nicolas Cage and Téa Leoni?": both IDRQA and ABR answer David Weissman (correct). Comparison question "Who is older, Annie Morton or Terry Richardson?": both IDRQA and ABR answer Annie Morton (incorrect).

Ablation Setting | Bridge (79.9%) Answer EM / F1 | Comparison (20.1%) Answer EM / F1 | Full Dev (100%) Answer EM / F1 | Paragraph EM
IDRQA w/ ALBERT-base reranker | 58.1 / 73.2 | 71.4 / 77.9 | 60.7 / 74.2 | 73.8
w/o Graph-based Reranking | 47.4 / 59.5 | 70.1 / 76.4 | 52.1 / 62.9 | 62.6
w/o Iterative Reranking | 49.8 / 61.7 | 70.5 / 77.3 | 54.3 / 64.5 | 65.2
w/o Question Updater | 56.6 / 71.4 | 71.3 / 77.8 | 59.8 / 72.9 | 70.3

Table 4: Ablation study of our system in different settings on the HotpotQA full wiki dev set.

Impact of retrieval steps. IDRQA aggregates the document scores to check whether the collected evidence is enough to answer the question, and adaptively determines when to stop the retrieval process. We investigate the number of retrieval steps selected by IDRQA, and report its distribution with breakdown performance in Table 5. Over 60% of the questions are answered with 2-step retrieval. About 20% of the questions are answered with 1-step retrieval, which is close to the ratio of the comparison questions that may not need iterative retrieval. For questions on which IDRQA chooses to perform more than two retrieval steps, a significant drop in both reranking and QA performance is observed, showing that these questions are the hardest ones in HotpotQA.

Retrieval | % of Questions | Paragraph EM | Answer EM | Answer F1
1-step | 21.8 | 88.4 | 70.3 | 81.9
2-step | 63.2 | 76.9 | 62.0 | 75.9
3-step | 5.3 | 39.6 | 45.2 | 59.8
4-step (max) | 9.7 | 22.0 | 37.7 | 50.4

Table 5: Distribution of the selected retrieval steps on the HotpotQA dev set, as adaptively determined by IDRQA.

Case study and limitations. We showcase several example questions with answers from IDRQA and the baseline ALBERT-base reranker (ABR) in Figure 4. The first case is a hard bridge question where ABR extracts the wrong answer from a relevant but non-supporting paragraph, showing the advantage of iterative reranking in our system. The second question is correctly answered by both IDRQA and ABR, since it provides sufficient clues to retrieve both paragraphs. The final case is a comparison question that requires numerical reasoning, which is answered incorrectly by both IDRQA and ABR. This shows a limitation of our system, and we plan to explore the combination of multi-hop and numerical reasoning in future work.

5 Related Work

Open-domain QA. For text-based QA, the QA system is provided with semi-structured or unstructured text corpora as the knowledge source to find the answer. Text-based QA dates back to the QA track evaluations of the Text REtrieval Conference (TREC) (Voorhees and Tice, 2000). Traditional approaches for text-based QA typically use a pipeline of question parsing, answer type classification, document retrieval, answer candidate generation, and answer reranking, such as the famous IBM Watson system that beat human players on the Jeopardy! quiz show (Ferrucci et al., 2010). However, such pipeline-based QA systems require heavy engineering effort and only work on specific domains. The open-domain QA task was originally proposed and formalized by Chen et al. (2017), who build a simple pipeline with a TF-IDF retriever module and an RNN-based reader module to produce answers from the top 5 retrieved documents. Different from the machine reading comprehension (MRC) task, which provides a single paragraph or document as the evidence (Rajpurkar et al., 2016), open-domain QA is more challenging since the retrieved documents are inevitably noisy. Recent works on open-domain QA largely follow the retrieve-and-read approach, and have made prominent improvements on both the retriever module (Lee et al., 2019; Wang et al., 2018a; Nie et al., 2019) and the reader module (Wang et al., 2018b, 2019a; Ni et al., 2019). These approaches simply perform one-shot document retrieval to handle single-hop questions. However, for complicated questions that require multi-hop reasoning, these approaches are not applicable since they fail to collect the necessary evidence scattered among multiple documents.

Open-domain multi-hop QA. HotpotQA (Yang et al., 2018) is crowd-sourced over Wikipedia and is the largest free-form dataset for open-domain multi-hop QA to date. Recently, a variety of approaches have been proposed to address the multi-hop challenge. DecompRC (Min et al., 2019b) decomposes a multi-hop question into simpler sub-questions and leverages single-hop QA models to answer it, but still uses one-shot TF-IDF retrieval to collect relevant documents. BERT Reranker (Das et al., 2019b), DrKIT (Dhingra et al., 2020) and Transformer-XH (Zhao et al., 2020) employ the entities in the question and the retrieved documents to link additional documents, which expands the one-shot retrieval results and improves the evidence coverage. GoldEn Retriever (Qi et al., 2019) adopts iterative TF-IDF retrieval by generating a new query for each retrieval step. Graph Retriever (Min et al., 2019a) employs entity linking to iteratively retrieve and construct a graph of documents. These iterative retrieval methods mitigate the recall problem of one-shot retrieval; however, without an iterative reranking and filtering process, the expanded documents inevitably introduce noise to the downstream QA model. Multi-step Reasoner (Das et al., 2019a) and MUPPET (Feldman and El-Yaniv, 2019) read retrieved documents to reformulate the query in latent space for iterative retrieval. However, these embedding-based retrieval methods have difficulty capturing the lexical information in entities due to the compression of information into the embedding space. Moreover, these methods perform a fixed number of retrieval steps and are not able to handle questions that require arbitrary hops of reasoning. Recurrent Retriever (Asai et al., 2020) supports adaptive retrieval steps, but can only select one document at each step and has no interactions among documents outside of the retrieval chain. In contrast, our method leverages a document graph instead of a chain to propagate information, reranks and filters documents at each hop of retrieval, and terminates the retrieval process according to the number of positive retrieved documents in the graph.

Graph Neural Networks for QA. Encouraged by the success of convolutional neural networks (CNNs) and recurrent neural networks (RNNs), graph neural networks (GNNs) were proposed to generalize both of them and organize the connections of neurons according to the structure of the input data (Battaglia et al., 2018; Wu et al., 2020). The main idea is to generate a node's representation by aggregating its own features and its neighbors' features. Similar to CNNs, multiple graph convolutional layers can be stacked in GNNs to produce high-level node representations. Many popular variants of GNNs have been proposed, such as GraphSage (Hamilton et al., 2017), Graph Convolutional Networks (GCN) (Kipf and Welling, 2017) and Graph Attention Networks (GAT) (Veličković et al., 2018). GNNs have achieved great success on graph-related tasks such as node classification, node regression and graph classification. Recently, GNNs have been shown to be effective on knowledge-based QA tasks by reasoning over graphs (De Cao et al., 2019; Sorokin and Gurevych, 2018; Zhang et al., 2018). Recent studies on text-based QA also leverage GNNs for multi-hop reasoning. HDE-Graph (Tu et al., 2019) constructs a graph with entity and document nodes to enable rich information interaction. CogQA (Ding et al., 2019) extracts candidate answer spans and next-hop entities to build a cognitive graph for reasoning. HGN (Li et al., 2019) employs a hierarchical graph that consists of paragraph, sentence and entity nodes for reasoning at different granularities. DFGN (Qiu et al., 2019) introduces a fusion layer on top of the entity graph with a mask prediction module. Graph Retriever (Min et al., 2019a) uses a graph convolutional network to fuse information on the entity-linked graph of passages. These GNN-based methods serve as the reader module to extract answers from a few documents. In contrast, our work employs a graph attention network, a popular GNN variant, in the retriever module. To the best of our knowledge, we are the first to propose a GNN-based document reranking method for open-domain multi-hop QA.

6 Conclusion

In this paper, we present a unified QA framework that can answer any-hop open-domain questions. Our framework iteratively retrieves, reranks and filters documents with a graph-based reranking model, and adaptively decides how many steps of retrieval and reranking are needed for a multi-hop question. Our method consistently achieves promising performance on both single-hop and multi-hop open-domain QA datasets, including Natural Questions Open, SQuAD Open, and HotpotQA.
References

Akari Asai, Kazuma Hashimoto, Hannaneh Hajishirzi, Richard Socher, and Caiming Xiong. 2020. Learning to retrieve reasoning paths over Wikipedia graph for question answering. In International Conference on Learning Representations.
Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. 2018. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261.
Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879, Vancouver, Canada. Association for Computational Linguistics.
Rajarshi Das, Shehzaad Dhuliawala, Manzil Zaheer, and Andrew McCallum. 2019a. Multi-step retriever-reader interaction for scalable open-domain question answering. In International Conference on Learning Representations.
Rajarshi Das, Ameya Godbole, Dilip Kavarthapu, Zhiyu Gong, Abhishek Singhal, Mo Yu, Xiaoxiao Guo, Tian Gao, Hamed Zamani, Manzil Zaheer, and Andrew McCallum. 2019b. Multi-step entity-centric information retrieval for multi-hop question answering. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pages 113–118, Hong Kong, China. Association for Computational Linguistics.
Nicola De Cao, Wilker Aziz, and Ivan Titov. 2019. Question answering by reasoning across documents with graph convolutional networks. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2306–2317, Minneapolis, Minnesota. Association for Computational Linguistics.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Bhuwan Dhingra, Manzil Zaheer, Vidhisha Balachandran, Graham Neubig, Ruslan Salakhutdinov, and William W. Cohen. 2020. Differentiable reasoning over a virtual knowledge base. In International Conference on Learning Representations.
Ming Ding, Chang Zhou, Qibin Chen, Hongxia Yang, and Jie Tang. 2019. Cognitive graph for multi-hop reading comprehension at scale. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2694–2703, Florence, Italy. Association for Computational Linguistics.
Yuwei Fang, Siqi Sun, Zhe Gan, Rohit Pillai, Shuohang Wang, and Jingjing Liu. 2020. Hierarchical graph network for multi-hop question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8823–8838, Online. Association for Computational Linguistics.
Yair Feldman and Ran El-Yaniv. 2019. Multi-hop paragraph retrieval for open-domain question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2296–2309, Florence, Italy. Association for Computational Linguistics.
David Ferrucci, Eric Brown, Jennifer Chu-Carroll, James Fan, David Gondek, Aditya A. Kalyanpur, Adam Lally, J. William Murdock, Eric Nyberg, John Prager, et al. 2010. Building Watson: An overview of the DeepQA project. AI Magazine, 31(3):59–79.
Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034.
Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906.
Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In Proceedings of the 5th International Conference on Learning Representations, ICLR '17.
Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations.
Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2019. Latent retrieval for weakly supervised open domain question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6086–6096, Florence, Italy. Association for Computational Linguistics.
Chong Li, Kunyang Jia, Dan Shen, C. J. Richard Shi, and Hongxia Yang. 2019. Hierarchical representation learning for bipartite graphs. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 2873–2879. International Joint Conferences on Artificial Intelligence Organization.
Sewon Min, Danqi Chen, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2019a. Knowledge guided text retrieval and reading for open domain QA. arXiv preprint arXiv:1911.03868.
Sewon Min, Victor Zhong, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2019b. Multi-hop reading comprehension through question decomposition and rescoring. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6097–6109, Florence, Italy. Association for Computational Linguistics.
Jianmo Ni, Chenguang Zhu, Weizhu Chen, and Julian McAuley. 2019. Learning to attend on essential terms: An enhanced retriever-reader model for open-domain question answering. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 335–344, Minneapolis, Minnesota. Association for Computational Linguistics.
Yixin Nie, Songhe Wang, and Mohit Bansal. 2019. Revealing the importance of semantic retrieval for machine reading at scale. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2553–2566, Hong Kong, China. Association for Computational Linguistics.
Kosuke Nishida, Kyosuke Nishida, Masaaki Nagata, Atsushi Otsuka, Itsumi Saito, Hisako Asano, and Junji Tomita. 2019. Answering while summarizing: Multi-task learning for multi-hop QA with evidence extraction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2335–2345, Florence, Italy. Association for Computational Linguistics.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc.
Peng Qi, Xiaowen Lin, Leo Mehr, Zijian Wang, and Christopher D. Manning. 2019. Answering complex open-domain questions through iterative query generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2590–2602, Hong Kong, China. Association for Computational Linguistics.
Lin Qiu, Yunxuan Xiao, Yanru Qu, Hao Zhou, Lei Li, Weinan Zhang, and Yong Yu. 2019. Dynamically fused graph network for multi-hop reasoning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6140–6150, Florence, Italy. Association for Computational Linguistics.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.
Daniil Sorokin and Iryna Gurevych. 2018. Modeling semantics with gated graph neural networks for knowledge base question answering. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3306–3317, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
Ming Tu, Guangtao Wang, Jing Huang, Yun Tang, Xiaodong He, and Bowen Zhou. 2019. Multi-hop reading comprehension across multiple documents by reasoning over heterogeneous graphs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2704–2713, Florence, Italy. Association for Computational Linguistics.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph attention networks. In International Conference on Learning Representations.
Ellen M. Voorhees and Dawn M. Tice. 2000. Building a question answering test collection. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 200–207.
Bingning Wang, Ting Yao, Qi Zhang, Jingfang Xu, Zhixing Tian, Kang Liu, and Jun Zhao. 2019a. Document gated reader for open-domain question answering. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '19, pages 85–94, New York, NY, USA. Association for Computing Machinery.
Shuohang Wang, Mo Yu, Xiaoxiao Guo, Zhiguo Wang, Tim Klinger, Wei Zhang, Shiyu Chang, Gerry Tesauro, Bowen Zhou, and Jing Jiang. 2018a. R³: Reinforced ranker-reader for open-domain question answering. In Thirty-Second AAAI Conference on Artificial Intelligence.
Shuohang Wang, Mo Yu, Jing Jiang, Wei Zhang, Xiaoxiao Guo, Shiyu Chang, Zhiguo Wang, Tim Klinger, Gerald Tesauro, and Murray Campbell. 2018b. Evidence aggregation for answer re-ranking in open-domain question answering. In International Conference on Learning Representations.
Zhiguo Wang, Yue Zhang, Mo Yu, Wei Zhang, Lin Pan, Linfeng Song, Kun Xu, and Yousef El-Kurdi. 2019b. Multi-granular text encoding for self-explaining categorization. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 41–45, Florence, Italy. Association for Computational Linguistics.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.
Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S. Yu Philip. 2020. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems.
Wei Yang, Yuqing Xie, Aileen Lin, Xingyu Li, Luchen Tan, Kun Xiong, Ming Li, and Jimmy Lin. 2019. End-to-end open-domain question answering with BERTserini. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 72–77, Minneapolis, Minnesota. Association for Computational Linguistics.
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels, Belgium. Association for Computational Linguistics.
Yuyu Zhang, Hanjun Dai, Zornitsa Kozareva, Alexander J. Smola, and Le Song. 2018. Variational reasoning for question answering with knowledge graph. In Thirty-Second AAAI Conference on Artificial Intelligence.
Chen Zhao, Chenyan Xiong, Corby Rosset, Xia Song, Paul Bennett, and Saurabh Tiwary. 2020. Transformer-XH: Multi-evidence reasoning with extra hop attention. In International Conference on Learning Representations.