Memory Augmented Sequential Paragraph Retrieval for Multi-hop Question Answering

Page created by Willie Hopkins
 
CONTINUE READING
Memory Augmented Sequential Paragraph Retrieval for Multi-hop Question Answering
Memory Augmented Sequential Paragraph Retrieval for Multi-hop
                                                                Question Answering
                                                       Nan Shao† , Yiming Cui‡† , Ting Liu‡ , Shijin Wang†§ , Guoping Hu†
                                                   †
                                                     State Key Laboratory of Cognitive Intelligence, iFLYTEK Research, China
                                                    ‡
                                                      Research Center for Social Computing and Information Retrieval (SCIR),
                                                                    Harbin Institute of Technology, Harbin, China
                                                                 §
                                                                   iFLYTEK AI Research (Hebei), Langfang, China
                                                           †§
                                                              {nanshao,ymcui,sjwang3,gphu}@iflytek.com
                                                                       ‡
                                                                         {ymcui,tliu}@ir.hit.edu.cn
                                                              Abstract                              Paragraph A: Rand Paul presidential campaign, 2016
                                                                                                    The 2016 presidential campaign of Rand Paul, the junior United
                                                                                                    States Senator from Kentucky, was announced on April 7, 2015 at
                                            Retrieving information from correlative para-           an event at the Galt House in Louisville, Kentucky. … …
                                            graphs or documents to answer open-domain               Paragraph B: Galt House
                                            multi-hop questions is very challenging. To             ……The Galt House is the city's only hotel on the Ohio River. ……
arXiv:2102.03741v1 [cs.CL] 7 Feb 2021

                                            deal with this challenge, most of the exist-            Q: The Ran Paul presidential campaign, 2016 event was held at a
                                            ing works consider paragraphs as nodes in a             hotel on what river?
                                            graph and propose graph-based methods to re-            A: Ohio River
                                            trieve them. However, in this paper, we point
                                            out the intrinsic defect of such methods. In-       Figure 1: An example of open-domain multi-hop ques-
                                            stead, we propose a new architecture that mod-      tion from HotpotQA. The model needs to retrieve evi-
                                            els paragraphs as sequential data and considers     dence paragraphs from entire Wikipedia and derive the
                                            multi-hop information retrieval as a kind of se-    answer.
                                            quence labeling task. Specifically, we design
                                            a rewritable external memory to model the
                                            dependency among paragraphs. Moreover, a            quires multi-hop information retrieval. An exam-
                                            threshold gate mechanism is proposed to elim-       ple from HotpotQA is illustrated in Figure 1. To
                                            inate the distraction of noise paragraphs. We
                                                                                                answer the question “The Rand Paul presidential
                                            evaluate our method on both full wiki and dis-
                                            tractor subtask of HotpotQA, a public textual       campaign, 2016 event was held at a hotel on what
                                            multi-hop QA dataset requiring multi-hop in-        river?”, the retrieval model needs to first identify
                                            formation retrieval. Experiments show that our      Paragraph A as an evidence paragraph according
                                            method achieves significant improvement over        to the similarity of the terms. The retrieval model
                                            the published state-of-the-art method in re-        learns that the event was held at a hotel named
                                            trieval and downstream QA task performance.        “Galt House”, which leads to the next-hop Para-
                                                                                                graph B. Existing neural-based or non-neural based
                                        1   Introduction
                                                                                                methods can not perform well for such questions
                                        Open-domain Question Answering (QA) is a pop-           due to little lexical overlap or semantic relations
                                        ular topic in natural language processing that re-      between the question and Paragraph B.
                                        quires models to answer questions given a large col-       To tackle this challenge, the mainstream studies
                                        lection of text paragraphs (e.g., Wikipedia). Most      model all paragraphs as a graph, where paragraphs
                                        previous works leverage a retrieval model to cal-       are connected if they have the same entity men-
                                        culate semantic similarity between a question and       tions or hyperlink relation (Ding et al., 2019; Zhao
                                        each paragraph to retrieve a set of paragraphs, then    et al., 2020; Asai et al., 2020). For example, Graph-
                                        a reading comprehension model extracts an answer        based Recurrent Retriever (Asai et al., 2020) first
                                        from one of the paragraphs. These pipeline meth-        leverage non-parameterized methods (e.g., TF-IDF
                                        ods work well in single-hop question answering,         or BM25) to retrieve a set of initial paragraphs as
                                        where the answer can be derived from only one           starting points, then a neural retrieval model will
                                        paragraph. However, many real-world questions           determine whether paragraphs that linked to an ini-
                                        produced by users are multi-hop questions that re-      tial paragraph is the next step of reasoning path.
                                        quire reasoning across multiple documents or para-      However, the method suffers several intrinsic draw-
                                        graphs.                                                 backs: (1) These graph-based methods are based on
                                           In this paper, we study the problem of textual       a hypothesis that evident paragraphs should have
                                        multi-hop question answering at scale, which re-        the same entity mentions or linked by a hyperlink
so that they can perform well on bridge questions.          trieves them for solving multi-hop questing
However, for comparison questions (e.g., does A             answering at scale.
and B have the same attribute?), the evidence para-
graphs about A and B are independent. (2) Once             • We implement the framework and propose the
a paragraph is identified as an evidence paragraph,          Gated Memory Flow model. Besides, we in-
graph-based retriever like the Graph-based Recur-            troduce a simple method to find more valuable
rent Retriever (Asai et al., 2020) will continue to          negatives for open-domain multi-hop QA.
determine whether the paragraphs linked from the
                                                           • Our method significantly outperforms all the
evidence paragraph are relevant to the question,
                                                             published methods on HotpotQA by a large
which implies that these models have to assume
                                                             margin. Extensive experiments demonstrate
the relation between paragraphs is directional. (3)
                                                             the effectiveness of our method.
Several graph-based methods have to process all
adjacent paragraphs simultaneously, leading to in-
efficient memory usage.                                2    Related Work
   In this paper, we discard the mainstream prac-
tice. Instead, we propose to treat all candidate       Textual Multi-hop Question Answering. In
paragraphs as sequential data and identify them        contrast to single-hop question answering, multi-
iteratively. The newly proposed framework for          hop question answering requires model reasoning
solving multi-hop information retrieval has several    across multiple documents or paragraphs to derive
desirable advantages: (1) The framework does not       the answer. In recent years, this topic has attracted
assume entity mentions or hyperlinks between two       many researchers attention, and many datasets have
evidence paragraphs so that it can be suitable for     been released (Welbl et al., 2018; Talmor and Be-
both bridge and comparison questions. (2) It is        rant, 2018; Yang et al., 2018; Khot et al., 2020; Xie
natural for the framework to take all paragraphs       et al., 2020). In this paper, we focus our study on
connected to or from a certain paragraph into con-     textual on the textual multi-hop QA dataset, Hot-
sideration. (3) As only one paragraph is located in    potQA.
GPU memory at the same time, the framework is             Methods for solving multi-hop QA can be
memory efficient.                                      roughly divided into two categories: graph-based
   We implement the framework and propose the          methods and query reformulation based meth-
Gated Memory Flow model. Inspired by Neural            ods. Graph-based methods often construct an
Turning Machine (Graves et al., 2014), we lever-       entity graph based on co-reference resolution or
age an external rewritable memory to memorizes         co-occurrence (Dhingra et al., 2018; Song et al.,
information extract from previous paragraphs. The      2018; De Cao et al., 2019). These works show
model iteratively reads a paragraph and the mem-       the GNN frameworks like graph convolution net-
ory to identify whether the current paragraph is an    work (Kipf and Welling, 2017) and graph attention
evidence paragraph. Due to many noise paragraphs       network (Veličković et al., 2018) perform on an
in the candidate set, we design a threshold gating     entity graph can achieve promising results in Wiki-
mechanism to control the writing procedure. In         Hop. Tu (2019) propose to model paragraphs and
practice, we find that different negative examples     candidate answers as different kinds of nodes in
in the training set affect the retrieval performance   a graph and extend an entity graph to a heteroge-
significantly and propose a simple method to find      neous graph. In order to adapt such methods to
more valuable negatives for the retrieval model.       span-based QA tasks like HotpotQA, DFGN (Qiu
   Our method significantly outperforms previous       et al., 2019) leverage a dynamical fusion layer to
methods on HotpotQA under both the full wiki and       combine graph representations and sequential text
the distractor settings. In analysis experiments, we   representations.
demonstrate the effectiveness of our method and           Query reformulation based methods solve multi-
how different settings in the model affect down-       hop QA tasks by implicitly or explicitly reformu-
stream performance.                                    lating the query at different reasoning hop. For
   Our contributions are summarized as follows:        example, QFE (Nishida et al., 2019) update query
                                                       representations at different hop. DecompRC (Min
   • We propose a new framework that treats para-      et al., 2019) decomposes a compositional question
     graphs as sequential data and iteratively re-     into several single-hop questions.
Term-Based Retrieval                           Gated Memory Flow                                            Question Answering

                                                                                                     Cascade Prediction Layer
                                                Value Embeddings                                                                       Answer Type

                                                                                                                                       Answer Span
                                                 Key Embeddings
                                                                                                                                     Supporting Facts
                                                                                Write
                  Query
                                                                  Read       Gate                                                     Gold Paragraph

                     Scandi                                       Pre-trained Model                                      Pre-trained Model

                                                              ……                 ……              Query                          P1    P2     ……
         Skwm        STF-IDF    Shyperlink
                                                        P1                Pi             Pn

                                             Figure 2: Overview of our architecture.

Multi-hop Information Retrieval. Chen (2017)                             3     Method
first propose to leverage the entire Wikipedia para-
graphs to answer open-domain questions. Most ex-                         In this section, we introduce the pipeline we de-
isting open-domain QA methods use a pipeline ap-                         signed for multi-hop QA, the overview of our sys-
proach that includes a retriever model and a reader                      tem is shown in Figure 2. We first leverage a heuris-
model. For open-domain multi-hop QA, graph-                              tic term-based retrieval method to find candidate
based methods are also mainstream practice. These                        paragraphs for the question Q as much as possible.
studies organized paragraphs as nodes in a graph                         Our Gated Memory Flow model takes all candidate
structure and connected them according to hyper-                         paragraphs as input and processes them one by one.
links. Cognitive Graph (Ding et al., 2019) employ                        Finally, paragraphs that gain the highest relevance
a reading comprehension model to predict answer                          score for a question are selected and fed into a neu-
spans and next-hop answer to construct a cogni-                          ral question answering model. The reader model
tive graph. Transformer-XH (Zhao et al., 2020)                           uses a cascade prediction layer to predict all targets
introduce eXtra Hop attention to reasons over a                          in a multi-task way.
paragraph graph and produces the global represen-
                                                                         3.1        Term-Based Retrieval
tation of all paragraphs to predict the answers. Due
to encoding paragraphs independently, this method                        We leverage a heuristic term-based retrieval method
lacks fine-grained evidence paragraphs interaction                       to construct the candidate paragraphs set that con-
and memory-inefficient. Asai (2020) proposes a                           tains all possible paragraphs. Paragraphs in the
graph-based recurrent model to retrieves a reason-                       candidate set come from three sources.
ing chain through a path in the paragraph graph.
As described in Section 1, this method may not                           Key Word Matching. We retrieve all paragraphs
perform well on comparison questions and suffer                          whose title exact match terms in the query. The top
from directional edges caused low recall.                                Nkwm paragraphs with highest TF-IDF score are
                                                                         collected to set Pkwm .

                                                                         TF-IDF. The top Ntf idf paragraphs retrieved by
   There are also several studies propose query                          DrQA’s TF-IDF on the question (Chen et al., 2017).
reformulation based methods for multi-hop infor-
mation retrieval. Multi-step Reasoner (Das et al.,                       Hyperlink. Paragraphs that necessary for de-
2019) employees a retriever interacts with a reader                      rived the answer but do not have lexical overlap
model and reformulates the query in a latent space                       to the question often have hyperlink relation to
for the next retrieval hop. GoldEn Retriever (Qi                         the paragraphs retrieved by key work matching or
et al., 2019) generates several single-hop questions                     TF-IDF. Therefore, for each paragraph Pi in set
at different hop. These methods often perform                            Pkwm and Ptf idf , we extract all the hyperlinked
poorly in result for the error accumulation through                      paragraphs from Pi to construct set Phyperlink . Un-
different hops.                                                          like previous works (Nie et al., 2019; Asai et al.,
2020), both paragraphs that have a hyperlink con-           which defines the memory slots as pairs of vec-
nect to Pi and paragraphs connected from Pi are             tors {(k1 , v1 ), . . . , (klm , vlm )}. In our implemen-
taken into consideration.                                   tation, the key vectors and value vectors are the
   Finally, the three collections are merged into a         same. Only a part of vectors are necessary for iden-
candidate set Pcand .                                       tifying Pt , therefore we use key vectors and query
                                                            vector to address the memory. The addressing and
3.2   Gated Memory Flow                                     reading procedure are similar to soft attention:
After retrieving the candidate set, we propose a
                                                                      pi = Sof tmax(Wq xt · Wk ki )              (5)
neural-based model named Gated Memory Flow                                 X
that accurately selects evidence paragraphs from                      ot =    pi Wv vi                           (6)
the candidate set.                                                            i

Model. As describe above, for a compositional               where pi denotes the readout probability of i-th
question, whether a paragraph contains evidence             memory slot. In our implementation, we extend
information is not only dependent on the query but          the attention readout mechanism to a multi-head
also other paragraphs. The task can be formulated           way (Vaswani et al., 2017):
as:                                                                                        (1)         (h)
                                                                        ot = Concat(ot , . . . , ot )            (7)
         arg max P (Pt ∈ E|Q, P1:t−1 , θ)      (1)
              θ                                              (h)
                                                            ot denotes the output of h-th readout head. Due
where the E denotes evidence set, θ denotes the pa-         to many irrelevant paragraphs in the candidate set,
rameters of the model, Pt is the t-th paragraphs in         we leverage a threshold gating mechanism to con-
candidate set Pcand = {P1 , . . . , Pt , . . . , Pn } and   trol the writing permission. The writing operation
P1:t−1 represents the processed t − 1 paragraphs.           is concatenation for keeping the model concise.
   At the t-th time step, the model takes paragraph                       (
Pt as input and determines whether it is an evidence                        Mt               st < gate
paragraph depending on identified paragraphs. The               Mt+1 =                                     (8)
                                                                            Concat(Mt , xt ) st ≥ gate
question Q and paragraph Pt are concatenated and
fed into a pre-trained model to get the query-aware         The value of scalar gate is a pre-defined hyper-
paragraph representations Ht ∈ Rl×d . To decrease           parameter.
parameters, we compress the matrix into a vec-
tor representation, then we employ the vector as a          Training. The parameters in GMF were updated
query vector to address desired information from            with a binary cross entropy loss function:
the external memory module.                                                        X
                                                                      Lretri = −         log(st )
                                                                                      Pt ∈Ppos
          xt = M eanP ooling(Ht ) ∈ Rd               (2)                                X                        (9)
                                                                                  −              log(1 − st )
  We denote the external memory at step t as                                          Pt ∈Pneg
Mt ∈ Rlm ×d , where lm is number of memory slots
                                                            where Pneg contains eight negative paragraphs sam-
that have stored information at step t. We denote
                                                            pled by our negative sampling strategy for each
the readout vector representation at step t as ot ,
                                                            question. At the beginning of training, the model
which include the missing information to identify
                                                            can not give each paragraph an accurate relevant
the current paragraph. The ot and xt are used to
                                                            score, leading to unexpected behavior of the gate.
identify Dt :
                                                            For this reason, we only activate the gate after the
       ht = tanh(Wo · ot + W · xt ) ∈ Rd             (3)    first epoch of training.
       st = σ(Ws · ht )                              (4)    Training with Harder Negative Examples. For
                                                            each question in the training stage, we pass two
where st represent the relevant score between para-         gold paragraphs and eight negative examples from
graph Pt and question Q.                                    the term-based retrieval result to Gated Memory
  Now we describe the memory module and                     Flow. Intuitively, different negative sampling strate-
read/write procedure in detail. We follow the               gies may influence the training result, but previous
KVMemNN (Miller et al., 2016) architecture,                 works do not explore the effect.
We empirically demonstrate harder negative ex-       tions.
amples lead to better training results and propose
a simple but effective strategy to find them (see                Mp = BiLSTM(C) ∈ Rm×d2                      (10)
Sec. 4.4). Specifically, we first train a BERT-based          Opara =     σ(Wp [Mp [Ps(i) ]; Mp [Pe(i) ]])   (11)
binary classifier to calculate the relevant score be-
                                                                   (i)          (i)
tween the questions and paragraphs. Eight para-         where Ps and Pe denote the start and end to-
graphs are randomly selected from term based re-        ken index of i-th paragraph. Then we can predict
trieval results, combined with two evidence para-       whether the i-th sentence is a supporting fact de-
graphs for each question. Then we employ the            pend on the outputs of evidence paragraph predic-
model as a ranker to select top-8 paragraphs as         tion.
augmented negative examples. We train our GMF
model with these augmented examples.                           Ms = BiLSTM([C, Opara ]) ∈ Rm×d2              (12)
                                                            Osent =      σ(Ws [Ms [Ss(i) ]; Ms [Se(i) ]])    (13)
Inference. At the inference stage, there may be
hundreds of candidate paragraphs for a question.                   (i)         (i)
                                                        where Ss and Se denote the start and end token
Nevertheless, it is unnecessary to put all of them in   index of i-th sentence. The answer span and type
GPU memory besides the parameters of the model          can be predicted in a similar way.
and the current paragraph Pt . To balance inference
speed and memory usage, we load a chunk of para-              Ostart = SubLayer([C, Osup ])                  (14)
graphs into GPU memory at once. After the model                Oend = SubLayer([C, Osup , Ostart ])          (15)
gives all paragraphs a relevant score, we sort them           Otype = SubLayer([C, Osup , Oend ])            (16)
by the score and retrieve the top Nretri paragraphs
with the scores higher than a threshold hd for each     where SubLayer() denotes the similar LSTM layer
question. We take these paragraphs as the input for     and linear projection in eq. 12-13. We compute
the reader model.                                       a cross entropy loss over each outputs and jointly
                                                        optimize them.
3.3   Reader Model
                                                            Lreader = λa Lstart + λa Lend + λp Lpara
There are many existing readers model for multi-                                                             (17)
                                                                                      + λs Lsup + λt Ltype
hop QA, most of which use a graph-based approach.
However, a recent study (Shao et al., 2020) shows       We assign each loss term a coefficient.
self-attention in the pre-trained models can perform
well in multi-hop QA. Therefore, we employ a            4     Experiments
pre-trained model as a reader to solve multi-hop        In this section, we describe the setup and results
reasoning.                                              of our experiments on the HotpotQA dataset. We
   We combine all the paragraphs selected by Gated      compare our method with published state-of-the-art
Memory Flow into context C. For each example,           methods(Asai et al., 2020) in retrieval and down-
we concatenate the question Q and context C as          stream QA performance to demonstrate the superi-
input and fed into a pre-trained model following a      ority of our proposed architecture.
projection layer.
   We follow the similar structure of the prediction    4.1      Experimental Setup
layer from Yang (2018). There are four sub-tasks        Dataset. Our experiments are conducted on Hot-
for the prediction layer: (1) evidence paragraphs       potQA (Yang et al., 2018), a widely used textual
prediction as an auxiliary task; (2) supporting facts   multi-hop question answering dataset. For each
prediction based on evidence paragraphs; (3) the        question, models are required to extract a span of
start and end positions of span answer; (4) answer      text as an answer and corresponding sentences as
type prediction. We use a cascade structure due         supporting facts. Besides, there are two different
to the dependency between different outputs. Five       settings in HotpotQA: distractor setting and full
isomorphic BiLSTMs are stacked layer by layer to        wiki setting. For each question, two evidence para-
obtain different representations for each sub-task.     graphs and eight distractor paragraphs collected by
To predict evidence paragraphs, we take the first       TF-IDF are provided in the distractor setting. For
and last token of i-th paragraph as its representa-     the full wiki setting, each answer and supporting
Full wiki                             Distractor
                                                        Answer         Sup                    Answer          Sup
     Model
                                                       EM           F1      EM      F1      EM           F1         EM          F1
     Baseline (Yang et al., 2018)                      24.7     34.4         5.3   41.0     44.4       58.3        22.0     66.7
     QFE (Nishida et al., 2019)                         -        -            -     -       53.7       68.7        58.8     84.7
     DFGN (Qiu et al., 2019)                            -        -            -     -       55.4       69.2
     Cognitive Graph QA (Ding et al., 2019)            37.6     49.4        23.1   58.5      -          -           -        -
     GoldEn Retriever (Qi et al., 2019)                 -       49.8          -    64.6      -          -           -        -
     SemanticRetrievalMRS (Nie et al., 2019)           46.5     58.8        39.9   71.5      -          -           -        -
     Transformer-XH (Zhao et al., 2020)                50.2     62.4        42.2   71.6      -          -           -        -
     GRR w. BERT (Asai et al., 2020)                   60.5     73.3        49.3   76.1     68.0       81.2        58.6     85.2
     DDRQA w. ALBERT-xxlarge♣ (Zhang et al., 2020)     62.5     75.9        51.0   78.8       -          -           -           -
     Gated Memory Flow                                 63.6     76.5        54.5   80.2     69.6       83.0        64.7     89.0

Table 1: Results on HotpotQA development set in the fullwiki and distractor setting. “-” denotes no results are
available. “♣” indicates the the work is recently presented on arXiv.

facts should be extracted from the entire Wikipedia.                       Retrieval model           Reader Model
HotpotQA contains two question types: bridge                              Nkwm             10      λa , λ p , λ t          1
question and comparison question. The former                              Ntf idf          5       λs                     15
                                                                          h                16      epoch                   4
requires models to reasoning from one evidence to                         d              1024      lr                    1e-5
another. To answer the comparison question, mod-                          Nretri           8       batch                   8
els need to compare two entities described in two                         hd             0.025
                                                                          gate value      0.2
paragraphs.                                                               epoch            2
                                                                          lr             1e-5
Metrics. We evaluate our pipeline system not                              batch            6
only on downstream QA performance but also on
the intermediate retrieval result. We report F1 and           Table 2: List of hyper-parameters used in our model.
EM scores for evaluating QA performance and Sup-
porting Fact F1 (Sup F1) and Supporting Fact EM                          Methods                  AR          PR         P EM
(Sup EM) to evaluate the supporting fact sentences                       Entity-centric IR        63.4        87.3        34.9
retrieval performance. In addition, joint EM and                         Cognitive Graph          76.0        87.6        57.8
                                                                         Semantic Retrieval       77.9        93.2        63.9
F1 scores are used to measure the system of the                          GRR w. BERT              87.0        93.3        72.7
joint performance of the QA model. To measure
                                                                         GMF w. BERT              90.8        94.1        85.7
the intermediate retrieval performance, we follow                        GMF w. RoBERT            91.7        94.7        86.3
Asai (2020) to use the three metrics: Answer Recall
(AR), which evaluate whether the answer is located        Table 3: Comparing our retrieval method with other
in the selected paragraphs, Paragraph Recall (PR),        published methods across Answer Recall, Paragraph
which evaluate if at least one of the evidence para-      Recall and Paragraph EM metrics.
graphs is in retrieved set, Paragraph Exact Match
(P EM), which evaluate whether all evidence para-         ALBERT-xxlarge (Lan et al., 2019) for the reader
graphs are retrieved. The number of selected para-        model. Furthermore, all hyper-parameters in our
graphs for the reader is alternative, hence we also       system are listed in Table 2.
report the precision score to evaluate the retrieval
performance more appropriately.                           4.2         Main Results

Implementation Details. Considering accuracy              Downstream QA. We first evaluate our pipeline
and computational cost, we use different pre-             system on the downstream task on HotpotQA de-
trained models in different components in the sys-        velopment set1 . Table 1 shows the results in full
tem. We report the whole pipeline system re-              wiki setting. The proposed GMF outperforms all
sult in Section 4.2 when the GMF is based on a            published works on every metric by a large margin.
RoBERTa-large model (Liu et al., 2019). In analy-         The GMF achieves 3.1/3.2 points absolute improve-
sis experiments, the GMF is based on a RoBERTa-           ment on Answer EM/F1 scores and 5.2/4.1 on Sup
                                                                1
base model for saving computation cost. We use                      We will also submit our model to blind test set soon.
Retrieval                       QA
                   Settings
                                            AR     PR   P EM      Prec.     Joint EM Joint F1
                   Gated Memory Flow       91.7   94.7    86.3     49.1       39.6           65.8
                     -Bi-direct. Doc.      87.2   95.6    78.8     51.3       37.1           62.3
                     -Threshold Gate       91.2   94.3    85.6     33.8       39.0           65.3
                     -Rewritable Memory    91.4   94.5    85.8     42.5       39.1           65.1
                     -Harder Negatives     91.5   94.5    86.0     39.2       38.4           64.8

      Table 4: Ablation study on the effectiveness of the GMF model on the dev set in the full wiki setting.

EM/F1 scores over the published state-of-the-art             Model        Source      AR        PR    P EM   Prec.
results, which implies the superiority of the whole          baseline     K.W.M.      92.2     94.6   86.4   37.3
pipeline design. We also report our method results           baseline     TF-IDF      91.8     94.5   86.3   39.2
                                                             baseline     Hyperlink   91.3     94.6   86.4   32.2
in the distractor setting. Our GMF also achieves             baseline     BERT        92.4     94.6   86.5   43.4
strong results in this setting.
                                                             GMF          K.W.M.      91.6     94.4   86.1   36.8
Retrieval. The retrieval results in full wiki setting        GMF          TF-IDF      91.5     94.2   86.0   39.2
are presented in Table 3. To fairly compared with            GMF          Hyperlink   91.2     94.5   85.9   34.6
                                                             GMF          BERT        91.7     94.7   86.3   49.1
the published state-of-the-art results, we also re-
port the results of our model based on BERT. Our           Table 5: Effectiveness of training data with different
GMF achieves 3.4 AR and 13.0 P EM over the                 data distribution.
previous state-of-the-art model. The significant im-
provement comes from the proposed architecture
for solving multi-hop retrieval. The new frame-            tently decreases after ablating each component.
work abandons many unnecessary presuppositions
in graph-based methods and does not retrieve a             4.4   Analysis
reasoning path explicitly. Our GMF benefits from           Effectiveness of Negatives. We find that the
the generalization and flexibility of the proposed         training data sampled from different sources will
framework, leading to a high recall of evidence            significantly affect the final retrieval performance.
paragraphs. Note that we achieve a high recall with        In Table 5, we report our GMF retrieval perfor-
very few retrieved paragraphs. The average num-            mance compared with a RoBERTa based re-rank
ber of retrieved paragraphs for each question is less      baseline model, where the training data is sampled
than 4.                                                    from Pkwm , Ptf idf , Phyper and paragraphs selected
                                                           by a pre-trained model.
4.3   Ablation Study                                          The experiments show that the models can
                                                           achieve comparable recall in different settings.
To evaluate the effectiveness of our proposed              However, Comparing data sampled from Pkwm and
method, we perform ablation study on our GMF               Phyperlink , using training data sampled from Ptf idf ,
model. In table 4, we report the ablation results          the GMF can gain about 2.4% and 4.6% precision
in the development set. From the table, we can             scores improvement respectively. The results imply
see that only retrieved paragraphs connected from          that, in the training phase, examples sampled from
paragraphs in Pkwm and Ptf idf will improve the            Ptf idf or Pkwm are more confusing than Phyperlink
precision but significantly hurt the recall scores         for more lexical overlap. Therefore, we further
and downstream performance. We find different              leverage a neural model to rank the candidate para-
components do not influence the recall but lead            graphs and sample the most challenging examples,
to higher precision. In particular, using the exter-       leading to 9.9% absolute precision improvement.
nal rewritable memory can provide 6.6% precision           In addition, our GMF achieves better retrieval per-
improvement, which implies the effectiveness of            formance in each different setting, which implies
modeling the history of retrieval. The gating mech-        the Effectiveness of our proposed model.
anism can help the model not memorizes and re-
trieve irrelevant information. The precision will          Analysis on Candidate Set. We also evaluate
decreases by 16% point after removing the gate. In         how different sizes of the candidate set to influence
addition, harder negatives also improve retrieval          the downstream QA performance. Previous work
performance. The QA performance also consis-               (Asai et al., 2020) uses recall as a metric to measure
Q: Who is the mother of             Walchelin de Ferriers                                           Q: Which is farther west,           Sheridan County
        the king that Walchelin de                                                                              Sheridan County,
                                               John of Berenne
          Ferriers was principal                                                                              Montana or Chandra
               captain to?                                                                                            Taal?

                                                           ……                                                                                               ……
             Walchelin de Ferriers    Marie de Coucy                      Richard I of England                 Sheridan County, Montana Outlook, Montana                   Chandra Taal
                 score: 0.99           score: 0.15                            score: 0.26                             score: 0.99         score: 0.006                      score: 0.99

       Walchelin de Ferrieres (or Walkelin
                                                                     He was the third of five sons of                                                               He was the third of five sons of
         de Ferrers) (died 1201) was a                                                                     Sheridan County is a county located
                                                                      King Henry II of England and            in the U.S. state of Montana.                         King Henry II of England and
       Norman baron and principal captain
                                                                     Duchess Eleanor of Aquitaine.                                                                 Duchess Eleanor of Aquitaine.
         of King Richard I of England.

              Figure 3: Examples of evidence paragraphs prediction in the full wiki setting of HotpotQA .

the quality of retrieval. Intuitively, more retrieved                                                      Top Ntf idf            Recall         Num.              Joint EM              Joint F1
paragraphs will cause a higher recall. However,                                                            Top 5                    93.7          94                   40.1                 66.1
does higher recall leads to better downstream task                                                         Top 10                   95.0          130                  39.8                 65.8
                                                                                                           Top 15                   95.7          167                  39.5                 65.6
performance? We increase the number of para-                                                               Top 20                   96.2          203                  39.1                 65.1
graphs retrieved by the TF-IDF method in the term-
based retrieval phase. As expected, Table 6 shows                                                      Table 6: Comparing the performance with different
more retrieved paragraphs lead to a higher recall.                                                     recall of candidate set.
Notice that the number of candidate paragraphs
|Pcand | significantly increases with Ntf idf . Nev-
                                                                                                       dan County, Montana” and “Chandra Taal”. The
ertheless, Joint EM and Joint F1 scores do not in-
                                                                                                       two evidence paragraphs gain a very high relevant
crease as recall. Because there are too many noise
                                                                                                       score, while others score approximately equal to
paragraphs in the candidate set, some will be re-
                                                                                                       zero. In fact, because the keywords “Sheridan
trieved and misleading the reader model. The anal-
                                                                                                       County, Montana” and “Chandra Taal” appear
ysis experiment suggests that only using recall as
                                                                                                       in the question, such comparison questions are
a metric to evaluate retrieval quality is not enough.
                                                                                                       comparatively easy for not only GMF but also a
We hope the future works introduce additional met-
                                                                                                       RoBERTa based re-rank baseline model. However,
rics (e.g., precision) to evaluate the retrieval model
                                                                                                       such questions may be tricky for graph-based re-
they proposed comprehensively.
                                                                                                       triever due to lack of hyperlink connection or the
                                                                                                       same entity mentions between them. In contrast,
 4.5   Case Study
                                                                                                       our architecture does not require such presupposi-
 We provide two examples to understand how our                                                         tions, leading to better generalization and flexibil-
 model works. In Figure 3, the first question is a                                                     ity.
 bridge question. To answer the question, our model
 first retrieves the paragraph “Walchelin de Ferri-                                                    5     Conclusion
 eres” with high confidence. Then the model takes                                                      In this paper, to tackle the multi-hop information
 paragraph “Marie de Councy” and gives a 0.15                                                          retrieval challenge, we introduce an architecture
 relevance score. The score is lower than the thresh-                                                  that models a set of paragraphs as sequential data
 old value of the gate, so the paragraph will not be                                                   and iteratively identifies them. Specifically, we
 written to the memory. There are two paragraphs                                                       propose Gated Memory Flow to iterative read and
 stored in the memory when the model is trying                                                         memorize reasoning required information without
 to calculate the relevance between the paragraph                                                      noise information interference. We evaluate our
“Richard I England” and the question. The model                                                        method on both full wiki and distractor settings on
 takes the representation of paragraph as input to                                                     the HotpotQA dataset and the method outperforms
 address necessary information from the memory                                                         previous works by a large margin. In the future, we
 module, then the memorized paragraph “Walchelin                                                       will attempt to design a more complicated model
 de Ferriers” is readout.                                                                              to improve retrieval performance and explore more
    The second case is a comparison question. The                                                      about the effect of training data with different data
 question asks to compare the locations of “Sheri-                                                     distribution for multi-hop information retrieval.
References                                                Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man-
                                                            dar Joshi, Danqi Chen, Omer Levy, Mike Lewis,
Akari Asai, Kazuma Hashimoto, Hannaneh Hajishirzi,          Luke Zettlemoyer, and Veselin Stoyanov. 2019.
  Richard Socher, and Caiming Xiong. 2020. Learn-           Roberta: A robustly optimized bert pretraining ap-
  ing to retrieve reasoning paths over wikipedia graph      proach. arXiv preprint arXiv:1907.11692.
  for question answering. In International Conference
  on Learning Representations.                            Alexander Miller, Adam Fisch, Jesse Dodge, Amir-
Danqi Chen, Adam Fisch, Jason Weston, and Antoine           Hossein Karimi, Antoine Bordes, and Jason Weston.
  Bordes. 2017. Reading Wikipedia to answer open-           2016. Key-value memory networks for directly read-
  domain questions. In Proceedings of the 55th An-          ing documents. In Proceedings of the 2016 Con-
  nual Meeting of the Association for Computational         ference on Empirical Methods in Natural Language
  Linguistics (Volume 1: Long Papers), pages 1870–          Processing, pages 1400–1409, Austin, Texas. Asso-
  1879, Vancouver, Canada. Association for Computa-         ciation for Computational Linguistics.
  tional Linguistics.
                                                          Sewon Min, Victor Zhong, Luke Zettlemoyer, and Han-
Rajarshi Das, Shehzaad Dhuliawala, Manzil Zaheer,           naneh Hajishirzi. 2019. Multi-hop reading compre-
  and Andrew McCallum. 2019. Multi-step retriever-          hension through question decomposition and rescor-
  reader interaction for scalable open-domain question      ing. In Proceedings of the 57th Annual Meeting
  answering. In International Conference on Learn-          of the Association for Computational Linguistics,
  ing Representations.                                      pages 6097–6109, Florence, Italy. Association for
                                                            Computational Linguistics.
Nicola De Cao, Wilker Aziz, and Ivan Titov. 2019.
  Question answering by reasoning across documents        Yixin Nie, Songhe Wang, and Mohit Bansal. 2019.
  with graph convolutional networks. In Proceed-            Revealing the importance of semantic retrieval for
  ings of the 2019 Conference of the North American         machine reading at scale. In Proceedings of the
  Chapter of the Association for Computational Lin-         2019 Conference on Empirical Methods in Natu-
  guistics: Human Language Technologies, Volume 1           ral Language Processing and the 9th International
  (Long and Short Papers), pages 2306–2317, Min-            Joint Conference on Natural Language Processing
  neapolis, Minnesota. Association for Computational        (EMNLP-IJCNLP), pages 2553–2566, Hong Kong,
  Linguistics.                                              China. Association for Computational Linguistics.
Bhuwan Dhingra, Qiao Jin, Zhilin Yang, William Co-        Kosuke Nishida, Kyosuke Nishida, Masaaki Nagata,
  hen, and Ruslan Salakhutdinov. 2018. Neural mod-          Atsushi Otsuka, Itsumi Saito, Hisako Asano, and
  els for reasoning over multiple mentions using coref-     Junji Tomita. 2019. Answering while summarizing:
  erence. In Proceedings of the 2018 Conference of          Multi-task learning for multi-hop QA with evidence
  the North American Chapter of the Association for         extraction. In Proceedings of the 57th Annual Meet-
  Computational Linguistics: Human Language Tech-           ing of the Association for Computational Linguistics,
  nologies, Volume 2 (Short Papers), pages 42–48,           pages 2335–2345, Florence, Italy. Association for
  New Orleans, Louisiana. Association for Computa-          Computational Linguistics.
  tional Linguistics.
                                                          Peng Qi, Xiaowen Lin, Leo Mehr, Zijian Wang, and
Ming Ding, Chang Zhou, Qibin Chen, Hongxia Yang,            Christopher D. Manning. 2019. Answering complex
  and Jie Tang. 2019. Cognitive graph for multi-hop         open-domain questions through iterative query gen-
  reading comprehension at scale. In Proceedings of         eration. In Proceedings of the 2019 Conference on
  the 57th Annual Meeting of the Association for Com-       Empirical Methods in Natural Language Processing
  putational Linguistics, pages 2694–2703, Florence,        and the 9th International Joint Conference on Natu-
  Italy. Association for Computational Linguistics.         ral Language Processing (EMNLP-IJCNLP), pages
                                                            2590–2602, Hong Kong, China. Association for
Alex Graves, Greg Wayne, and Ivo Danihelka.
                                                            Computational Linguistics.
  2014. Neural turing machines. arXiv preprint
  arXiv:1410.5401.                                        Lin Qiu, Yunxuan Xiao, Yanru Qu, Hao Zhou, Lei
Tushar Khot, Peter Clark, Michal Guerquin, P. Jansen,       Li, Weinan Zhang, and Yong Yu. 2019. Dynami-
  and A. Sabharwal. 2020. Qasc: A dataset for ques-         cally fused graph network for multi-hop reasoning.
  tion answering via sentence composition. In AAAI.         In Proceedings of the 57th Annual Meeting of the
                                                            Association for Computational Linguistics, pages
Thomas N. Kipf and Max Welling. 2017. Semi-                 6140–6150, Florence, Italy. Association for Compu-
  supervised classification with graph convolutional        tational Linguistics.
  networks. In International Conference on Learning
  Representations (ICLR).                                 Nan Shao, Yiming Cui, Ting Liu, Shijin Wang, and
                                                            Guoping Hu. 2020.      Is graph structure neces-
Zhenzhong Lan, Mingda Chen, Sebastian Goodman,              sary for multi-hop reasoning?    arXiv preprint
  Kevin Gimpel, Piyush Sharma, and Radu Soricut.            arXiv:2004.03096.
  2019. Albert: A lite bert for self-supervised learn-
  ing of language representations. arXiv preprint         Linfeng Song, Zhiguo Wang, Mo Yu, Yue Zhang,
  arXiv:1909.11942.                                         Radu Florian, and Daniel Gildea. 2018. Exploring
graph-structured passage representation for multi-      Transformer-xh: Multi-evidence reasoning with ex-
  hop reading comprehension with graph neural net-        tra hop attention. In International Conference on
  works. arXiv preprint arXiv:1809.02040.                 Learning Representations.

Alon Talmor and Jonathan Berant. 2018. The web
  as a knowledge-base for answering complex ques-
  tions. In Proceedings of the 2018 Conference of the
  North American Chapter of the Association for Com-
  putational Linguistics: Human Language Technolo-
  gies, Volume 1 (Long Papers), pages 641–651, New
  Orleans, Louisiana. Association for Computational
  Linguistics.

Ming Tu, Guangtao Wang, Jing Huang, Yun Tang, Xi-
  aodong He, and Bowen Zhou. 2019. Multi-hop read-
  ing comprehension across multiple documents by
  reasoning over heterogeneous graphs. In Proceed-
  ings of the 57th Annual Meeting of the Association
  for Computational Linguistics, pages 2704–2713,
  Florence, Italy. Association for Computational Lin-
  guistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
  Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
  Kaiser, and Illia Polosukhin. 2017. Attention is all
  you need. In Advances in neural information pro-
  cessing systems, pages 5998–6008.

Petar Veličković, Guillem Cucurull, Arantxa Casanova,
  Adriana Romero, Pietro Liò, and Yoshua Bengio.
  2018. Graph attention networks. In International
  Conference on Learning Representations.

Johannes Welbl, Pontus Stenetorp, and Sebastian
  Riedel. 2018. Constructing datasets for multi-hop
  reading comprehension across documents. Transac-
  tions of the Association for Computational Linguis-
  tics, 6:287–302.

Zhengnan Xie, Sebastian Thiem, Jaycie Martin, Eliz-
  abeth Wainwright, Steven Marmorstein, and Peter
  Jansen. 2020. WorldTree v2: A corpus of science-
  domain structured explanations and inference pat-
  terns supporting multi-hop inference. In Proceed-
  ings of the 12th Language Resources and Evaluation
  Conference, pages 5456–5473, Marseille, France.
  European Language Resources Association.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio,
  William Cohen, Ruslan Salakhutdinov, and Christo-
  pher D. Manning. 2018. HotpotQA: A dataset
  for diverse, explainable multi-hop question answer-
  ing. In Proceedings of the 2018 Conference on Em-
  pirical Methods in Natural Language Processing,
  pages 2369–2380, Brussels, Belgium. Association
  for Computational Linguistics.

Yuyu Zhang, Ping Nie, Arun Ramamurthy, and
  Le Song. 2020. Ddrqa: Dynamic document rerank-
  ing for open-domain multi-hop question answering.
  arXiv preprint arXiv:2009.07465.

Chen Zhao, Chenyan Xiong, Corby Rosset, Xia
  Song, Paul Bennett, and Saurabh Tiwary. 2020.
You can also read