Open-Retrieval Conversational Machine Reading

 Yifan Gao, Jingjing Li, Michael R. Lyu, Irwin King
 Department of Computer Science and Engineering,
 The Chinese University of Hong Kong, Shatin, N.T., Hong Kong
 yfgao@cse.cuhk.edu.hk | yifangao95@gmail.com

Abstract

In conversational machine reading, systems need to interpret natural language rules, answer high-level questions such as "May I qualify for VA health care benefits?", and ask follow-up clarification questions whose answer is necessary to answer the original question. However, existing works assume the rule text is provided for each user question, which neglects the essential retrieval step in real scenarios. In this work, we propose and investigate an open-retrieval setting of conversational machine reading. In the open-retrieval setting, the relevant rule texts are unknown, so a system needs to retrieve question-relevant evidence from a collection of rule texts and answer users' high-level questions according to multiple retrieved rule texts in a conversational manner. We propose MUDERN, a Multi-passage Discourse-aware Entailment Reasoning Network, which extracts conditions in the rule texts through discourse segmentation, conducts multi-passage entailment reasoning to answer user questions directly, or asks clarification follow-up questions to inquire for more information. On our created OR-ShARC dataset, MUDERN achieves state-of-the-art performance, outperforming existing single-passage conversational machine reading models as well as a new multi-passage conversational machine reading baseline by a large margin. In addition, we conduct in-depth analyses to provide new insights into this new setting and our model.

[Figure 1]
Retrieved Rule Text 1: SBA provides loans to businesses - not individuals - so the requirements of eligibility are based on aspects of the business, not the owners. All businesses that are considered for financing under SBA's 7(a) loan program must: meet SBA size standards, be for-profit, not already have the internal resources (business or personal) to provide the financing, and be able to demonstrate repayment.
Retrieved Rule Text 2: You'll need a statement of National Insurance you've paid in the UK to get these benefits - unless you're claiming Winter Fuel Payments.
Retrieved Rule Text 3: 7(a) loans are the most basic and most used type loan of the Small Business Administration's (SBA) business loan programs. It's name comes from section 7(a) of the Small Business Act, which authorizes the agency to provide business loans to American small businesses. The loan program is designed to assist for-profit businesses that are not able to get other financing from other resources.
User Scenario: I am a 34 year old man from the United States who owns their own business. We are an American small business.
User Question: Is the 7(a) loan program for me?
Follow-up Q1: Are you a for-profit business?
Follow-up A1: Yes.
Follow-up Q2: Are you able to get financing from other resources?
Follow-up A2: No.
Final Answer: Yes. (You can apply for the loan.)

Figure 1: Open-retrieval conversational machine reading task. The machine answers the user question by searching for relevant rule texts, reading the retrieved noisy rule texts (rule text 2 is irrelevant to the user question), interpreting the user scenario, and asking follow-up questions to clarify the user's background until it can conclude a final answer. Question-relevant requirements in the rule texts are in bold.
1 Introduction

Conversational machine reading aims to answer high-level questions that cannot be answered literally from the text (Saeidi et al., 2018). Instead, questions such as "May I qualify for VA health care benefits?" require systems to understand the requirements listed in the rule text and evaluate the user's satisfaction of these requirements. Since users are unaware of the text, they cannot provide all the background information needed in a single turn. In this case, systems may keep asking clarification questions on underspecified requirements until they can make a final decision (e.g., qualify or not). This interactive behavior between users and machines has received attention recently (Zhong and Zettlemoyer, 2019; Gao et al., 2020a) and has applications in intelligent assistant systems such as Siri, Alexa, and Google Assistant.
However, previous studies (Saeidi et al., 2018) simplify this challenging problem by assuming that the relevant rule text is given before the system interacts with users, which is unrealistic in real intelligent assistant systems. In this paper, we consider an Open-Retrieval setting of conversational machine reading. In this setting, a system needs to retrieve relevant rule texts according to the user's question, read these rule texts, interact with the user by asking clarification questions, and finally give the answer. Figure 1 provides an example of this setting. The user asks whether he is suitable for the loan program. The machine retrieves three question-relevant rule texts, but only rule text 1 and rule text 3 are truly relevant. It then understands the requirements listed in the rule texts, finds from the user scenario that the user meets "American small business", asks two follow-up questions about "for-profit business" and "not get financing from other resources", and finally concludes the answer "Yes" to the user's initial question.

There are two major challenges for this new setting. First, existing conversational machine reading datasets such as ShARC (Saeidi et al., 2018) contain incomplete initial questions such as "Can I receive this benefit?" because the annotator is given the rule text for question generation during data collection. This brings ambiguity to our open-retrieval setting because we do not know which benefit the user is asking about. Second, existing conversational machine reading models (Zhong and Zettlemoyer, 2019; Gao et al., 2020a,b; Lawrence et al., 2019) ask and answer conversational questions with the single gold rule text, which is not the case in the open-retrieval setting. Models need to retrieve multiple relevant rule texts and read and reason over these noisy rule texts (e.g., the retrieved rule text 2 in Figure 1 is irrelevant to the question).

We tackle the first challenge by reformulating incomplete questions into their complete versions according to the gold rule text in the ShARC dataset, and create an Open-Retrieval version of ShARC named OR-ShARC. For example, the incomplete question "Can I receive this benefit?" is reformulated by specifying the name of the benefit mentioned in the rule text. Consequently, the ambiguity of the initial user question is resolved. Furthermore, we use TF-IDF to retrieve multiple relevant rule texts, and propose MUDERN: a Multi-passage Discourse-aware Entailment Reasoning Network for conversational machine reading over multiple retrieved rule texts. MUDERN first segments these rule texts into clause-like elementary discourse units (EDUs) using a pre-trained discourse segmentation model (Li et al., 2018). Then it conducts discourse-aware entailment reasoning according to the user-provided scenario and dialog history to decide which requirements are satisfied, in a weakly supervised manner. Finally, it directly answers the user question with "Yes/No" or asks a clarification follow-up question.

We compare MUDERN with previous best single-passage conversational machine reading models as well as a new RoBERTa-based multi-passage conversational machine reading baseline on our created OR-ShARC dataset. Our results show that MUDERN not only makes the "Yes/No/Inquire" decision more accurately at each dialog turn but also generates better and more to-the-point clarification follow-up questions. In particular, MUDERN outperforms the previous best model DISCERN (Gao et al., 2020b) by 8.2% in micro-averaged decision accuracy and 11.8% in question generation F1_BLEU4. Moreover, we conduct in-depth analyses on model ablation and configuration to provide insights for the OR-ShARC task. We show that MUDERN performs extremely well on rule texts seen in training, but there is still large room for improvement on unseen rule texts.

The main contributions of this work, which are fundamental to significantly pushing the state of the art in open-retrieval conversational machine reading, can be summarized as follows:

• We propose a new Open-Retrieval setting of conversational machine reading on natural language rules, and create a dataset OR-ShARC for this setting.

• We develop a new model MUDERN for open-retrieval conversational machine reading that can jointly read and reason over multiple rule texts for asking and answering conversational questions.

• We conduct extensive experiments to show that MUDERN outperforms existing single-passage models as well as our strong RoBERTa-based multi-passage baseline on the OR-ShARC dataset.
2 Related Work

2.1 Conversational Machine Reading

Machine reading comprehension is a long-standing task in natural language processing and has received a resurgence of interest recently (Liu et al., 2020; Chen and Li, 2020; Wu and Xu, 2020; Wang et al., 2020). Different from dialog question answering (Choi et al., 2018; Reddy et al., 2019; Gao et al., 2019b), which aims to answer factoid questions for information gathering in a conversation, conversational machine reading on natural language rules (Saeidi et al., 2018) requires machines to interpret a set of complex requirements in the rule text, ask clarification questions to fill the information gap, and make a final decision, as the answer does not literally appear in the rule text. This major difference motivates a series of work focusing on how to improve the joint understanding of the dialog and rule text instead of matching the conversational question to the text, as in conversational question answering. Initial methods simply use pretrained models like BERT (Devlin et al., 2019) for answer classification without considering the satisfaction state of each rule listed in the rule text (Verma et al., 2020; Lawrence et al., 2019). Zhong and Zettlemoyer (2019) propose to first identify all rule text spans and infer the satisfaction of each rule span according to its textual overlap with the dialog history for decision making. Recent works (Gao et al., 2020a,b) focus on modeling the entailment-oriented reasoning process in conversational machine reading for decision making. They parse the rule text into several conditions, and explicitly track the entailment state (satisfied or not) of each condition for the final decision making. Different from all previous work, we consider a new open-retrieval setting where the ground truth rule text is not provided. Consequently, our MUDERN retrieves several rule texts, and our work focuses on multi-passage conversational machine reading, which has never been studied.

2.2 Open Domain Question Answering

Open-domain question answering is a long-standing problem in NLP (Voorhees et al., 1999; Moldovan et al., 2000; Brill et al., 2002; Ferrucci et al., 2010). Different from reading comprehension (Rajpurkar et al., 2016), supporting passages (evidence) are not provided as inputs in open-domain QA. Instead, all questions are answered using a collection of documents such as Wikipedia pages. Most recent open-domain QA systems focus on answering factoid questions through a two-stage retriever-reader approach (Chen et al., 2017; Wang et al., 2019; Gao et al., 2020c): a retriever module retrieves dozens of question-relevant passages from a large collection of passages using BM25 (Lee et al., 2019) or dense representations (Karpukhin et al., 2020), then a reader module finds the answer from the retrieved passages. Qu et al. (2020) first introduce an open-domain setting for answering factoid conversational questions from QuAC (Choi et al., 2018). They propose ORConvQA, which first retrieves evidence from a large collection of passages, reformulates the question according to the dialog history, and extracts the most probable span from the retrieved passages as the answer. However, there are two major differences between ORConvQA and our work: 1) they focus on answering open-domain factoid conversational questions, but our work addresses the challenges in answering high-level questions in conversational machine reading; 2) their proposed model ORConvQA follows the recent reranking-and-extraction approach in open-domain factoid question answering (Wang et al., 2019), while our MUDERN aims to answer high-level questions in an entailment-driven manner.

3 Task Definition

3.1 OR-ShARC Setup

Figure 1 depicts the OR-ShARC task. The user describes the scenario s and posts the initial question q. The system needs to search in its knowledge base (a large collection of rule texts R) and retrieve relevant rule texts r_1, r_2, ..., r_m using the user scenario and question. Then the conversation starts. At each dialog turn, the input is the retrieved rule texts r_1, r_2, ..., r_m, a user scenario s, a user question q, and the previous k (≥ 0) turns of dialog history (q_1, a_1), ..., (q_k, a_k) (if any). OR-ShARC has two subtasks: 1) Decision Making: the system needs to make a decision among "Yes", "No", and "Inquire", where "Yes" and "No" directly answer the user's initial question while "Inquire" indicates the system still needs to gather further information from the user. 2) Question Generation: the "Inquire" decision lets the system generate a follow-up question on the underspecified information for clarification.
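To make the per-turn inputs and outputs concrete, the sketch below spells out the setup as plain Python data structures; the field names are our own illustrative choice, not part of any released data format.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class DialogTurnInput:
    question: str                      # initial user question q
    scenario: str                      # user scenario s
    retrieved_rules: List[str]         # r_1, ..., r_m from the retriever
    history: List[Tuple[str, str]] = field(default_factory=list)  # (q_k, a_k) pairs

@dataclass
class DialogTurnOutput:
    decision: str                      # "Yes", "No", or "Inquire"
    followup_question: Optional[str] = None  # only set when decision == "Inquire"
```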
3.2 Evaluation Metrics

For the decision-making subtask, we use the macro- and micro-accuracy of the three classes "Yes", "No", and "Inquire" as evaluation metrics. For the question generation subtask, we compute the BLEU score (Papineni et al., 2002) between the generated questions and the ground truth questions. However, different models may make a different set of "Inquire" decisions and therefore predict a different set of follow-up questions, which makes the results non-comparable. Previous work either evaluates the BLEU score only when both the ground truth decision and the predicted decision are "Inquire" (Saeidi et al., 2018) or asks the model to generate a question when the ground truth decision is "Inquire" (Gao et al., 2020a). We think both ways of evaluation are sub-optimal: the former only evaluates the intersection of "Inquire" decisions between prediction and ground truth without penalizing a model with less accurate "Inquire" predictions, while the latter requires a separate prediction on the oracle question generation task. Here we propose a new metric, F1_BLEU, for question generation. We compute the precision of BLEU for question generation when the predicted decision is "Inquire",

    \mathrm{precision}_{\mathrm{BLEU}} = \frac{\sum_{i=1}^{M} \mathrm{BLEU}(y_i, \hat{y}_i)}{M},    (1)

where y_i is the predicted question, \hat{y}_i is the corresponding ground truth prediction when the model makes an "Inquire" decision (\hat{y}_i is not necessarily a follow-up question), and M is the total number of "Inquire" decisions from the model. The recall of BLEU is computed in a similar way when the ground truth decision is "Inquire",

    \mathrm{recall}_{\mathrm{BLEU}} = \frac{\sum_{i=1}^{N} \mathrm{BLEU}(y_i, \hat{y}_i)}{N},    (2)

where N is the total number of "Inquire" decisions in the ground truth annotation, and the model prediction y_i here is not necessarily a follow-up question. We finally calculate the F1 score by combining the precision and recall,

    \mathrm{F1}_{\mathrm{BLEU}} = \frac{2 \times \mathrm{precision}_{\mathrm{BLEU}} \times \mathrm{recall}_{\mathrm{BLEU}}}{\mathrm{precision}_{\mathrm{BLEU}} + \mathrm{recall}_{\mathrm{BLEU}}}.    (3)
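The following sketch shows one way to compute F1_BLEU from paired model and gold outputs. The (decision, text) pairing and the pluggable `bleu_fn` scorer (e.g., a wrapper around nltk's sentence-level BLEU) are our own assumptions for illustration, not the official evaluation script.

```python
# F1_BLEU sketch (Eqs. 1-3). Each prediction / reference is a (decision, text) pair:
# text is the follow-up question when the decision is "Inquire", otherwise the
# decision string itself.

def f1_bleu(predictions, references, bleu_fn):
    """predictions / references: lists of (decision, text) pairs, aligned by index."""
    # Precision: average BLEU over the turns where the MODEL predicts "Inquire".
    prec = [bleu_fn(p_text, r_text)
            for (p_dec, p_text), (r_dec, r_text) in zip(predictions, references)
            if p_dec == "Inquire"]
    # Recall: average BLEU over the turns where the GROUND TRUTH is "Inquire".
    rec = [bleu_fn(p_text, r_text)
           for (p_dec, p_text), (r_dec, r_text) in zip(predictions, references)
           if r_dec == "Inquire"]
    precision = sum(prec) / len(prec) if prec else 0.0
    recall = sum(rec) / len(rec) if rec else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)  # Eq. (3)
```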
3.3 OR-ShARC Dataset

The ShARC (Saeidi et al., 2018) dataset is designed for conversational machine reading. During the annotation stage, an annotator describes a scenario and asks a high-level question on the given rule text which cannot be answered in a single turn. Instead, another annotator acts as the machine and asks follow-up questions whose answers are necessary to answer the original question. We reuse the user question, scenario, and constructed dialog including follow-up questions and answers for our open-retrieval setting. The difference is that we remove the gold rule text for each sample. Instead, we collect all rule texts used in the ShARC dataset as the knowledge base, which is accessible for all training and evaluation samples.¹ Since the test set of ShARC is not public, we manually split the train, validation, and test sets. For the validation and test sets, around half of the samples ask questions on rule texts used in training (seen), while the remaining samples contain questions on unseen (new) rule texts. We design the seen and unseen splits for the validation and test sets because this mimics the real usage scenario: users may ask questions about rule texts for which we have training data (dialog history, scenario) as well as about completely newly added rule texts. The dataset statistics are shown in Table 1.

    # Rule Texts       651
    # Train Samples    17936
    # Dev Samples      1105
    # Test Samples     2373

    Table 1: The statistics of the OR-ShARC dataset.

Another challenge in our open-retrieval setting is that user questions must be complete and meaningful so that the retrieval system can search for relevant rule texts, which is not the case for the ShARC dataset. In the ShARC dataset, annotators may not state their complete information needs in their initial questions because they have already seen the rule texts. In other words, the information need is jointly defined by the initial question and its corresponding rule text in the ShARC dataset. For example, questions such as "Am I eligible for this benefit?" have different meanings when they are paired with rule texts on different benefits. Since the gold rule text is unknown in our open-retrieval setting, the real information need may not be captured successfully. We therefore rewrite all questions in the ShARC dataset whose information need cannot be solely determined by the initial user question. In the above case, the question "Am I eligible for this benefit?" is reformulated into "Am I eligible for UK-based benefits?" if the rule text is about UK-based benefits. Our collected dataset OR-ShARC is released at https://github.com/Yifan-Gao/open_retrieval_conversational_machine_reading.

¹ Different from the open-domain question answering task, we cannot extract a huge collection of passages directly from websites such as Wikipedia because every rule text contains an inherent logic structure that requires careful extraction. Therefore, we build our collection of passages directly from the ground truth rule texts extracted by the ShARC annotators.
[Figure 2 diagram: the user scenario and user question are fed to the Retriever, which returns a set of passages; the Reader then takes these passages together with the user scenario, user question, and dialog history (if any), and produces a response (Yes | No | follow-up question).]

Figure 2: The overall diagram of our proposed MUDERN. MUDERN first retrieves several relevant passages (rule texts) using the user-provided scenario and question (Section 4.1). Then, for every dialog turn, taking the retrieved rule texts, user question, user scenario, and a history of previous follow-up questions and answers (if any) as inputs, MUDERN predicts the answer to the question ("Yes" or "No") or, if needed, generates a follow-up question whose answer is necessary to answer the original question (Section 4.2).

4 Framework

Our framework answers the user question through a retriever-reader process shown in Figure 2:

1. Taking the concatenation of the user question and scenario as the query, the TF-IDF retriever gets the m most relevant rule texts for the reader.

2. Our reader MUDERN initiates a turn-by-turn conversation with the user. At each dialog turn, taking the retrieved rule texts and the user-provided information including the user question, scenario, and dialog history as inputs, MUDERN predicts "Yes" or "No" as the answer or asks a follow-up question for clarification.

4.1 TF-IDF Retriever

The passage retriever module narrows the search space so that the reader module can focus on reading possibly relevant passages. We concatenate the user question and user scenario as the query, and use a simple inverted index lookup followed by term vector model scoring to retrieve relevant rule texts. Rule texts and queries are compared as TF-IDF weighted bag-of-words vectors. We further take local word order into account with bigram features, which have been proven effective in (Chen et al., 2017). We retrieve 20 rule texts for each query, which are then used by the reader module.
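A minimal sketch of such a retriever is shown below, using scikit-learn's TfidfVectorizer with unigram+bigram features in place of a hand-rolled inverted index; the exact weighting scheme of the original implementation may differ.

```python
# TF-IDF retriever sketch: rule texts and the (question + scenario) query are compared
# as TF-IDF weighted bag-of-words / bag-of-bigrams vectors via a dot product.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

def build_retriever(rule_texts):
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))   # unigram + bigram features
    rule_matrix = vectorizer.fit_transform(rule_texts) # one row per rule text

    def retrieve(question, scenario, top_k=20):
        # The query is the concatenation of the user question and scenario.
        query_vec = vectorizer.transform([question + " " + scenario])
        scores = linear_kernel(query_vec, rule_matrix).ravel()
        top = scores.argsort()[::-1][:top_k]
        return [(int(i), float(scores[i])) for i in top]  # (rule index, score) pairs

    return retrieve
```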
4.2 Reader: MUDERN

We propose MUDERN: a Multi-passage Discourse-aware Entailment Reasoning Network as the reader, which reads and answers questions via a three-step approach. First, MUDERN uses discourse segmentation to parse each retrieved rule text into individual conditions (Section 4.2.1). Second, MUDERN reads the parsed conditions and the user-provided information, including the user question, scenario, and dialog history, and makes a decision among "Yes", "No", and "Inquire" (Section 4.2.2). Finally, if the decision is "Inquire", MUDERN generates a follow-up question to clarify the underspecified condition in the retrieved rule texts (Section 4.2.3).

4.2.1 Discourse-aware Rule Segmentation

One challenge in understanding rule texts is that they have complex logical structures. For example, the rule text "If a worker has taken more leave than they're entitled to, their employer must not take money from their final pay unless it's been agreed beforehand in writing." actually contains two conditions ("If a worker has taken more leave than they're entitled to", "unless it's been agreed beforehand in writing") and one outcome ("their employer must not take money from their final pay"). Without successful identification of these conditions, the reader cannot assess the satisfaction state of each condition and reason out the correct decision. Here we use discourse segmentation from discourse parsing to extract these conditions. In discourse parsing (Mann and Thompson, 1988), texts are first split into elementary discourse units (EDUs). Each EDU is a clause-like span of the text. We utilize an off-the-shelf discourse segmenter (Li et al., 2018) for rule segmentation, which is a pre-trained pointer model that achieves near human-level accuracy (92.2% F-score) on the standard benchmark test set.
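For concreteness, the expected segmentation of the example rule above looks like the following; in practice these EDUs come from the pre-trained segmenter of Li et al. (2018), and the list literal below merely spells out the three clause-like units named in the text.

```python
# Illustrative only: the EDU segmentation of the example rule text.
rule = ("If a worker has taken more leave than they're entitled to, "
        "their employer must not take money from their final pay "
        "unless it's been agreed beforehand in writing.")
edus = [
    "If a worker has taken more leave than they're entitled to,",  # condition 1
    "their employer must not take money from their final pay",     # outcome
    "unless it's been agreed beforehand in writing.",               # condition 2
]
```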
4.2.2 Entailment-driven Decision Making

[Figure 3 diagram: the user question, user scenario, dialog history QA pairs, and the EDUs of each retrieved rule text are each wrapped as "[CLS] ... [SEP]" segments, encoded by RoBERTa, and passed through inter-sentence transformer layers; a decision classifier predicts Yes / No / Inquire, and an entailment classifier predicts E / C / N for every EDU.]

Figure 3: The decision making module of our proposed MUDERN (Section 4.2.2). The module makes a decision among "Yes", "No", and "Inquire", as well as predicting the entailment state of every elementary discourse unit (EDU) among "Entailment" (E), "Contradiction" (C), and "Neutral" (N).

Encoding  As shown in Figure 3, we concatenate the user question, user scenario, a history of previous follow-up questions and answers, and the segmented EDUs from the retrieved rule texts into a sequence, and use RoBERTa (Liu et al., 2019) for encoding. Additionally, we insert a [CLS] token at the start and append a [SEP] token at the end of every individual sub-sequence to get sentence-level representations. Each [CLS] represents the sequence that follows it. Finally, we receive the representations of the user question u_Q, the user scenario u_S, the H turns of follow-up questions and answers u_1, ..., u_H, and the segmented conditions (EDUs) from the m retrieved rule texts e^1_1, e^1_2, ..., e^1_N; ...; e^m_1, e^m_2, ..., e^m_N. For each sample in the dataset, we append as many complete rule texts as possible under the limit of the maximum input sequence length of RoBERTa (denoted as m̃ rule texts). In training, if the ground truth rule text does not appear in the top-m̃ retrieved rule texts, we manually insert the gold rule text right after the dialog history in the input sequence. All these vectorized representations are of d dimensions (768 for RoBERTa-base).
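A rough sketch of this encoding step is given below: each sub-sequence is wrapped with the tokenizer's [CLS]-style and [SEP]-style special tokens, the whole input is encoded once, and the hidden state at each segment-initial position is taken as that segment's sentence-level vector. The flat `segments` layout and the simple truncation rule are illustrative assumptions rather than the exact released implementation.

```python
# Encoding sketch: build "[CLS] seg [SEP]" segments, run RoBERTa once, and gather the
# hidden state at every segment-initial token as that segment's representation.
import torch
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
encoder = RobertaModel.from_pretrained("roberta-base")

def encode_segments(segments, max_len=512):
    """segments: [question, scenario, qa_1, ..., qa_H, edu_1_1, ..., edu_m_N]."""
    input_ids, cls_positions = [], []
    for seg in segments:
        ids = tokenizer.encode(seg, add_special_tokens=False)
        if len(input_ids) + len(ids) + 2 > max_len:   # stop before overflowing
            break
        cls_positions.append(len(input_ids))          # index of this segment's [CLS]
        input_ids += [tokenizer.cls_token_id] + ids + [tokenizer.sep_token_id]
    with torch.no_grad():
        hidden = encoder(torch.tensor([input_ids])).last_hidden_state[0]  # (seq_len, 768)
    return hidden[cls_positions]                      # one 768-d vector per segment
```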
We further utilize an inter-sentence transformer encoder (Vaswani et al., 2017) to learn sentence-level interactions. Taking all sentence-level representations as inputs, the L-layer transformer encoder makes each condition attend to all the user-provided information as well as the other conditions. The representations encoded by the inter-sentence transformer are denoted by adding a tilde over the original representations, i.e., u_Q becomes ũ_Q.

Entailment Prediction  One key step for correct decision making is the successful determination of the satisfaction state of every condition in the rule texts. Here we propose to formulate this as a multi-sentence entailment task: given a sequence of conditions (EDUs) across multiple retrieved rule texts and a sequence of user-provided information, a system needs to predict ENTAILMENT, CONTRADICTION, or NEUTRAL for each condition. The condition and the user information correspond to the premise and hypothesis in the textual entailment task. NEUTRAL indicates that the condition has not been mentioned in the user information, which is extremely useful in the multi-passage setting because all conditions in the irrelevant rule texts should have this state. Formally, the i-th condition in the j-th retrieved rule text ẽ^j_i goes through a linear layer to receive its entailment state:

    c^j_i = W_c \tilde{e}^j_i + b_c \in \mathbb{R}^3.    (4)

As there are no gold entailment labels for individual conditions in the ShARC dataset, we heuristically generate noisy supervision signals similar to Gao et al. (2020a) as follows. Because each follow-up question-answer pair in the dialog history asks about an individual condition in the rule text, we match it to the segmented condition (EDU) in the rule text which has the minimum edit distance. For conditions in the rule text which are matched by a follow-up question-answer pair, we label the entailment state of the condition as ENTAILMENT if the answer to its matched follow-up question is Yes, and as CONTRADICTION if the answer is No. The remaining conditions not matched by any follow-up question are labeled as NEUTRAL. Let r indicate the correct entailment state. The entailment prediction is supervised by the following cross-entropy loss using our generated noisy labels, normalized by the total number K of conditions in a batch:

    \mathcal{L}_{\mathrm{entail}} = -\frac{1}{K} \sum_{(i,j)} \log \mathrm{softmax}(c^j_i)_r.    (5)
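The heuristic labelling can be sketched as follows; difflib's similarity ratio stands in for a true edit-distance routine, and the flat EDU list is an illustrative simplification.

```python
# Noisy entailment labels: match each follow-up QA pair to the closest EDU and
# label it ENTAILMENT / CONTRADICTION from the Yes/No answer; unmatched EDUs
# remain NEUTRAL.
import difflib

ENTAILMENT, CONTRADICTION, NEUTRAL = 0, 1, 2

def noisy_entailment_labels(edus, followup_qas):
    """edus: flat list of EDU strings from all retrieved rule texts.
    followup_qas: list of (question, answer) pairs with answer in {"Yes", "No"}."""
    labels = [NEUTRAL] * len(edus)
    for question, answer in followup_qas:
        # closest EDU ~ minimum edit distance (approximated by a similarity ratio here)
        best = max(range(len(edus)),
                   key=lambda i: difflib.SequenceMatcher(None, question, edus[i]).ratio())
        labels[best] = ENTAILMENT if answer == "Yes" else CONTRADICTION
    return labels
```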
Decision Making  Given the encoded sentence-level user information ũ_Q, ũ_S, ũ_1, ..., ũ_H and rule conditions ẽ^1_1, ẽ^1_2, ..., ẽ^1_N; ...; ẽ^m̃_1, ẽ^m̃_2, ..., ẽ^m̃_N, the decision making module predicts "Yes", "No", or "Inquire" for the current dialog turn. Here we do not distinguish between the sequences from the user information and the rule texts, and denote all of them as h_1, h_2, ..., h_n. We start by computing a summary g of the rule texts and user information using self-attention:

    \alpha_i = w_\alpha^\top h_i + b_\alpha \in \mathbb{R}^1,    (6)
    \tilde{\alpha}_i = \mathrm{softmax}(\alpha)_i \in [0, 1],    (7)
    g = \sum_i \tilde{\alpha}_i h_i \in \mathbb{R}^d,    (8)
    z = W_z g + b_z \in \mathbb{R}^3,    (9)

where \alpha_i is the self-attention weight. z \in \mathbb{R}^3 contains the predicted scores for the three possible decisions "Yes", "No", and "Inquire". Let l indicate the correct decision; z is supervised by the following cross-entropy loss:

    \mathcal{L}_{\mathrm{dec}} = -\log \mathrm{softmax}(z)_l.    (10)

The overall loss for the entailment-driven decision making module is the weighted sum of the decision loss and the entailment prediction loss:

    \mathcal{L}_{\mathrm{dm}} = \mathcal{L}_{\mathrm{dec}} + \lambda \mathcal{L}_{\mathrm{entail}},    (11)

where \lambda is a hyperparameter tuned on the validation set.
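A compact sketch of this decision head (Eqs. 4-11, applied on top of the inter-sentence representations) might look as follows; dimensions follow the text (d = 768), while the module and argument names are our own illustrative assumptions.

```python
# Entailment-driven decision head: a per-EDU entailment classifier plus a
# self-attention pooled 3-way decision classifier, trained with the weighted sum
# of the two cross-entropy losses.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecisionHead(nn.Module):
    def __init__(self, d=768, num_decisions=3, num_entail=3):
        super().__init__()
        self.entail = nn.Linear(d, num_entail)     # Eq. (4): E / C / N per EDU
        self.attn = nn.Linear(d, 1)                # Eq. (6): self-attention scores
        self.decide = nn.Linear(d, num_decisions)  # Eq. (9): Yes / No / Inquire

    def forward(self, h, edu_mask):
        """h: (n, d) sentence-level vectors; edu_mask: (n,) bool, True at EDU positions."""
        entail_logits = self.entail(h[edu_mask])             # (num_edus, 3)
        alpha = F.softmax(self.attn(h).squeeze(-1), dim=0)   # Eq. (7)
        g = (alpha.unsqueeze(-1) * h).sum(dim=0)             # Eq. (8)
        decision_logits = self.decide(g)                     # Eq. (9), shape (3,)
        return decision_logits, entail_logits

def loss_fn(decision_logits, decision_label, entail_logits, entail_labels, lam=8.0):
    """decision_label: scalar LongTensor; entail_labels: (num_edus,) LongTensor."""
    l_dec = F.cross_entropy(decision_logits.unsqueeze(0), decision_label.view(1))  # Eq. (10)
    l_entail = F.cross_entropy(entail_logits, entail_labels)                       # Eq. (5)
    return l_dec + lam * l_entail                                                  # Eq. (11)
```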
4.2.3 Follow-up Question Generation

If the predicted decision is "Inquire", a follow-up question is required, as its answer is necessary to answer the original question. Recent approaches to question generation usually identify the to-be-asked context first and then generate the complete question (Li et al., 2019; Gao et al., 2019a). Similarly, we design an extraction-and-rewrite approach which first extracts an underspecified span within the rule texts and then rewrites it into a well-formed question.

Span Extraction  Similar to the encoding step in the decision-making module (Section 4.2.2), we concatenate the user information and rule texts into a sequence and use RoBERTa for encoding. The only difference here is that we do not further segment the sentences in the rule texts into EDUs, because we find that the follow-up question sometimes contains tokens across multiple EDUs. Let [t_{1,1}, ..., t_{1,s_1}; t_{2,1}, ..., t_{2,s_2}; ...; t_{N,1}, ..., t_{N,s_N}] be the encoded vectors for the tokens from the N rule sentences in the retrieved rule texts. We follow the BERTQA approach (Devlin et al., 2019) to learn a start vector w_s \in \mathbb{R}^d and an end vector w_e \in \mathbb{R}^d to locate the start and end positions, under the restriction that the start and end positions must belong to the same rule sentence:

    \mathrm{Span} = \arg\max_{i,j,k} \left( w_s^\top t_{k,i} + w_e^\top t_{k,j} \right),    (12)

where i, j denote the start and end positions of the selected span, and k is the sentence to which the span belongs. Similar to the training of the decision-making module, we include the ground truth rule text if it does not appear in the top-m̃ retrieved texts. The training objective is the log-likelihood of the gold start and end positions. We create the noisy labels of the spans' starts and ends by selecting the span which has the minimum edit distance to the to-be-asked follow-up question.
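The constrained span selection of Eq. (12) can be sketched as a brute-force search over sentence-internal spans; the input layout and the span-length cap below are illustrative assumptions, not the exact released implementation.

```python
# Span selection sketch (Eq. 12): score every (start, end) pair with the learned
# start/end vectors, keeping only spans whose start and end share a rule sentence.
import torch

def select_span(token_vecs, sentence_ids, w_s, w_e, max_span_len=30):
    """token_vecs: (T, d) RoBERTa token vectors; sentence_ids: length-T list mapping
    each token to its rule sentence; w_s, w_e: (d,) start / end vectors."""
    start_scores = (token_vecs @ w_s).tolist()   # w_s^T t_{k,i} for every token
    end_scores = (token_vecs @ w_e).tolist()     # w_e^T t_{k,j} for every token
    best_span, best_score = (0, 0), float("-inf")
    for i in range(len(start_scores)):
        for j in range(i, min(len(end_scores), i + max_span_len)):
            if sentence_ids[j] != sentence_ids[i]:   # must stay within one sentence
                break
            score = start_scores[i] + end_scores[j]
            if score > best_score:
                best_span, best_score = (i, j), score
    return best_span   # token indices of the extracted span
```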
BART-based Question Rewriting  Given the to-be-asked context (the extracted span), we find the rule text it belongs to, and concatenate the extracted span with the rule text into one sequence as "[CLS] extracted-span [SEP] rule-text [SEP]". Then we use BART (Lewis et al., 2020), a pre-trained seq2seq language model, to rewrite the sequence into a follow-up question. During training, the BART model takes the ground truth rule text and our created noisy spans as inputs, and the target is the corresponding follow-up question.
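A minimal sketch of the rewriting step with the Hugging Face BART implementation is shown below; whether the literal "[CLS]/[SEP]" markers or BART's own special tokens are used is a detail we gloss over by joining the span and rule text with the tokenizer's separator token. The checkpoint name and decoding settings are assumptions for illustration.

```python
# BART-based question rewriting sketch: pair the extracted span with its source rule
# text and decode a follow-up question.
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

def rewrite_to_question(extracted_span, rule_text):
    source = f"{extracted_span} {tokenizer.sep_token} {rule_text}"
    inputs = tokenizer(source, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, num_beams=4, max_length=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```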
5 Experiments

5.1 Implementation Details

For the decision making module (Section 4.2.2), we finetune the RoBERTa-base model (Wolf et al., 2020) with the Adam (Kingma and Ba, 2015) optimizer for 10 epochs with a learning rate of 3e-5 and a batch size of 32. We treat the number of inter-sentence transformer layers L and the loss weight λ for entailment prediction as hyperparameters, tune them on the validation set through grid search, and find the best combination to be L = 4, λ = 8. For the span extraction module (Section 4.2.3), we finetune the RoBERTa-base model with the Adam optimizer for 10 epochs with a learning rate of 5e-5 and a batch size of 32. For the question rewriting module (Section 4.2.3), we finetune the BART-base model (Wolf et al., 2020) with the Adam optimizer for 30 epochs with a learning rate of 1e-4 and a batch size of 32. We repeat all experiments 5 times with different random seeds (27, 95, 19, 87, 11) on the validation set and report the average results along with their standard deviations. Training takes 1∼2 hours on a 40-core server with 4 Nvidia Titan V GPUs. Code and models are released at https://github.com/Yifan-Gao/open_retrieval_conversational_machine_reading.

    Model                               Macro Acc.             Micro Acc.             F1_BLEU1               F1_BLEU4
                                        dev        test        dev        test        dev        test        dev        test
    E3 (Zhong and Zettlemoyer, 2019)    62.3±0.9   61.7±1.9    61.8±0.9   61.4±2.2    29.0±1.2   31.7±0.8    18.1±1.0   22.2±1.1
    EMT (Gao et al., 2020a)             66.5±1.5   64.8±0.4    65.6±1.6   64.3±0.5    36.8±1.1   38.5±0.5    32.9±1.1   30.6±0.4
    DISCERN (Gao et al., 2020b)         66.7±1.8   67.1±1.2    66.0±1.6   66.7±1.1    36.3±1.9   36.7±1.4    28.4±2.1   28.6±1.2
    MP-RoBERTa                          73.1±1.6   70.1±1.4    73.0±1.7   70.4±1.5    45.9±1.1   40.1±1.6    40.0±0.9   34.3±1.5
    MUDERN                              78.8±0.6   75.3±0.9    78.4±0.5   75.2±1.0    49.9±0.8   47.1±1.7    42.7±0.8   40.4±1.8

    Table 2: Decision making and question generation results on the validation and test set of OR-ShARC. The average results with standard deviations over 5 random seeds are reported.

    TOP K     Train   Dev    Test
    1         59.9    53.8   66.9
    2         71.6    67.4   76.8
    5         83.8    83.4   90.3
    10        90.4    94.0   94.0
    20        94.2    96.6   96.6
    MUDERN    86.2    86.4   90.8

    Table 3: TF-IDF retriever recall at different top K passages on OR-ShARC.

    Model                               Yes    No     Inquire
    E3 (Zhong and Zettlemoyer, 2019)    58.5   61.8   66.5
    EMT (Gao et al., 2020a)             56.9   68.6   74.0
    DISCERN (Gao et al., 2020b)         61.7   61.1   77.3
    MP-RoBERTa                          68.9   80.8   69.5
    MUDERN                              73.9   80.8   81.7

    Table 4: Class-wise decision making accuracy among "Yes", "No" and "Inquire" on the validation set of OR-ShARC.

5.2 Baselines

Since we are the first to investigate this open-retrieval conversational machine reading setting, we replicate several previous works on the ShARC conversational machine reading dataset, including E3 (Zhong and Zettlemoyer, 2019), EMT (Gao et al., 2020a), and DISCERN (Gao et al., 2020b), and adapt them to our open-retrieval setting: we train them in the same way as conversational machine reading using the gold rule text on the OR-ShARC training set, and use the top-1 retrieved rule text from our TF-IDF retriever for prediction and evaluation. We also propose a multi-passage baseline, MP-RoBERTa, which concatenates all sources of inputs including the user question, scenario, dialog history, and multiple retrieved rule texts into a sequence, inserts a [CLS] token at the start, and makes the decision prediction from that [CLS] token through a linear layer. MP-RoBERTa shares the same TF-IDF retriever (Section 4.1) and question generation module (Section 4.2.3) as our proposed MUDERN.

5.3 Retriever Results

Recall at different numbers of top retrieved rule texts on the OR-ShARC dataset is shown in Table 3. We observe that a simple TF-IDF retriever performs quite well on OR-ShARC. With only 5 passages, over 90% of samples on the test set contain the ground truth rule text. Since the number of input passages for MUDERN is restricted by the maximum sequence length of RoBERTa-base (for example, if the user scenario and dialog history use too many tokens, then the RoBERTa-base model can only take relatively fewer rule texts), we also show the actual recall of the input rule texts for MUDERN in Table 3. On average, the number of input rule texts for MUDERN is equivalent to 5 ∼ 6 retrieved rule texts.

5.4 Reader Results

Table 2 shows the decision making (macro- and micro-accuracy) and question generation (F1_BLEU1, F1_BLEU4) results on OR-ShARC, and we have the following observations:

• Under the open-retrieval setting, there is a large performance gap for single-passage models compared with their performance on the ShARC dataset. For example, DISCERN (Gao et al., 2020b) has 73.2 micro-accuracy on the ShARC dataset but 67.1 on OR-ShARC. This is because the top-1 retrieved rule text may not be the gold text. Therefore, all single-passage baselines (E3, EMT, DISCERN) suffer from retrieval errors.

• Reading multiple rule texts has great benefit in the open-retrieval setting. Without sophisticated model design, MP-RoBERTa outperforms all single-passage baselines with complicated architectures by a large margin.

• With the help of discourse-aware rule segmentation and entailment-oriented reasoning, our proposed MUDERN achieves the best performance across all evaluation metrics. We conduct an ablation study in Section 5.5 to demonstrate the effectiveness of each component.

We further analyze the class-wise decision-making accuracy on the validation and test set of OR-ShARC in Table 4, and find that MUDERN has far better predictions in most prediction classes, especially on the "Inquire" class. This is because MUDERN predicts the decision classes based on the entailment states of all conditions from multiple passages.
    Model                            Macro Acc.             Micro Acc.
                                     dev        test        dev        test
    MUDERN                           78.8±0.6   75.3±0.9    78.4±0.5   75.2±1.0
    w/o Discourse Segmentation       78.1±1.7   74.6±1.0    77.7±1.6   74.4±1.0
    w/o Transformer Layers           77.4±1.0   74.2±0.8    77.1±1.0   74.2±0.8
    w/o Entailment Reasoning         73.3±0.9   71.4±1.5    73.2±0.8   71.6±1.5
    w/ DISCERN (Gao et al., 2020b)   78.5±1.0   74.9±0.6    78.2±0.9   74.8±0.6

    Table 5: Ablation study of MUDERN on the validation and test set of OR-ShARC.

5.5 Ablation Study

We conduct an ablation study to investigate the effectiveness of each component of MUDERN on the decision-making task. Results are shown in Table 5, and our observations are as follows:

• We replace the discourse segmentation with simple sentence splitting to examine the necessity of using discourse. Results drop by 0.7% in macro-accuracy and 0.8% in micro-accuracy on the test set ("w/o Discourse Segmentation"). This shows that discourse segmentation does help decision prediction, especially when a rule sentence includes multiple clauses.

• We find the inter-sentence transformer layers do help the decision prediction process, as there is a 1.1% drop in test set macro-accuracy if we remove them ("w/o Transformer Layers"). This tells us that the token-level multi-head self-attention in the RoBERTa-base model and the sentence-level self-attention in our added inter-sentence transformer layers are complementary to each other.

• The model performs much worse without the entailment supervision ("w/o Entailment Reasoning"), leading to a 3.9% drop in test set macro-accuracy. This implies that entailment reasoning is essential for conversational machine reading, which is the key difference between OR-ShARC and open-retrieval conversational question answering (Qu et al., 2020).

• We investigate the DISCERN-like (Gao et al., 2020b) architecture in the multi-passage setting ("w/ DISCERN"). We observe that their proposed entailment vector, and making the decision based on the entailment predictions, become less effective in the multi-passage scenario. Our MUDERN has a simpler decision-making module but still achieves slightly better results.

5.6 Analysis on Seen or Unseen Rule Texts

To investigate the model performance on new, unseen rule texts, we conduct a detailed analysis on the "seen" and "unseen" splits of the validation and test sets. As mentioned in Section 3.3, "seen" denotes the subset of samples whose relevant rule text is used in the training phase, while "unseen" denotes the subset of samples with completely new rule texts. This analysis aims to test the real scenario in which some newly added rule texts do not have any associated user training data (user initial question, user scenario, and dialog history). Results are shown in Table 6. We find that MUDERN performs extremely well on the "seen" subset, with almost a 10% improvement over the overall evaluation set. However, the challenge lies in the "unseen" rule texts: our MUDERN performs almost 10% worse than its overall performance, which suggests this direction requires further investigation.
    Dataset           Macro Acc.             Micro Acc.             F1_BLEU1               F1_BLEU4
                      dev        test        dev        test        dev        test        dev        test
    Full Dataset      78.8±0.6   75.3±0.9    78.4±0.5   75.2±1.0    49.9±0.8   47.1±1.7    42.7±0.8   40.4±1.8
    Subset (seen)     89.9±0.7   88.1±0.5    90.0±0.8   88.1±0.4    56.2±1.8   62.6±1.5    49.6±1.7   57.8±1.8
    Subset (unseen)   69.4±0.4   66.4±1.6    68.6±0.5   66.0±1.8    32.3±0.7   33.1±2.0    22.0±1.1   24.3±2.1

    Table 6: Results on the seen split and unseen split of OR-ShARC.

6 Conclusion

In this paper, we propose a new open-retrieval setting of conversational machine reading (CMR) on natural language rules and create the first dataset, OR-ShARC, for this new setting. Furthermore, we propose a new model, MUDERN, for open-retrieval CMR. MUDERN explicitly locates conditions in multiple retrieved rule texts, conducts entailment-oriented reasoning to assess the satisfaction state of each individual condition according to the user-provided information, and makes the decision. Results on OR-ShARC demonstrate that MUDERN outperforms all single-passage state-of-the-art models as well as our strong MP-RoBERTa baseline. We further conduct in-depth analyses to provide new insights into this new setting and our model.
References

Eric Brill, Susan Dumais, and Michele Banko. 2002. An analysis of the AskMSR question-answering system. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002), pages 257–264. Association for Computational Linguistics.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879, Vancouver, Canada. Association for Computational Linguistics.

Yongrui Chen and Huiying Li. 2020. DAM: Transformer-based relation detection for question answering over knowledge base. Knowledge-Based Systems, 201-202:106077.

Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. 2018. QuAC: Question answering in context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2174–2184, Brussels, Belgium. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

D. Ferrucci, E. W. Brown, Jennifer Chu-Carroll, J. Fan, David Gondek, A. Kalyanpur, Adam Lally, J. William Murdock, Eric Nyberg, J. M. Prager, Nico Schlaefer, and Chris Welty. 2010. Building Watson: An overview of the DeepQA project. AI Mag., 31:59–79.

Yifan Gao, Lidong Bing, Wang Chen, Michael Lyu, and Irwin King. 2019a. Difficulty controllable generation of reading comprehension questions. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 4968–4974. International Joint Conferences on Artificial Intelligence Organization.

Yifan Gao, Piji Li, Irwin King, and Michael R. Lyu. 2019b. Interconnected question generation with coreference alignment and conversation flow modeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4853–4862, Florence, Italy. Association for Computational Linguistics.

Yifan Gao, Chien-Sheng Wu, Shafiq Joty, Caiming Xiong, Richard Socher, Irwin King, Michael Lyu, and Steven C.H. Hoi. 2020a. Explicit memory tracker with coarse-to-fine reasoning for conversational machine reading. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 935–945, Online. Association for Computational Linguistics.

Yifan Gao, Chien-Sheng Wu, Jingjing Li, Shafiq Joty, Steven C.H. Hoi, Caiming Xiong, Irwin King, and Michael Lyu. 2020b. Discern: Discourse-aware entailment reasoning network for conversational machine reading. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2439–2449, Online. Association for Computational Linguistics.

Yifan Gao, Henghui Zhu, Patrick Ng, Cicero Nogueira dos Santos, Zhongyu Wang, Feng Nan, Dejiao Zhang, Ramesh Nallapati, Andrew O. Arnold, and B. Xiang. 2020c. Answering ambiguous questions through generative evidence fusion and round-trip prediction. ArXiv, abs/2011.13137.

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online. Association for Computational Linguistics.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Carolin Lawrence, Bhushan Kotnis, and Mathias Niepert. 2019. Attending to future tokens for bidirectional sequence generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1–10, Hong Kong, China. Association for Computational Linguistics.

Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2019. Latent retrieval for weakly supervised open domain question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6086–6096, Florence, Italy. Association for Computational Linguistics.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.

Jing Li, Aixin Sun, and Shafiq R. Joty. 2018. SegBot: A generic neural text segmentation model with pointer network. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden, pages 4166–4172. ijcai.org.

Jingjing Li, Yifan Gao, Lidong Bing, Irwin King, and Michael R. Lyu. 2019. Improving question generation with to the point context. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3216–3226, Hong Kong, China. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Yun Liu, Xiaoming Zhang, Feiran Huang, Zhibo Zhou, Zhonghua Zhao, and Zhoujun Li. 2020. Visual question answering via combining inferential attention and semantic space mapping. Knowledge-Based Systems, 207:106339.

William C Mann and Sandra A Thompson. 1988. Rhetorical structure theory: Toward a functional theory of text organization. Text - Interdisciplinary Journal for the Study of Discourse, 8(3):243–281.

Dan Moldovan, Sanda Harabagiu, Marius Pasca, Rada Mihalcea, Roxana Girju, Richard Goodrum, and Vasile Rus. 2000. The structure and performance of an open-domain question answering system. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, pages 563–570, Hong Kong. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Chen Qu, Liu Yang, Cen Chen, Minghui Qiu, W. Bruce Croft, and Mohit Iyyer. 2020. Open-retrieval conversational question answering. In SIGIR, pages 539–548.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.

Siva Reddy, Danqi Chen, and Christopher D. Manning. 2019. CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249–266.

Marzieh Saeidi, Max Bartolo, Patrick Lewis, Sameer Singh, Tim Rocktäschel, Mike Sheldon, Guillaume Bouchard, and Sebastian Riedel. 2018. Interpretation of natural language rules in conversational machine reading. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2087–2097, Brussels, Belgium. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.

Nikhil Verma, Abhishek Sharma, Dhiraj Madan, Danish Contractor, Harshit Kumar, and Sachindra Joshi. 2020. Neural conversational QA: Learning to reason vs exploiting patterns. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7263–7269, Online. Association for Computational Linguistics.

Ellen M Voorhees et al. 1999. The TREC-8 question answering track report. In TREC, volume 99, pages 77–82. Citeseer.

Jiuniu Wang, Wenjia Xu, Xingyu Fu, Yang Wei, Li Jin, Ziyan Chen, Guangluan Xu, and Yirong Wu. 2020. SRQA: Synthetic reader for factoid question answering. Knowledge-Based Systems, 193:105415.

Zhiguo Wang, Yue Zhang, Mo Yu, Wei Zhang, Lin Pan, Linfeng Song, Kun Xu, and Yousef El-Kurdi. 2019. Multi-granular text encoding for self-explaining categorization. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 41–45, Florence, Italy. Association for Computational Linguistics.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Zhijing Wu and Hua Xu. 2020. Improving the robustness of machine reading comprehension model with hierarchical knowledge and auxiliary unanswerability prediction. Knowledge-Based Systems, 203:106075.

Victor Zhong and Luke Zettlemoyer. 2019. E3: Entailment-driven extracting and editing for conversational machine reading. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2310–2320, Florence, Italy. Association for Computational Linguistics.