DAGN: Discourse-Aware Graph Network for Logical Reasoning
Yinya Huang1∗ Meng Fang2 Yu Cao3 Liwei Wang4 Xiaodan Liang1†
1 Shenzhen Campus of Sun Yat-sen University  2 Tencent Robotics X
3 School of Computer Science, The University of Sydney  4 The Chinese University of Hong Kong
yinya.huang@hotmail, mfang@tencent.com,
ycao8647@uni.sydney.edu.au,
lwwang@cse.cuhk.edu.hk, xdliang328@gmail.com
Abstract

Recent QA with logical reasoning questions requires passage-level relations among the sentences. However, current approaches still focus on sentence-level relations interacting among tokens. In this work, we explore aggregating passage-level clues for solving logical reasoning QA by using discourse-based information. We propose a discourse-aware graph network (DAGN) that reasons relying on the discourse structure of the texts. The model encodes discourse information as a graph with elementary discourse units (EDUs) and discourse relations, and learns the discourse-aware features via a graph network for downstream QA tasks. Experiments are conducted on two logical reasoning QA datasets, ReClor and LogiQA, and our proposed DAGN achieves competitive results. The source code is available at https://github.com/Eleanor-H/DAGN.

1 Introduction

A variety of QA datasets have promoted the development of reading comprehension, for instance, SQuAD (Rajpurkar et al., 2016), HotpotQA (Yang et al., 2018), and DROP (Dua et al., 2019). Recently, QA datasets with more complicated reasoning types, i.e., logical reasoning, have also been introduced, such as ReClor (Yu et al., 2020) and LogiQA (Liu et al., 2020). The logical questions are taken from standardized exams such as GMAT and LSAT, and require QA models to read complicated argument passages and identify the logical relationships therein, for example selecting a correct assumption that supports an argument, or finding a claim that weakens an argument in a passage. Such logical reasoning is beyond the capability of most previous QA models, which focus on reasoning with entities or numerical keywords.

A main challenge for QA models is to uncover the logical structures underlying passages, such as identifying claims or hypotheses, or pointing out flaws in arguments. To achieve this, a QA model should first be aware of the logical units, which can be sentences, clauses, or other meaningful text spans, and then identify the logical relationships between those units. However, logical structures are usually hidden and difficult to extract, and most datasets do not provide logical structure annotations.

An intuitive idea for unwrapping such logical information is to use discourse relations. For instance, as a conjunction, "because" indicates a causal relationship, whereas "if" indicates a hypothetical relationship. However, such discourse-based information is seldom considered in logical reasoning tasks: modeling of logical structures is still lacking, while current publicly available methods rely on contextual pre-trained models (Yu et al., 2020). Besides, previous graph-based methods (Ran et al., 2019; Chen et al., 2020a) that construct entity-based graphs are not suitable for logical reasoning tasks because their reasoning units differ.

In this paper, we propose a new approach to solving logical reasoning QA tasks by incorporating discourse-based information. First, we construct discourse structures. We use discourse relations from the Penn Discourse TreeBank 2.0 (PDTB 2.0) (Prasad et al., 2008) as delimiters to split texts into elementary discourse units (EDUs). A logic graph is constructed in which EDUs are nodes and discourse relations are edges. Then, we propose a Discourse-Aware Graph Network (DAGN) for learning high-level discourse features to represent passages. The discourse features are incorporated with the contextual token features from pre-trained language models. With the enhanced features, DAGN predicts answers to logical questions. Our experiments show that DAGN surpasses current publicly available methods on two recent logical reasoning QA datasets, ReClor and LogiQA.

∗ This work was done during Yinya Huang's internship in Tencent with L. Wang and M. Fang.
† Corresponding Author: Xiaodan Liang.
Figure 1: The architecture of our proposed method with an example below. Panel (1) shows the discourse-based logic graph built from the context EDUs (E1–E6) and the answer option EDUs (E7, E8), with explicit connectives such as [while], [since], [because] and punctuation marks as edges. Panel (2) shows graph reasoning over the token embeddings of the large-scale pre-trained model (for context, question, and option) to produce updated EDU embeddings. Panel (3) shows answer prediction via a residual BiGRU, pooling, concatenation, and a feed-forward network.

Context: Theoretically, analog systems are superior to digital systems. A signal in a pure analog system can be infinitely detailed, while digital systems cannot produce signals that are more precise than their digital units. With this theoretical advantage there is a practical disadvantage. Since there is no limit on the potential detail of the signal, the duplication of an analog representation allows tiny variations from the original, which are errors.

Question: The statements above, if true, most strongly support which one of the following?

Option: Digital systems are the best information systems because error cannot occur in the emission of digital signals.
Our main contributions are three-fold:

• We propose to construct logic graphs from texts by using discourse relations as edges and elementary discourse units as nodes.

• We obtain discourse features via graph neural networks to facilitate logical reasoning in QA models.

• We show the effectiveness of using the logic graph and feature enhancement by noticeable improvements on two datasets, ReClor and LogiQA.

2 Method

Our intuition is to explicitly use discourse-based information to mimic the human reasoning process for logical reasoning questions. The questions are in multiple-choice format: given a triplet (context, question, answer options), a model answers the question by selecting the correct answer option. Our framework is shown in Figure 1. We first construct a discourse-based logic graph from the raw text. Then we conduct reasoning via graph networks to learn and update the discourse-based features, which are incorporated with the contextual token embeddings for downstream answer prediction.

2.1 Graph Construction

Our discourse-based logic graph is constructed in two steps: delimiting the text into elementary discourse units (EDUs) and forming the graph using their relations as edges, as illustrated in Figure 1(1).

Discourse Units Delimitation It has been shown that clause-like text spans delimited by discourse relations can serve as discourse units that reveal the rhetorical structure of texts (Mann and Thompson, 1988; Prasad et al., 2008). We further observe that such discourse units are essential units in logical reasoning, for example serving as assumptions or opinions. In the example shown in Figure 1, the "while" in the context indicates a comparison between the attributes of the "pure analog system" and those of "digital systems". The "because" in the option provides the evidence "error cannot occur in the emission of digital signals" for the claim "digital systems are the best information systems".

We use PDTB 2.0 (Prasad et al., 2008) to help identify discourse relations. PDTB 2.0 contains discourse relations manually annotated on the 1-million-word Wall Street Journal (WSJ) corpus, broadly characterized as "Explicit" or "Implicit" connectives. The former appear explicitly in sentences, e.g., the discourse adverbial "instead" or the subordinating conjunction "because", whereas the latter are inferred by annotators between successive pairs of text spans split by punctuation marks such as "." or ";". We simply take all the "Explicit" connectives as well as common punctuation marks to form our discourse delimiter library (details are given in Appendix A), with which we delimit the texts into EDUs. For each data sample, we segment the context and the options, ignoring the question since the question usually does not carry logical content.
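To make the delimitation step concrete, the following is a minimal Python sketch of how a text could be split into EDUs with such a delimiter library. It is not the released implementation: the function name, the heavily truncated connective list, and the regex-based matching are our own illustrative choices; the full library used by DAGN is listed in Appendix A.

```python
import re

# A small, illustrative subset of the delimiter library from Appendix A.
EXPLICIT_CONNECTIVES = ["because", "while", "since", "if", "but", "so that", "although"]
PUNCTUATION_MARKS = [".", ",", ";", ":"]

def split_into_edus(text):
    """Split a text into clause-like EDUs, keeping the delimiters.

    Returns a list of (edu_text, following_delimiter) pairs, where the delimiter
    is the connective or punctuation mark that ends the span (None at the end).
    """
    connective_pat = "|".join(r"\b" + re.escape(c) + r"\b" for c in EXPLICIT_CONNECTIVES)
    punct_pat = "|".join(re.escape(p) for p in PUNCTUATION_MARKS)
    delimiter_re = re.compile(f"({connective_pat}|{punct_pat})", flags=re.IGNORECASE)

    pieces = delimiter_re.split(text)
    edus, current = [], ""
    for piece in pieces:
        if delimiter_re.fullmatch(piece):
            if current.strip():
                edus.append((current.strip(), piece.lower()))
            current = ""
        else:
            current += piece
    if current.strip():
        edus.append((current.strip(), None))
    return edus

option = ("Digital systems are the best information systems because "
          "error cannot occur in the emission of digital signals.")
print(split_into_edus(option))
# [('Digital systems are the best information systems', 'because'),
#  ('error cannot occur in the emission of digital signals', '.')]
```

Applied to the option in Figure 1, the sketch recovers exactly the two EDUs discussed above, with "because" as the connective between them.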
Discourse Graph Construction We define the discourse-based graphs with EDUs as nodes and with the "Explicit" connectives as well as the punctuation marks as two types of edges. We assume that each connective or punctuation mark connects the EDUs before and after it. For example, the option sentence in Figure 1 is delimited into two EDUs, EDU7 = "digital systems are the best information systems" and EDU8 = "error cannot occur in the emission of digital signals", by the connective r = "because". The returned triplets are then (EDU7, r, EDU8) and (EDU8, r, EDU7). For each data sample with a context and multiple answer options, we separately construct a graph for each option, over the EDUs of the shared context and of that single option. The graph for option k is denoted by G^k = (V^k, E^k).
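The assembly of one option graph could look like the following sketch. It assumes spans in the (edu_text, trailing_delimiter) form produced by the delimitation sketch above; the names and the exact edge bookkeeping are illustrative rather than taken from the released code.

```python
# Edge types: r_E for explicit connectives, r_I for punctuation marks.
EDGE_EXPLICIT, EDGE_PUNCT = 0, 1
PUNCTUATION = {".", ",", ";", ":"}

def build_option_graph(context_spans, option_spans):
    """context_spans / option_spans: lists of (edu_text, trailing_delimiter) pairs.

    Returns node texts and (src, edge_type, dst) triplets; each delimiter links
    the EDU before it to the EDU after it, in both directions.
    """
    def edges_for(spans, offset):
        triplets = []
        for i, (_, delim) in enumerate(spans[:-1]):
            if delim is None:
                continue
            etype = EDGE_PUNCT if delim in PUNCTUATION else EDGE_EXPLICIT
            triplets += [(offset + i, etype, offset + i + 1),
                         (offset + i + 1, etype, offset + i)]
        return triplets

    nodes = [t for t, _ in context_spans] + [t for t, _ in option_spans]
    # Edges are built within the context and within the option; whether the last
    # context EDU is also linked to the first option EDU is left open here.
    edges = edges_for(context_spans, 0) + edges_for(option_spans, len(context_spans))
    return nodes, edges

option_spans = [("digital systems are the best information systems", "because"),
                ("error cannot occur in the emission of digital signals", ".")]
nodes, edges = build_option_graph([], option_spans)
print(edges)  # [(0, 0, 1), (1, 0, 0)] -> the (EDU7, r, EDU8) and (EDU8, r, EDU7) triplets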
2.2 Discourse-Aware Graph Network

We present the Discourse-Aware Graph Network (DAGN), which uses the constructed graph to exploit discourse-based information for answering logical questions. It consists of three main components: an EDU encoding module, a graph reasoning module, and an answer prediction module. The former two are illustrated in Figure 1(2), and the last in Figure 1(3).
EDU Encoding An EDU span embedding is obtained from its token embeddings in two steps. First, similar to previous works (Yu et al., 2020; Liu et al., 2020), we encode the input sequence "<s> context </s> question || option </s>" into contextual token embeddings with a pre-trained language model, where <s> and </s> are the special tokens of the RoBERTa (Liu et al., 2019) model and || denotes concatenation. Second, given the token embedding sequence {t_1, t_2, ..., t_L}, the n-th EDU embedding is obtained by e_n = Σ_{l ∈ S_n} t_l, where S_n is the set of token indices belonging to the n-th EDU.
Graph Reasoning After EDU encoding, DAGN performs reasoning over the discourse graph. Inspired by previous graph-based models (Ran et al., 2019; Chen et al., 2020a), we also learn graph node representations to obtain higher-level features, but we consider a different graph construction and encoding. Specifically, let G^k = (V^k, E^k) denote the graph corresponding to the k-th answer option. For each node v_i ∈ V^k, the node embedding v_i is initialized with the corresponding EDU embedding e_i. N_i = {j | (v_j, v_i) ∈ E^k} denotes the neighbors of node v_i. W_{r_ji} is the adjacency matrix for one of the two edge types, where r_E indicates graph edges corresponding to explicit connectives and r_I indicates graph edges corresponding to punctuation marks.

The model first calculates a weight α_i for each node with a linear transformation followed by a sigmoid function, α_i = σ(W_α v_i + b_α), then conducts message propagation with these weights:

    ṽ_i = (1 / |N_i|) Σ_{j ∈ N_i} α_j W_{r_ji} v_j,   r_ji ∈ {r_E, r_I},   (1)

where ṽ_i is the message representation of node v_i, and α_j and v_j are the weight and the node embedding of v_j, respectively.

After message propagation, the node representations are updated from the initial node embeddings and the message representations by

    v'_i = ReLU(W_u v_i + ṽ_i + b_u),   (2)

where W_u and b_u are a weight matrix and a bias, respectively. The updated node representations v'_i are then used to enhance the contextual token embeddings via summation at the corresponding positions: t'_l = t_l + v'_n, where l ∈ S_n and S_n is the token index set of the n-th EDU.
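A minimal PyTorch sketch of one propagation step, Eqs. (1)–(2), is given below. It assumes one linear map per edge type and a scalar gate per node, and takes edge triplets in the (src, edge_type, dst) form used earlier; layer names, the handling of isolated nodes, and other details are our own choices and may differ from the released implementation.

```python
import torch
import torch.nn as nn

class DiscourseGraphLayer(nn.Module):
    """One step of the message propagation in Eqs. (1)-(2); an illustrative sketch."""

    def __init__(self, hidden, num_edge_types=2):
        super().__init__()
        self.alpha = nn.Linear(hidden, 1)                                 # alpha_i = sigmoid(W_a v_i + b_a)
        self.rel = nn.ModuleList([nn.Linear(hidden, hidden, bias=False)   # W_r for r in {r_E, r_I}
                                  for _ in range(num_edge_types)])
        self.update = nn.Linear(hidden, hidden)                           # W_u v_i + b_u

    def forward(self, v, edges):
        """v: (num_nodes, hidden) EDU embeddings; edges: list of (src, edge_type, dst) triplets."""
        alpha = torch.sigmoid(self.alpha(v)).squeeze(-1)                  # node weights
        updated = []
        for i in range(v.size(0)):
            incoming = [(j, etype) for j, etype, dst in edges if dst == i]
            if incoming:                                                  # Eq. (1): mean of alpha_j * W_{r_ji} v_j
                msg = torch.stack([alpha[j] * self.rel[etype](v[j])
                                   for j, etype in incoming]).mean(dim=0)
            else:
                msg = torch.zeros_like(v[i])                              # isolated node: no message
            updated.append(torch.relu(self.update(v[i]) + msg))           # Eq. (2)
        return torch.stack(updated)
```

In DAGN this step is applied for a small, tuned number of iterations (2 on ReClor and 3 on LogiQA, per Appendix B), and each updated v'_n is then added back to the token embeddings of its EDU (t'_l = t_l + v'_n).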
Answer Prediction The probabilities of the options are obtained by feeding the discourse-enhanced token embeddings into the answer prediction module. The model is trained end-to-end with a cross-entropy loss. Specifically, the embedding sequence first goes through a layer normalization (Ba et al., 2016), then a bidirectional GRU (Cho et al., 2014). The output embeddings are then added to the input ones as a residual structure (He et al., 2016). We finally obtain the encoded sequence after another layer normalization on the summed embeddings.

We then merge the high-level discourse features and the low-level token features. Specifically, the variable-length encoded context sequence and the question-and-option sequence are each pooled via weighted summation, wherein the weights are the softmax results of a linear transformation of the sequence, resulting in a single feature vector per sequence. We concatenate these with the "<s>" embedding from the backbone pre-trained model and feed the new vector into a two-layer perceptron with a GELU activation (Hendrycks and Gimpel, 2016) to get the output features for classification.
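The prediction module described above could be sketched as follows. The sizes follow Appendix B (hidden size 1024, a 1024 × 3 concatenation before the two-layer perceptron); splitting the BiGRU hidden size across directions so that the residual connection type-checks, and taking the "<s>" vector from position 0 of the input sequence, are our assumptions rather than details given in the paper.

```python
import torch
import torch.nn as nn

class AnswerPredictionHead(nn.Module):
    """Sketch of the answer prediction module (sizes follow Appendix B)."""

    def __init__(self, hidden=1024):
        super().__init__()
        self.norm_in = nn.LayerNorm(hidden)
        # hidden // 2 per direction so the BiGRU output matches the input size (assumption).
        self.gru = nn.GRU(hidden, hidden // 2, bidirectional=True, batch_first=True)
        self.norm_out = nn.LayerNorm(hidden)
        self.pool_score = nn.Linear(hidden, 1)     # softmax weights for the weighted summation
        self.mlp = nn.Sequential(nn.Linear(hidden * 3, hidden), nn.GELU(), nn.Linear(hidden, 1))

    def pool(self, seq, mask):
        scores = self.pool_score(seq).squeeze(-1)                 # (batch, len)
        scores = scores.masked_fill(~mask, float("-inf"))         # masks are assumed non-empty
        weights = torch.softmax(scores, dim=-1).unsqueeze(-1)     # (batch, len, 1)
        return (weights * seq).sum(dim=1)                         # (batch, hidden)

    def forward(self, tokens, context_mask, qa_mask):
        """tokens: (batch, len, hidden) discourse-enhanced token embeddings; the two
        boolean masks select the context span and the question-and-option span."""
        x = self.norm_in(tokens)
        encoded, _ = self.gru(x)
        encoded = self.norm_out(encoded + x)                      # residual + layer norm
        cls = tokens[:, 0]                                        # the "<s>" embedding from the backbone
        ctx = self.pool(encoded, context_mask)
        qa = self.pool(encoded, qa_mask)
        logit = self.mlp(torch.cat([cls, ctx, qa], dim=-1))       # one score per option
        return logit.squeeze(-1)
```

For a question with four options, the head is applied once per (context, question, option) encoding, and the four logits are normalized with a softmax under the cross-entropy loss.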
3 Experiments

We evaluate the performance of DAGN on two logical reasoning datasets, ReClor (Yu et al., 2020) and LogiQA (Liu et al., 2020), and conduct ablation studies on the graph construction and the graph network. Implementation details are given in Appendix B.

3.1 Datasets

ReClor contains 6,138 questions modified from standardized tests such as GMAT and LSAT, split into train / dev / test sets with 4,638 / 500 / 1,000 samples respectively. The training and development sets are available; the test set is blind and held out, and is split into an EASY and a HARD subset according to the performance of a BERT-base model (Devlin et al., 2019). Test results are obtained by submitting test predictions to the leaderboard. LogiQA consists of 8,678 questions collected from the National Civil Servants Examinations of China and manually translated into English by professionals. The dataset is randomly split into train / dev / test sets with 7,376 / 651 / 651 samples respectively. Both datasets contain multiple logical reasoning types.

3.2 Results

The experimental results are shown in Tables 1 and 2. Since there is no public method for both datasets, we compare DAGN with the baseline models. For DAGN, we fine-tune RoBERTa-Large as the backbone. DAGN (Aug) is a variant that augments the graph features.

Methods          Dev    Test   Test-E  Test-H
BERT-Large       53.80  49.80  72.00   32.30
XLNet-Large      62.00  56.00  75.70   40.50
RoBERTa-Large    62.60  55.60  75.50   40.00
DAGN             65.20  58.20  76.14   44.11
DAGN (Aug)       65.80  58.30  75.91   44.46

* The baseline results are taken from the ReClor paper.
* DAGN ranked 1st on the public ReClor leaderboard (https://bit.ly/2UOQfaS) until 17th Nov. 2020, before the NAACL submission. Several better results have since appeared on the leaderboard, but they are not publicly available.

Table 1: Experimental results (accuracy %) of DAGN compared with baseline models on the ReClor dataset. Test-E = Test-EASY, Test-H = Test-HARD.

Methods          Dev    Test
BERT-Large       34.10  31.03
RoBERTa-Large    35.02  35.33
DAGN             35.48  38.71
DAGN (Aug)       36.87  39.32

Table 2: Experimental results (accuracy %) of DAGN compared with baseline models on the LogiQA dataset.

DAGN reaches 58.20% test accuracy on ReClor. DAGN (Aug) reaches 58.30%, with 75.91% on the EASY subset and 44.46% on the HARD subset. Compared with RoBERTa-Large, the improvement on the HARD subset is a remarkable 4.46%. This indicates that the incorporated discourse-based information supplements a shortcoming of the baseline model and that the discourse features are beneficial for such logical reasoning. Besides, DAGN and DAGN (Aug) also outperform the baseline models on LogiQA, in particular showing a 4.01% improvement over RoBERTa-Large on the test set.
3.3 Ablation Study

We conduct ablation studies on the graph construction details as well as the graph reasoning module. The results are reported in Table 3.

Methods                         Dev
DAGN                            65.20
ablation on nodes
DAGN - clause nodes             64.40
DAGN - sentence nodes           64.40
ablation on edges
DAGN - single edge type         64.80
DAGN - fully connected edges    61.60
ablation on graph reasoning
DAGN w/o graph module           64.00

Table 3: Ablation study results (accuracy %) on the ReClor development set.

Varied Graph Nodes We first use clauses or sentences in place of EDUs as graph nodes. For clause nodes, we simply remove the "Explicit" connectives during discourse unit delimitation, so that the texts are delimited by punctuation marks only. For sentence nodes, we further reduce the delimiter library to the period (".") alone. Using the modified graphs with clause nodes or the coarser sentence nodes, the accuracy of DAGN drops to 64.40%. This indicates that clause or sentence nodes carry less discourse information and act poorly as logical reasoning units.

Varied Graph Edges We make two changes to the edges: (1) modifying the edge types, and (2) modifying the edge linking. For the edge types, all edges are regarded as a single type. For the edge linking, we ignore the discourse relations and connect every pair of nodes, turning the graph into a fully-connected one. The resulting accuracies drop to 64.80% and 61.60% respectively. This shows that in the graph we built, edges link EDUs in a reasonable manner that properly reflects the logical relations.

Ablation on Graph Reasoning We remove the graph module from DAGN for comparison. This model only contains an extra prediction module on top of the baseline. Its performance on the ReClor dev set lies between the baseline model and DAGN. Therefore, although the prediction module benefits the accuracy, the lack of graph reasoning leads to the absence of discourse features and degrades the performance. This demonstrates the necessity of the discourse-based structure for logical reasoning.

4 Related Works

Recent datasets for reading comprehension tend to be more complicated and to require reasoning capability from models. For instance, HotpotQA (Yang et al., 2018), WikiHop (Welbl et al., 2018), OpenBookQA (Mihaylov et al., 2018), and MultiRC (Khashabi et al., 2018) require multi-hop reasoning. DROP (Dua et al., 2019) and MA-TACO (Zhou et al., 2019) need numerical reasoning. WIQA (Tandon et al., 2019) and CosmosQA (Huang et al., 2019) require causal reasoning, in which models should understand counterfactual hypotheses or find cause-effect relationships in events. The logical reasoning datasets (Yu et al., 2020; Liu et al., 2020), however, require models to have the logical reasoning capability of uncovering the inner logic of texts.

Deep neural networks are used for reasoning-driven reading comprehension. Evidence-based methods (Madaan et al., 2020; Huang et al., 2020; Rajagopal et al., 2020) generate explainable evidence from a given context as a backup for reasoning. Graph-based methods (Qiu et al., 2019; De Cao et al., 2019; Cao et al., 2019; Ran et al., 2019; Chen et al., 2020b; Xu et al., 2020b; Zhang et al., 2020) explicitly model the reasoning process with constructed graphs, then learn and update features through message passing on the graphs. There are also other methods such as neuro-symbolic models (Saha et al., 2021) and adversarial training (Pereira et al., 2020). Our paper uses a graph-based model; however, to uncover logical relations, the graph nodes and edges are customized with discourse information.

Discourse information provides a high-level understanding of texts and hence is beneficial for many natural language tasks, for instance, text summarization (Cohan et al., 2018; Joty et al., 2019; Xu et al., 2020a; Feng et al., 2020), neural machine translation (Voita et al., 2018), and coherent text generation (Wang et al., 2020; Bosselut et al., 2018). There are also discourse-based applications for reading comprehension. DISCERN (Gao et al., 2020) segments texts into EDUs and learns interactive EDU features. Mihaylov and Frank (2019) provide additional discourse-based annotations and encode them with discourse-aware self-attention models. Unlike previous works, DAGN first uses discourse relations as graph edges connecting EDUs, then learns the discourse features via message passing with graph neural networks.

5 Conclusion

In this paper, we introduce a Discourse-Aware Graph Network (DAGN) to address logical reasoning QA tasks. We treat elementary discourse units (EDUs) split by discourse relations as basic reasoning units, and build discourse-based logic graphs with EDUs as nodes and discourse relations as edges. DAGN then learns the discourse-based features and enhances them with contextual token embeddings. DAGN reaches competitive performance on two recent logical reasoning datasets, ReClor and LogiQA.

Acknowledgements

The authors would like to thank Wenge Liu, Jianheng Tang, Guanlin Li and Wei Wang for their support and useful discussions. This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant No. U19A2073 and No. 61976233, Guangdong Province Basic and Applied Basic Research (Regional Joint Fund-Key) Grant No. 2019B1515120039, Shenzhen Basic Research Project (Project No. JCYJ20190807154211365), Zhijiang Lab's Open Fund (No. 2020AA3AB14) and the CSIG Young Fellow Support Fund.
References

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. stat, 1050:21.

Antoine Bosselut, Asli Celikyilmaz, Xiaodong He, Jianfeng Gao, Po-Sen Huang, and Yejin Choi. 2018. Discourse-aware neural rewards for coherent text generation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 173–184.

Yu Cao, Meng Fang, and Dacheng Tao. 2019. Bag: Bi-directional attention entity graph convolutional network for multi-hop reasoning question answering. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 357–362.

Kunlong Chen, Weidi Xu, Xingyi Cheng, Zou Xiaochuan, Yuyu Zhang, Le Song, Taifeng Wang, Yuan Qi, and Wei Chu. 2020a. Question directed graph attention network for numerical reasoning over text. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6759–6768.

Kunlong Chen, Weidi Xu, Xingyi Cheng, Zou Xiaochuan, Yuyu Zhang, Le Song, Taifeng Wang, Yuan Qi, and Wei Chu. 2020b. Question directed graph attention network for numerical reasoning over text. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6759–6768, Online. Association for Computational Linguistics.

Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111.

Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, and Nazli Goharian. 2018. A discourse-aware attention model for abstractive summarization of long documents. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 615–621, New Orleans, Louisiana. Association for Computational Linguistics.

Nicola De Cao, Wilker Aziz, and Ivan Titov. 2019. Question answering by reasoning across documents with graph convolutional networks. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2306–2317.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2368–2378.

Xiachong Feng, Xiaocheng Feng, Bing Qin, Xinwei Geng, and Ting Liu. 2020. Dialogue discourse-aware graph convolutional networks for abstractive meeting summarization. arXiv preprint arXiv:2012.03502.

Yifan Gao, Chien-Sheng Wu, Jingjing Li, Shafiq Joty, Steven CH Hoi, Caiming Xiong, Irwin King, and Michael Lyu. 2020. Discern: Discourse-aware entailment reasoning network for conversational machine reading. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2439–2449.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778.

Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415.

Lifu Huang, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2019. Cosmos qa: Machine reading comprehension with contextual commonsense reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2391–2401.

Yinya Huang, Meng Fang, Xunlin Zhan, Qingxing Cao, Xiaodan Liang, and Liang Lin. 2020. Remnet: Recursive erasure memory network for commonsense evidence refinement. arXiv preprint arXiv:2012.13185.

Shafiq Joty, Giuseppe Carenini, Raymond Ng, and Gabriel Murray. 2019. Discourse analysis and its applications. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts, pages 12–17, Florence, Italy. Association for Computational Linguistics.
Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. 2018. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 252–262.

Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR (Poster).

Jian Liu, Leyang Cui, Hanmeng Liu, Dandan Huang, Yile Wang, and Yue Zhang. 2020. Logiqa: A challenge dataset for machine reading comprehension with logical reasoning. IJCAI 2020.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.

Aman Madaan, Dheeraj Rajagopal, Yiming Yang, Abhilasha Ravichander, Eduard Hovy, and Shrimai Prabhumoye. 2020. Eigen: Event influence generation using pre-trained language models. arXiv preprint arXiv:2010.11764.

William C Mann and Sandra A Thompson. 1988. Rhetorical structure theory: Toward a functional theory of text organization. Text, 8(3):243–281.

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? a new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391.

Todor Mihaylov and Anette Frank. 2019. Discourse-aware semantic self-attention for narrative reading comprehension. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2541–2552, Hong Kong, China. Association for Computational Linguistics.

Lis Pereira, Xiaodong Liu, Fei Cheng, Masayuki Asahara, and Ichiro Kobayashi. 2020. Adversarial training for commonsense inference. In Proceedings of the 5th Workshop on Representation Learning for NLP, pages 55–60, Online. Association for Computational Linguistics.

Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind K Joshi, and Bonnie L Webber. 2008. The penn discourse treebank 2.0. In LREC. Citeseer.

Lin Qiu, Yunxuan Xiao, Yanru Qu, Hao Zhou, Lei Li, Weinan Zhang, and Yong Yu. 2019. Dynamically fused graph network for multi-hop reasoning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6140–6150, Florence, Italy. Association for Computational Linguistics.

Dheeraj Rajagopal, Niket Tandon, Peter Clark, Bhavana Dalvi, and Eduard Hovy. 2020. What-if i ask you to explain: Explaining the effects of perturbations in procedural text. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 3345–3355.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392.

Qiu Ran, Yankai Lin, Peng Li, Jie Zhou, and Zhiyuan Liu. 2019. Numnet: Machine reading comprehension with numerical reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2474–2484.

Amrita Saha, Shafiq Joty, and Steven CH Hoi. 2021. Weakly supervised neuro-symbolic module networks for numerical reasoning. arXiv preprint arXiv:2101.11802.

Niket Tandon, Bhavana Dalvi, Keisuke Sakaguchi, Peter Clark, and Antoine Bosselut. 2019. Wiqa: A dataset for "what if..." reasoning over procedural text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6078–6087.

Elena Voita, Pavel Serdyukov, Rico Sennrich, and Ivan Titov. 2018. Context-aware neural machine translation learns anaphora resolution. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1264–1274.

Wei Wang, Piji Li, and Hai-Tao Zheng. 2020. Consistency and coherency enhanced story generation. arXiv preprint arXiv:2010.08822.

Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2018. Constructing datasets for multi-hop reading comprehension across documents. Transactions of the Association for Computational Linguistics, 6:287–302.

Jiacheng Xu, Zhe Gan, Yu Cheng, and Jingjing Liu. 2020a. Discourse-aware neural extractive text summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5021–5031, Online. Association for Computational Linguistics.

Yunqiu Xu, Meng Fang, Ling Chen, Yali Du, Joey Tianyi Zhou, and Chengqi Zhang. 2020b. Deep reinforcement learning with stacked hierarchical attention for text-based games. In Advances in Neural Information Processing Systems, volume 33, pages 16495–16507. Curran Associates, Inc.
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. 2018. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380.

Weihao Yu, Zihang Jiang, Yanfei Dong, and Jiashi Feng. 2020. Reclor: A reading comprehension dataset requiring logical reasoning. In ICLR 2020: Eighth International Conference on Learning Representations.

Jipeng Zhang, Lei Wang, Roy Ka-Wei Lee, Yi Bin, Yan Wang, Jie Shao, and Ee-Peng Lim. 2020. Graph-to-tree learning for solving math word problems. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3928–3937.

Ben Zhou, Daniel Khashabi, Qiang Ning, and Dan Roth. 2019. "going on a vacation" takes longer than "going for a walk": A study of temporal commonsense understanding. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3363–3369, Hong Kong, China. Association for Computational Linguistics.

A Discourse Delimiter Library

Our discourse delimiter library consists of two parts: the "Explicit" connectives annotated in the Penn Discourse TreeBank 2.0 (PDTB 2.0) (Prasad et al., 2008), and a set of punctuation marks. The full set of discourse delimiters used in our method is presented in Table 4.

Explicit Connectives:
'once', 'although', 'though', 'but', 'because', 'nevertheless', 'before', 'for example', 'until', 'if', 'previously', 'when', 'and', 'so', 'then', 'while', 'as long as', 'however', 'also', 'after', 'separately', 'still', 'so that', 'or', 'moreover', 'in addition', 'instead', 'on the other hand', 'as', 'for instance', 'nonetheless', 'unless', 'meanwhile', 'yet', 'since', 'rather', 'in fact', 'indeed', 'later', 'ultimately', 'as a result', 'either or', 'therefore', 'in turn', 'thus', 'in particular', 'further', 'afterward', 'next', 'similarly', 'besides', 'if and when', 'nor', 'alternatively', 'whereas', 'overall', 'by comparison', 'till', 'in contrast', 'finally', 'otherwise', 'as if', 'thereby', 'now that', 'before and after', 'additionally', 'meantime', 'by contrast', 'if then', 'likewise', 'in the end', 'regardless', 'thereafter', 'earlier', 'in other words', 'as soon as', 'except', 'in short', 'neither nor', 'furthermore', 'lest', 'as though', 'specifically', 'conversely', 'consequently', 'as well', 'much as', 'plus', 'and', 'hence', 'by then', 'accordingly', 'on the contrary', 'simultaneously', 'for', 'in sum', 'when and if', 'insofar as', 'else', 'as an alternative', 'on the one hand on the other hand'

Punctuation Marks:
'.', ',', ';', ':'
Table 4: The discourse delimiter library in our implementation.
B Implementation Details
We fine-tune RoBERTa-Large (Liu et al., 2019) as the backbone pre-trained language model for DAGN; it contains 24 hidden layers with hidden size 1024. The overall model is trained end-to-end and updated by the Adam (Kingma and Ba, 2015) optimizer with an overall learning rate of 5e-6 and a weight decay of 0.01. The overall dropout rate is 0.1, and the maximum sequence length is 256. We tune the model on the dev set to obtain the best number of graph reasoning iterations, which is 2 for ReClor and 3 for LogiQA. The model is trained for 10 epochs with a batch size of 16 on an Nvidia Tesla V100 GPU.

For the answer prediction module, the hidden size of the GRU is the same as that of the token embeddings in the pre-trained language model, i.e., 1024. The two-layer perceptron first projects the concatenated vectors from a hidden size of 1024 × 3 down to 1024, then projects 1024 to 1.
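For reference, these hyperparameters could be gathered as in the sketch below; the dictionary keys and helper function are illustrative rather than taken from the released configuration files.

```python
import torch

# Hyperparameters from this appendix collected in one place (illustrative only).
CONFIG = {
    "backbone": "roberta-large",                 # 24 layers, hidden size 1024
    "learning_rate": 5e-6,
    "weight_decay": 0.01,
    "dropout": 0.1,
    "max_seq_length": 256,
    "graph_reasoning_steps": {"reclor": 2, "logiqa": 3},
    "epochs": 10,
    "batch_size": 16,
}

def make_optimizer(model, cfg=CONFIG):
    # Adam (Kingma and Ba, 2015) as stated above; whether the released code uses
    # torch.optim.Adam or AdamW for the weight decay is not specified here.
    return torch.optim.Adam(model.parameters(),
                            lr=cfg["learning_rate"],
                            weight_decay=cfg["weight_decay"])
```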