DAGN: Discourse-Aware Graph Network for Logical Reasoning

Yinya Huang1∗  Meng Fang2  Yu Cao3  Liwei Wang4  Xiaodan Liang1†

1 Shenzhen Campus of Sun Yat-sen University
2 Tencent Robotics X
3 School of Computer Science, The University of Sydney
4 The Chinese University of Hong Kong

yinya.huang@hotmail, mfang@tencent.com, ycao8647@uni.sydney.edu.au,
lwwang@cse.cuhk.edu.hk, xdliang328@gmail.com

Abstract

Recent QA with logical reasoning questions requires passage-level relations among the sentences. However, current approaches still focus on sentence-level relations interacting among tokens. In this work, we explore aggregating passage-level clues for solving logical reasoning QA by using discourse-based information. We propose a discourse-aware graph network (DAGN) that reasons relying on the discourse structure of the texts. The model encodes discourse information as a graph with elementary discourse units (EDUs) and discourse relations, and learns the discourse-aware features via a graph network for downstream QA tasks. Experiments are conducted on two logical reasoning QA datasets, ReClor and LogiQA, and our proposed DAGN achieves competitive results. The source code is available at https://github.com/Eleanor-H/DAGN.

1 Introduction

A variety of QA datasets have promoted the development of reading comprehension, for instance, SQuAD (Rajpurkar et al., 2016), HotpotQA (Yang et al., 2018), DROP (Dua et al., 2019), and so on. Recently, QA datasets with more complicated reasoning types, i.e., logical reasoning, have also been introduced, such as ReClor (Yu et al., 2020) and LogiQA (Liu et al., 2020). The logical questions are taken from standardized exams such as the GMAT and LSAT, and require QA models to read complicated argument passages and identify the logical relationships therein, for example, selecting a correct assumption that supports an argument, or finding a claim that weakens an argument in a passage. Such logical reasoning is beyond the capability of most previous QA models, which focus on reasoning with entities or numerical keywords.

A main challenge for QA models is to uncover the logical structures underlying passages, such as identifying claims or hypotheses, or pointing out flaws in arguments. To achieve this, a QA model should first be aware of logical units, which can be sentences, clauses, or other meaningful text spans, and then identify the logical relationships between the units. However, the logical structures are usually hidden and difficult to extract, and most datasets do not provide such logical structure annotations.

An intuitive idea for unwrapping such logical information is to use discourse relations. For instance, as a conjunction, “because” indicates a causal relationship, whereas “if” indicates a hypothetical relationship. However, such discourse-based information is seldom considered in logical reasoning tasks. Modeling of logical structures is still lacking, and current publicly available methods rely on contextual pre-trained models (Yu et al., 2020). Besides, previous graph-based methods (Ran et al., 2019; Chen et al., 2020a) that construct entity-based graphs are not suitable for logical reasoning tasks because the reasoning units differ.

In this paper, we propose a new approach to solve logical reasoning QA tasks by incorporating discourse-based information. First, we construct discourse structures. We use discourse relations from the Penn Discourse TreeBank 2.0 (PDTB 2.0) (Prasad et al., 2008) as delimiters to split texts into elementary discourse units (EDUs). A logic graph is constructed in which EDUs are nodes and discourse relations are edges. Then, we propose a Discourse-Aware Graph Network (DAGN) for learning high-level discourse features to represent passages. The discourse features are incorporated with the contextual token features from pre-trained language models. With the enhanced features, DAGN predicts answers to logical questions. Our experiments show that DAGN surpasses current publicly available methods on two recent logical reasoning QA datasets, ReClor and LogiQA.

∗ This work was done during Yinya Huang’s internship in Tencent with L. Wang and M. Fang.
† Corresponding Author: Xiaodan Liang.
[Figure 1: The architecture of our proposed method with an example below. Panel (1): the discourse-based logic graph built from the context EDUs (E1–E6) and the answer-option EDUs (E7, E8). Panel (2): token embedding with a large-scale pre-trained model, EDU embedding, and graph reasoning over the logic graph. Panel (3): answer prediction with a residual BiGRU, pooling of the context, question, and option features, concatenation, and an FFN.]

Context: Theoretically, analog systems are superior to digital systems. A signal in a pure analog system can be infinitely detailed, while digital systems cannot produce signals that are more precise than their digital units. With this theoretical advantage there is a practical disadvantage. Since there is no limit on the potential detail of the signal, the duplication of an analog representation allows tiny variations from the original, which are errors.

Question: The statements above, if true, most strongly support which one of the following?

Option: Digital systems are the best information systems because error cannot occur in the emission of digital signals.

Our main contributions are three-fold:

• We propose to construct logic graphs from texts by using discourse relations as edges and elementary discourse units as nodes.

• We obtain discourse features via graph neural networks to facilitate logical reasoning in QA models.

• We show the effectiveness of using the logic graph and feature enhancement through noticeable improvements on two datasets, ReClor and LogiQA.

2 Method

Our intuition is to explicitly use discourse-based information to mimic the human reasoning process for logical reasoning questions. The questions are in multiple-choice format, which means that given a triplet (context, question, answer options), models answer the question by selecting the correct answer option. Our framework is shown in Figure 1. We first construct a discourse-based logic graph from the raw text. Then we conduct reasoning via graph networks to learn and update the discourse-based features, which are incorporated with the contextual token embeddings for downstream answer prediction.

2.1 Graph Construction

Our discourse-based logic graph is constructed in two steps: delimiting text into elementary discourse units (EDUs) and forming the graph using their relations as edges, as illustrated in Figure 1(1).

Discourse Units Delimitation  It has been studied that clause-like text spans delimited by discourse relations can be discourse units that reveal the rhetorical structure of texts (Mann and Thompson, 1988; Prasad et al., 2008). We further observe that such discourse units are essential units in logical reasoning, such as being assumptions or opinions. In the example shown in Figure 1, the “while” in the context indicates a comparison between the attributes of the “pure analog system” and those of “digital systems”. The “because” in the option provides the evidence “error cannot occur in the emission of digital signals” for the claim “digital systems are the best information systems”.

We use PDTB 2.0 (Prasad et al., 2008) to help draw discourse relations. PDTB 2.0 contains discourse relations that are manually annotated on the 1-million-word Wall Street Journal (WSJ) corpus and are broadly characterized as “Explicit” and “Implicit” connectives. The former appear explicitly in sentences, such as the discourse adverbial “instead” or the subordinating conjunction “because”, whereas the latter are inferred by annotators between successive pairs of text spans split by punctuation marks
such as “.” or “;”. We simply take all the “Explicit” connectives as well as common punctuation marks to form our discourse delimiter library (details are given in Appendix A), with which we delimit the texts into EDUs. For each data sample, we segment the context and options, ignoring the question since the question usually does not carry logical content.
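To make the delimitation step concrete, the following is a minimal Python sketch of how a text can be split into EDUs with such a delimiter library. It is our own illustration rather than the released implementation, and the short CONNECTIVES list stands in for the full library of Appendix A.

```python
import re

# A toy subset of the discourse delimiter library (Appendix A lists the full set):
# "Explicit" PDTB connectives plus common punctuation marks.
CONNECTIVES = ["because", "while", "since", "if", "although", "but"]
PUNCTUATION = [".", ",", ";", ":"]

def delimit_edus(text):
    """Split a text into EDUs, keeping each delimiter so its edge type is known later."""
    pattern = "|".join(
        [r"\b" + re.escape(c) + r"\b" for c in CONNECTIVES]
        + [re.escape(p) for p in PUNCTUATION]
    )
    pieces = re.split(f"({pattern})", text.lower())
    edus, delims = [], []
    for piece in (p.strip() for p in pieces):
        if not piece:
            continue
        (delims if piece in CONNECTIVES or piece in PUNCTUATION else edus).append(piece)
    return edus, delims

option = ("Digital systems are the best information systems "
          "because error cannot occur in the emission of digital signals.")
print(delimit_edus(option))
# (['digital systems are the best information systems',
#   'error cannot occur in the emission of digital signals'], ['because', '.'])
```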
Discourse Graph Construction  We define the discourse-based graphs with EDUs as nodes, and the “Explicit” connectives as well as the punctuation marks as two types of edges. We assume that each connective or punctuation mark connects the EDUs before and after it. For example, the option sentence in Figure 1 is delimited into two EDUs, EDU7 = “digital systems are the best information systems” and EDU8 = “error cannot occur in the emission of digital signals”, by the connective r = “because”. The returned triplets are then (EDU7, r, EDU8) and (EDU8, r, EDU7). For each data sample with the context and multiple answer options, we separately construct graphs corresponding to each option, with the EDUs from the same context and every single option. The graph for the single option k is denoted by G^k = (V^k, E^k).
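Continuing the sketch above, the graph construction can be pictured as follows, under the paper's assumption that each delimiter links the EDUs immediately before and after it in both directions. The LogicGraph container and the edge-type strings are our own illustrative names, not the released code.

```python
from dataclasses import dataclass

EXPLICIT = "explicit"      # edge type r_E: explicit connectives
PUNCT = "punctuation"      # edge type r_I: punctuation marks

@dataclass
class LogicGraph:
    nodes: list            # EDU texts; the list index is the node id
    edges: list            # (source, target, edge_type) triplets, added in both directions

def build_graph(edus, delims, connectives):
    """Assume delimiter i sits between EDU i and EDU i+1 and connects them both ways."""
    edges = []
    for i, delim in enumerate(delims):
        if i + 1 >= len(edus):
            break                      # e.g. a trailing period with no EDU after it
        edge_type = EXPLICIT if delim in connectives else PUNCT
        edges.append((i, i + 1, edge_type))
        edges.append((i + 1, i, edge_type))
    return LogicGraph(nodes=list(edus), edges=edges)

edus, delims = delimit_edus(option)            # from the sketch above
graph = build_graph(edus, delims, CONNECTIVES)
# graph.edges == [(0, 1, 'explicit'), (1, 0, 'explicit')]
```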
2.2 Discourse-Aware Graph Network

We present the Discourse-Aware Graph Network (DAGN), which uses the constructed graph to exploit discourse-based information for answering logical questions. It consists of three main components: an EDU encoding module, a graph reasoning module, and an answer prediction module. The former two are demonstrated in Figure 1(2), whereas the final component is in Figure 1(3).
EDU Encoding  An EDU span embedding is obtained from its token embeddings in two steps. First, similar to previous works (Yu et al., 2020; Liu et al., 2020), we encode the input sequence “<s> context </s> question || option </s>” into contextual token embeddings with pre-trained language models, where <s> and </s> are the special tokens of the RoBERTa (Liu et al., 2019) model, and || denotes concatenation. Second, given the token embedding sequence {t_1, t_2, ..., t_L}, the n-th EDU embedding is obtained by e_n = Σ_{l∈S_n} t_l, where S_n is the set of token indices belonging to the n-th EDU.
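The span sum in the second step amounts to segment-wise pooling over the contextual token embeddings. A minimal PyTorch sketch (the function and variable names are ours, not from the released code):

```python
import torch

def edu_embeddings(token_embs, edu_spans):
    """token_embs: (L, d) contextual token embeddings from the pre-trained encoder.
    edu_spans[n] is S_n, the list of token indices of the n-th EDU.
    Returns an (N, d) tensor with e_n = sum of t_l over l in S_n."""
    return torch.stack([token_embs[torch.tensor(span)].sum(dim=0) for span in edu_spans])

# Toy check: 6 tokens with hidden size 4, two EDUs covering tokens 0-2 and 3-5.
t = torch.randn(6, 4)
e = edu_embeddings(t, [[0, 1, 2], [3, 4, 5]])
assert torch.allclose(e[0], t[0:3].sum(dim=0))
```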
Graph Reasoning  After EDU encoding, DAGN performs reasoning over the discourse graph. Inspired by previous graph-based models (Ran et al., 2019; Chen et al., 2020a), we also learn graph node representations to obtain higher-level features. However, we consider a different graph construction and encoding. Specifically, let G^k = (V^k, E^k) denote the graph corresponding to the k-th option in the answer choices. For each node v_i ∈ V^k, the node embedding v_i is initialized with the corresponding EDU embedding e_i. N_i = {j | (v_j, v_i) ∈ E^k} indicates the neighbors of node v_i. W_{r_ji} is the adjacency matrix for one of the two edge types, where r_E indicates graph edges corresponding to the explicit connectives, and r_I indicates graph edges corresponding to punctuation marks.

The model first calculates a weight α_i for each node with a linear transformation and a sigmoid function, α_i = σ(W_α v_i + b_α), then conducts message propagation with the weights:

    ṽ_i = (1 / |N_i|) Σ_{j∈N_i} α_j W_{r_ji} v_j ,   r_ji ∈ {r_E, r_I}    (1)

where ṽ_i is the message representation of node v_i, and α_j and v_j are the weight and the node embedding of v_j respectively.

After the message propagation, the node representations are updated with the initial node embeddings and the message representations by

    v'_i = ReLU(W_u v_i + ṽ_i + b_u)    (2)

where W_u and b_u are a weight matrix and bias respectively. The updated node representations v'_i are used to enhance the contextual token embeddings via summation at the corresponding positions, i.e., t'_l = t_l + v'_n, where l ∈ S_n and S_n is the set of token indices for the n-th EDU.
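Equations (1) and (2) together form one message-passing layer. The following PyTorch sketch is our own reading of those equations; the module name and the loop-based aggregation are illustrative, and the extra augmentation used by DAGN (Aug) is omitted.

```python
import torch
import torch.nn as nn

class GraphReasoningLayer(nn.Module):
    """One round of Eq. (1) message passing followed by the Eq. (2) node update."""
    def __init__(self, dim):
        super().__init__()
        self.alpha = nn.Linear(dim, 1)                     # alpha_i = sigmoid(W_a v_i + b_a)
        self.rel = nn.ModuleDict({                         # W_{r_E} and W_{r_I}
            "explicit": nn.Linear(dim, dim, bias=False),
            "punctuation": nn.Linear(dim, dim, bias=False),
        })
        self.update = nn.Linear(dim, dim)                  # W_u and b_u in Eq. (2)

    def forward(self, v, edges):
        # v: (N, dim) node embeddings; edges: (source, target, edge_type) triplets.
        alpha = torch.sigmoid(self.alpha(v)).squeeze(-1)   # (N,) node weights
        incoming = [[] for _ in range(v.size(0))]
        for j, i, r in edges:                              # message from neighbour j to node i
            incoming[i].append(alpha[j] * self.rel[r](v[j]))
        messages = torch.stack([
            torch.stack(m).mean(dim=0) if m else torch.zeros_like(v[0])
            for m in incoming
        ])                                                 # Eq. (1): averaged weighted messages
        return torch.relu(self.update(v) + messages)       # Eq. (2): updated nodes v'_i

layer = GraphReasoningLayer(dim=4)
v = torch.randn(2, 4)                                      # two EDU nodes, e.g. the option graph above
v_updated = layer(v, [(0, 1, "explicit"), (1, 0, "explicit")])
# v_updated[n] would then be added to the token embeddings of the n-th EDU (t'_l = t_l + v'_n).
```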
Answer Prediction  The probabilities of the options are obtained by feeding the discourse-enhanced token embeddings into the answer prediction module. The model is trained end-to-end using a cross-entropy loss. Specifically, the embedding sequence first goes through a layer normalization (Ba et al., 2016), then a bidirectional GRU (Cho et al., 2014). The output embeddings are then added to the input ones as a residual structure (He et al., 2016). We finally obtain the encoded sequence after another layer normalization on the added embeddings.

We then merge the high-level discourse features and the low-level token features. Specifically, the variable-length encoded context sequence and question-and-option sequence are pooled via weighted summation, wherein the weights are the softmax results of a linear transformation of the sequence, resulting in a single feature vector for each. We concatenate them with the “<s>” embedding from the backbone pre-trained model, and feed the new vector into a two-layer perceptron with a GELU activation (Hendrycks and Gimpel, 2016) to get the output features for classification.
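The prediction module described above (layer normalization, bidirectional GRU with a residual connection, weighted-sum pooling of the context and question-and-option spans, concatenation with the “<s>” embedding, and a two-layer GELU perceptron) can be sketched as below. This is a hedged reconstruction, not the released code; in particular, we assume a GRU hidden size of half the embedding dimension so that the bidirectional output can be added residually, and the per-option scores would be compared across the four options under the cross-entropy loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnswerPredictionHead(nn.Module):
    """Scores one (context, question, option) sequence; the scores of the four options
    are compared against each other and trained with a cross-entropy loss."""
    def __init__(self, dim):
        super().__init__()
        self.norm_in = nn.LayerNorm(dim)
        self.norm_out = nn.LayerNorm(dim)
        # Assumption: hidden size dim // 2 so the bidirectional output matches dim
        # and the residual addition is shape-compatible.
        self.gru = nn.GRU(dim, dim // 2, bidirectional=True, batch_first=True)
        self.pool_ctx = nn.Linear(dim, 1)   # weights for pooling the context span
        self.pool_qo = nn.Linear(dim, 1)    # weights for pooling the question-and-option span
        self.mlp = nn.Sequential(nn.Linear(3 * dim, dim), nn.GELU(), nn.Linear(dim, 1))

    def pool(self, x, proj):
        w = F.softmax(proj(x), dim=1)        # softmax over sequence positions
        return (w * x).sum(dim=1)            # weighted sum -> (B, dim)

    def forward(self, tokens, ctx_len):
        # tokens: (B, L, dim) discourse-enhanced token embeddings; tokens[:, 0] is "<s>".
        h = self.norm_in(tokens)
        out, _ = self.gru(h)
        enc = self.norm_out(out + h)         # residual structure around the BiGRU
        ctx = self.pool(enc[:, 1:ctx_len], self.pool_ctx)
        qo = self.pool(enc[:, ctx_len:], self.pool_qo)
        feats = torch.cat([ctx, qo, tokens[:, 0]], dim=-1)
        return self.mlp(feats).squeeze(-1)   # (B,) score for this option

head = AnswerPredictionHead(dim=8)
scores = head(torch.randn(2, 10, 8), ctx_len=6)   # toy batch of two sequences
```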
Methods          Dev      Test     Test-E    Test-H
BERT-Large       53.80    49.80    72.00     32.30
XLNet-Large      62.00    56.00    75.70     40.50
RoBERTa-Large    62.60    55.60    75.50     40.00
DAGN             65.20    58.20    76.14     44.11
DAGN (Aug)       65.80    58.30    75.91     44.46

* The baseline results are taken from the ReClor paper.
* DAGN ranked 1st on the public ReClor leaderboard (https://bit.ly/2UOQfaS) until 17 Nov. 2020, before it was submitted to NAACL. Since then, several better results have appeared on the leaderboard, but they are not publicly available.

Table 1: Experimental results (accuracy %) of DAGN compared with baseline models on the ReClor dataset. Test-E = Test-EASY, Test-H = Test-HARD.

Methods          Dev      Test
BERT-Large       34.10    31.03
RoBERTa-Large    35.02    35.33
DAGN             35.48    38.71
DAGN (Aug)       36.87    39.32

Table 2: Experimental results (accuracy %) of DAGN compared with baseline models on the LogiQA dataset.

Methods                          Dev
DAGN                             65.20
ablation on nodes
  DAGN - clause nodes            64.40
  DAGN - sentence nodes          64.40
ablation on edges
  DAGN - single edge type        64.80
  DAGN - fully connected edges   61.60
ablation on graph reasoning
  DAGN w/o graph module          64.00

Table 3: Ablation study results (accuracy %) on the ReClor development set.
3 Experiments

We evaluate the performance of DAGN on two logical reasoning datasets, ReClor (Yu et al., 2020) and LogiQA (Liu et al., 2020), and conduct an ablation study on the graph construction and the graph network. The implementation details are given in Appendix B.

3.1 Datasets

ReClor contains 6,138 questions modified from standardized tests such as the GMAT and LSAT, which are split into train / dev / test sets with 4,638 / 500 / 1,000 samples respectively. The training set and the development set are available. The test set is blind and held out, and is split into an EASY subset and a HARD subset according to the performance of a BERT-base model (Devlin et al., 2019). The test results are obtained by submitting the test predictions to the leaderboard. LogiQA consists of 8,678 questions that are collected from the National Civil Servants Examinations of China and manually translated into English by professionals. The dataset is randomly split into train / dev / test sets with 7,376 / 651 / 651 samples respectively. Both datasets contain multiple logical reasoning types.

3.2 Results

The experimental results are shown in Tables 1 and 2. Since there is no publicly available method for either dataset, we compare DAGN with the baseline models. For DAGN, we fine-tune RoBERTa-Large as the backbone. DAGN (Aug) is a variant that augments the graph features.

DAGN reaches 58.20% test accuracy on ReClor. DAGN (Aug) reaches 58.30%, therein 75.91% on the EASY subset and 44.46% on the HARD subset. Compared with RoBERTa-Large, the improvement on the HARD subset is a remarkable 4.46%. This indicates that the incorporated discourse-based information supplements the shortcomings of the baseline model, and that the discourse features are beneficial for such logical reasoning. Besides, DAGN and DAGN (Aug) also outperform the baseline models on LogiQA, especially showing a 4.01% improvement over RoBERTa-Large on the test set.

3.3 Ablation Study

We conduct an ablation study on the graph construction details as well as the graph reasoning module. The results are reported in Table 3.

Varied Graph Nodes  We first use clauses or sentences in substitution for EDUs as graph nodes. For clause nodes, we simply remove “Explicit” connectives during discourse unit delimitation, so that the texts are delimited only by punctuation marks. For sentence nodes, we further reduce the delimiter library to solely the period (“.”). Using the modified graphs with clause nodes or coarser sentence nodes, the accuracy of DAGN drops to 64.40%. This indicates that clause or sentence nodes carry
less discourse information and act poorly as logical reasoning units.

Varied Graph Edges  We make two changes to the edges: (1) modifying the edge type, and (2) modifying the edge linking. For edge type, all edges are regarded as a single type. For edge linking, we ignore the discourse relations and connect every pair of nodes, turning the graph into a fully-connected one. The resulting accuracies drop to 64.80% and 61.60% respectively. This shows that in the graph we built, edges link EDUs in reasonable manners that properly indicate the logical relations.

Ablation on Graph Reasoning  We remove the graph module from DAGN for comparison. This model only contains an extra prediction module on top of the baseline. Its performance on the ReClor dev set lies between the baseline model and DAGN. Therefore, although the prediction module benefits the accuracy, the lack of graph reasoning leads to the absence of discourse features and degrades the performance. This demonstrates the necessity of the discourse-based structure for logical reasoning.

4 Related Works

Recent datasets for reading comprehension tend to be more complicated and require models to have reasoning capabilities. For instance, HotpotQA (Yang et al., 2018), WikiHop (Welbl et al., 2018), OpenBookQA (Mihaylov et al., 2018), and MultiRC (Khashabi et al., 2018) require the models to perform multi-hop reasoning. DROP (Dua et al., 2019) and MC-TACO (Zhou et al., 2019) need the models to perform numerical reasoning. WIQA (Tandon et al., 2019) and CosmosQA (Huang et al., 2019) require causal reasoning, where the models should understand counterfactual hypotheses or find the cause-effect relationships in events. The logical reasoning datasets (Yu et al., 2020; Liu et al., 2020), however, require the models to have the logical reasoning capability of uncovering the inner logic of texts.

Deep neural networks are used for reasoning-driven RC. Evidence-based methods (Madaan et al., 2020; Huang et al., 2020; Rajagopal et al., 2020) generate explainable evidence from a given context to support the reasoning. Graph-based methods (Qiu et al., 2019; De Cao et al., 2019; Cao et al., 2019; Ran et al., 2019; Chen et al., 2020b; Xu et al., 2020b; Zhang et al., 2020) explicitly model the reasoning process with constructed graphs, then learn and update features through message passing based on the graphs. There are also other methods such as neuro-symbolic models (Saha et al., 2021) and adversarial training (Pereira et al., 2020). Our paper uses a graph-based model. However, for uncovering logical relations, the graph nodes and edges are customized with discourse information.

Discourse information provides a high-level understanding of texts and hence is beneficial for many natural language tasks, for instance, text summarization (Cohan et al., 2018; Joty et al., 2019; Xu et al., 2020a; Feng et al., 2020), neural machine translation (Voita et al., 2018), and coherent text generation (Wang et al., 2020; Bosselut et al., 2018). There are also discourse-based applications for reading comprehension. DISCERN (Gao et al., 2020) segments texts into EDUs and learns interactive EDU features. Mihaylov and Frank (2019) provide additional discourse-based annotations and encode them with discourse-aware self-attention models. Unlike previous works, DAGN first uses discourse relations as graph edges connecting the EDUs of the texts, then learns the discourse features via message passing with graph neural networks.

5 Conclusion

In this paper, we introduce a Discourse-Aware Graph Network (DAGN) to address logical reasoning QA tasks. We first treat elementary discourse units (EDUs), which are split by discourse relations, as basic reasoning units. We then build discourse-based logic graphs with EDUs as nodes and discourse relations as edges. DAGN then learns the discourse-based features and enhances them with contextual token embeddings. DAGN reaches competitive performance on two recent logical reasoning datasets, ReClor and LogiQA.

Acknowledgements

The authors would like to thank Wenge Liu, Jianheng Tang, Guanlin Li and Wei Wang for their support and useful discussions. This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grants No. U19A2073 and No. 61976233, Guangdong Province Basic and Applied Basic Research (Regional Joint Fund-Key) Grant No. 2019B1515120039, Shenzhen Basic Research Project (Project No. JCYJ20190807154211365), Zhijiang Lab's Open Fund (No. 2020AA3AB14) and the CSIG Young Fellow Support Fund.
References

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. stat, 1050:21.

Antoine Bosselut, Asli Celikyilmaz, Xiaodong He, Jianfeng Gao, Po-Sen Huang, and Yejin Choi. 2018. Discourse-aware neural rewards for coherent text generation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 173–184.

Yu Cao, Meng Fang, and Dacheng Tao. 2019. Bag: Bi-directional attention entity graph convolutional network for multi-hop reasoning question answering. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 357–362.

Kunlong Chen, Weidi Xu, Xingyi Cheng, Zou Xiaochuan, Yuyu Zhang, Le Song, Taifeng Wang, Yuan Qi, and Wei Chu. 2020a. Question directed graph attention network for numerical reasoning over text. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6759–6768.

Kunlong Chen, Weidi Xu, Xingyi Cheng, Zou Xiaochuan, Yuyu Zhang, Le Song, Taifeng Wang, Yuan Qi, and Wei Chu. 2020b. Question directed graph attention network for numerical reasoning over text. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6759–6768, Online. Association for Computational Linguistics.

Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111.

Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, and Nazli Goharian. 2018. A discourse-aware attention model for abstractive summarization of long documents. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 615–621, New Orleans, Louisiana. Association for Computational Linguistics.

Nicola De Cao, Wilker Aziz, and Ivan Titov. 2019. Question answering by reasoning across documents with graph convolutional networks. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2306–2317.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2368–2378.

Xiachong Feng, Xiaocheng Feng, Bing Qin, Xinwei Geng, and Ting Liu. 2020. Dialogue discourse-aware graph convolutional networks for abstractive meeting summarization. arXiv preprint arXiv:2012.03502.

Yifan Gao, Chien-Sheng Wu, Jingjing Li, Shafiq Joty, Steven CH Hoi, Caiming Xiong, Irwin King, and Michael Lyu. 2020. Discern: Discourse-aware entailment reasoning network for conversational machine reading. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2439–2449.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778.

Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415.

Lifu Huang, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2019. Cosmos qa: Machine reading comprehension with contextual commonsense reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2391–2401.

Yinya Huang, Meng Fang, Xunlin Zhan, Qingxing Cao, Xiaodan Liang, and Liang Lin. 2020. Rem-net: Recursive erasure memory network for commonsense evidence refinement. arXiv preprint arXiv:2012.13185.

Shafiq Joty, Giuseppe Carenini, Raymond Ng, and Gabriel Murray. 2019. Discourse analysis and its applications. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts, pages 12–17, Florence, Italy. Association for Computational Linguistics.
Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. 2018. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 252–262.

Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR (Poster).

Jian Liu, Leyang Cui, Hanmeng Liu, Dandan Huang, Yile Wang, and Yue Zhang. 2020. Logiqa: A challenge dataset for machine reading comprehension with logical reasoning. IJCAI 2020.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.

Aman Madaan, Dheeraj Rajagopal, Yiming Yang, Abhilasha Ravichander, Eduard Hovy, and Shrimai Prabhumoye. 2020. Eigen: Event influence generation using pre-trained language models. arXiv preprint arXiv:2010.11764.

William C Mann and Sandra A Thompson. 1988. Rhetorical structure theory: Toward a functional theory of text organization. Text, 8(3):243–281.

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? a new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391.

Todor Mihaylov and Anette Frank. 2019. Discourse-aware semantic self-attention for narrative reading comprehension. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2541–2552, Hong Kong, China. Association for Computational Linguistics.

Lis Pereira, Xiaodong Liu, Fei Cheng, Masayuki Asahara, and Ichiro Kobayashi. 2020. Adversarial training for commonsense inference. In Proceedings of the 5th Workshop on Representation Learning for NLP, pages 55–60, Online. Association for Computational Linguistics.

Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind K Joshi, and Bonnie L Webber. 2008. The penn discourse treebank 2.0. In LREC. Citeseer.

Lin Qiu, Yunxuan Xiao, Yanru Qu, Hao Zhou, Lei Li, Weinan Zhang, and Yong Yu. 2019. Dynamically fused graph network for multi-hop reasoning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6140–6150, Florence, Italy. Association for Computational Linguistics.

Dheeraj Rajagopal, Niket Tandon, Peter Clark, Bhavana Dalvi, and Eduard Hovy. 2020. What-if i ask you to explain: Explaining the effects of perturbations in procedural text. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 3345–3355.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392.

Qiu Ran, Yankai Lin, Peng Li, Jie Zhou, and Zhiyuan Liu. 2019. Numnet: Machine reading comprehension with numerical reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2474–2484.

Amrita Saha, Shafiq Joty, and Steven CH Hoi. 2021. Weakly supervised neuro-symbolic module networks for numerical reasoning. arXiv preprint arXiv:2101.11802.

Niket Tandon, Bhavana Dalvi, Keisuke Sakaguchi, Peter Clark, and Antoine Bosselut. 2019. Wiqa: A dataset for “what if...” reasoning over procedural text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6078–6087.

Elena Voita, Pavel Serdyukov, Rico Sennrich, and Ivan Titov. 2018. Context-aware neural machine translation learns anaphora resolution. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1264–1274.

Wei Wang, Piji Li, and Hai-Tao Zheng. 2020. Consistency and coherency enhanced story generation. arXiv preprint arXiv:2010.08822.

Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2018. Constructing datasets for multi-hop reading comprehension across documents. Transactions of the Association for Computational Linguistics, 6:287–302.

Jiacheng Xu, Zhe Gan, Yu Cheng, and Jingjing Liu. 2020a. Discourse-aware neural extractive text summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5021–5031, Online. Association for Computational Linguistics.

Yunqiu Xu, Meng Fang, Ling Chen, Yali Du, Joey Tianyi Zhou, and Chengqi Zhang. 2020b. Deep reinforcement learning with stacked hierarchical attention for text-based games. In Advances in Neural Information Processing Systems, volume 33, pages 16495–16507. Curran Associates, Inc.
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. 2018. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380.

Weihao Yu, Zihang Jiang, Yanfei Dong, and Jiashi Feng. 2020. Reclor: A reading comprehension dataset requiring logical reasoning. In ICLR 2020: Eighth International Conference on Learning Representations.

Jipeng Zhang, Lei Wang, Roy Ka-Wei Lee, Yi Bin, Yan Wang, Jie Shao, and Ee-Peng Lim. 2020. Graph-to-tree learning for solving math word problems. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3928–3937.

Ben Zhou, Daniel Khashabi, Qiang Ning, and Dan Roth. 2019. “going on a vacation” takes longer than “going for a walk”: A study of temporal commonsense understanding. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3363–3369, Hong Kong, China. Association for Computational Linguistics.

A Discourse Delimiter Library

Our discourse delimiter library consists of two parts: the “Explicit” connectives annotated in the Penn Discourse TreeBank 2.0 (PDTB 2.0) (Prasad et al., 2008), and a set of punctuation marks. The overall discourse delimiters used in our method are presented in Table 4.

Explicit Connectives:
'once', 'although', 'though', 'but', 'because', 'nevertheless', 'before', 'for example', 'until', 'if', 'previously', 'when', 'and', 'so', 'then', 'while', 'as long as', 'however', 'also', 'after', 'separately', 'still', 'so that', 'or', 'moreover', 'in addition', 'instead', 'on the other hand', 'as', 'for instance', 'nonetheless', 'unless', 'meanwhile', 'yet', 'since', 'rather', 'in fact', 'indeed', 'later', 'ultimately', 'as a result', 'either or', 'therefore', 'in turn', 'thus', 'in particular', 'further', 'afterward', 'next', 'similarly', 'besides', 'if and when', 'nor', 'alternatively', 'whereas', 'overall', 'by comparison', 'till', 'in contrast', 'finally', 'otherwise', 'as if', 'thereby', 'now that', 'before and after', 'additionally', 'meantime', 'by contrast', 'if then', 'likewise', 'in the end', 'regardless', 'thereafter', 'earlier', 'in other words', 'as soon as', 'except', 'in short', 'neither nor', 'furthermore', 'lest', 'as though', 'specifically', 'conversely', 'consequently', 'as well', 'much as', 'plus', 'and', 'hence', 'by then', 'accordingly', 'on the contrary', 'simultaneously', 'for', 'in sum', 'when and if', 'insofar as', 'else', 'as an alternative', 'on the one hand on the other hand'

Punctuation Marks:
'.', ',', ';', ':'

Table 4: The discourse delimiter library in our implementation.

B Implementation Details

We fine-tune RoBERTa-Large (Liu et al., 2019) as the backbone pre-trained language model for DAGN; it contains 24 hidden layers with hidden size 1024. The overall model is trained end-to-end and updated by the Adam (Kingma and Ba, 2015) optimizer with an overall learning rate of 5e-6 and a weight decay of 0.01. The overall dropout rate is 0.1. The maximum sequence length is 256. We tune the model on the dev set to obtain the best number of graph reasoning iterations, which is 2 for the ReClor data and 3 for the LogiQA data. The model is trained for 10 epochs with a batch size of 16 on an Nvidia Tesla V100 GPU.

For the answer prediction module, the hidden size of the GRU is the same as that of the token embeddings in the pre-trained language model, which is 1024. The two-layer perceptron first projects the concatenated vectors from a hidden size of 1024 × 3 to 1024, then projects 1024 to 1.
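As a rough illustration of this setup (not the authors' training script), the optimizer configuration could look like the following, where the DAGN wrapper around the backbone is hypothetical:

```python
import torch
from transformers import RobertaModel

backbone = RobertaModel.from_pretrained("roberta-large")   # 24 layers, hidden size 1024

# model = DAGN(backbone, ...)  # hypothetical wrapper adding the graph and prediction modules

optimizer = torch.optim.Adam(
    backbone.parameters(),      # in the full model this would cover all DAGN parameters
    lr=5e-6,
    weight_decay=0.01,
)
# Remaining settings from Appendix B: dropout 0.1, maximum sequence length 256,
# 10 training epochs, batch size 16, and 2 (ReClor) or 3 (LogiQA) graph-reasoning iterations.
```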