Where Are We in Discourse Relation Recognition?

Katherine Atwell, Dept. of Computer Science, University of Pittsburgh (kaa139@pitt.edu)
Junyi Jessy Li, Dept. of Linguistics, The University of Texas at Austin (jessy@austin.utexas.edu)
Malihe Alikhani, Dept. of Computer Science, University of Pittsburgh (malihe@pitt.edu)

Abstract

Discourse parsers recognize the intentional and inferential relationships that organize extended texts. They have had a great influence on a variety of NLP tasks as well as theoretical studies in linguistics and cognitive science. However, it is often difficult to achieve good results from current discourse models, largely due to the difficulty of the task, particularly recognizing implicit discourse relations. Recent developments in transformer-based models have shown great promise on these analyses, but challenges still remain. We present a position paper which provides a systematic analysis of state-of-the-art discourse parsers. We aim to examine the performance of current discourse parsing models via gradual domain shift: within the same corpus, on in-domain texts, and on out-of-domain texts, and we discuss the differences between the transformer-based models and the previous models in predicting different types of implicit relations, both inter- and intra-sentential. We conclude by describing several shortcomings of the existing models and discussing how future work should approach this problem.

1 Introduction

Discourse analysis is a crucial analytic level in NLP. In natural language discourse, speakers and writers often rely on implicit inference to signal the kind of contribution they are making to the conversation, as well as key relationships that justify their point of view. While early AI literature is full of case studies suggesting that this inference is complex, open-ended and knowledge-heavy (e.g., Charniak (1973); Schank and Abelson (1977)), recent work on computational discourse coherence offers a different approach. Take the following example from Pitler and Nenkova (2008):

    (1) "Alice thought the story was predictable. She found it boring."

This discourse shows the classic pattern of implicit information. The overall point is that Alice had a negative opinion of the story: the underlying explanation is that the story was not interesting because it had no surprises. But given available lexical resources and sentiment detection methods, we can capture such inferences systematically by recognizing that they follow common general patterns, known as "discourse relations", and are guided by shallow cues.

An example of an instance in which discourse analysis can produce insights that may be missed by other NLP methods is the following, from Taboada (2016), where without discourse relations it may be difficult to capture the sentiment:

    (2) "While this book is totally different from any other book he has written to date, it did not disappoint me at all."

This represents a Concession relation according to both Rhetorical Structure Theory and the Penn Discourse Treebank (where it is notated as Comparison.Concession), resolving the incongruity of the first clause being negative and the second clause being positive by illustrating how the negative statement in the subordinate clause is reversed by the positive one in the main clause.

The importance of discourse has led to active research on predicting which coherence relations are present in a text based on shallow information. The predicted relations are then used to draw inferences from the text. The value of predicting the semantic classes of coherence relations has been demonstrated in several applications, including sentiment analysis (Marcu, 2000; Bhatia et al., 2015), machine comprehension (Narasimhan and Barzilay, 2015), summarization (Cohan et al., 2018; Marcu, 1999; Xu et al., 2019; Kikuchi et al., 2014), and predicting instructor intervention in an online course discussion forum (Chandrasekaran et al., 2017).
However, it is still the case that few works have so far found discourse relations to be key features (Zhong et al., 2020). We argue that one reason for this gap between theory and empirical evidence is the quality of the parsers, exacerbated by the distributional shifts in the texts they need to apply to.

The necessity of discourse research has resulted in several shared tasks (Xue et al., 2015, 2016) and the development of corpora in multiple languages (Zeyrek and Webber, 2008; Meyer et al., 2011; Danlos et al., 2012; Zhou et al., 2014; Zeyrek et al., 2020). Yet shallow discourse parsing is a very difficult task; more than 10 years after the introduction of the Penn Discourse Treebank (Miltsakaki et al., 2004), performance for English implicit discourse relation recognition has gone from 40.2 F-1 (Lin et al., 2009) to 47.8 (Lee et al., 2020), an improvement of less than 8 percentage points; a similar story could be told about the relation prediction performance of RST parsers. Such performance hinders the wider application of parsers. If downstream tasks are to use predicted relation senses, the data to which the systems are applied is typically different, to varying degrees, from their training data (the Wall Street Journal (WSJ) in a 3-year window). This tends to further aggravate the low performance observed. As a result, we often find that adding parsed discourse relations to models is unhelpful.

Although domain difference is recognized as an issue in shallow discourse parsing by existing work (Braud et al., 2017; Liu et al., 2016), we still have little understanding of which types of distributional shift matter and by how much, even within one language. This position paper seeks to shed some light on the current state of discourse parsing in English. Surprisingly, we found that parsers have some issues even within the same news source as the training set (WSJ); the differences in accuracy were not significant between in-domain and out-of-domain data for the qualitative examples that we looked at, although the distribution of errors tends to be different. This differs from other NLP tasks such as entity recognition, where training on data in the target domain increased the F1 score by over 20 points (Bamman et al., 2019).

We further found that parsers perform differently on implicit discourse relations held within vs. across sentences. We believe these findings are strong evidence for the sensitivity of existing models to distributional shift in terms of both linguistic structure and vocabulary.

Additionally, as part of our evaluation, we asked linguists to perform manual annotation, which allowed us to evaluate the accuracy of these parsers on plain, unlabeled text and to gain some insight into the mistakes made by the parsers. During the annotation process, we uncovered information that can guide future research, including but not limited to the critical role of context for implicit discourse sense classification. We discuss this need for context, hypothesize which scenarios may cause two arguments to need additional context, and provide some examples for which this is the case. We urge future researchers to consider developing context-aware models for shallow discourse parsing moving forward. We release our dataset to facilitate further discourse analysis under domain shift (our data is located at https://github.com/katherine-atwell/discourse-domain-shift).

2 Related Work

There are various frameworks for studying inferential links between discourse segments, from the local shallow relations of the PDTB (Prasad et al., 2008) to hierarchical constituent structures in RST (Carlson et al., 2003) or discourse graphs in Segmented Discourse Representation Theory (SDRT) (Asher et al., 2003) and the Discourse Graphbank (Wolf and Gibson, 2005).

Rhetorical Structure Theory (RST) (Mann and Thompson, 1987) provides a hierarchical structure for analyzing text that describes relations between text spans known as elementary discourse units (EDUs). The RST Discourse Treebank (Carlson et al., 2003) contains 385 Wall Street Journal articles from the Penn Treebank (Marcus et al., 1993) which have been split into elementary discourse units and annotated according to Rhetorical Structure Theory, where discourse relations are annotated in a tree structure across the whole document. A full list of these relations can be found in Carlson and Marcu (2001).

The Penn Discourse Treebank (PDTB) (Miltsakaki et al., 2004; Prasad et al., 2008, 2018), which also uses Penn Treebank Wall Street Journal articles, contains discourse relations annotated in a shallow, non-hierarchical manner. For each relation between two arguments, each argument and the discourse connective (the word or phrase that indicates the discourse relation) are labeled.

The PDTB also annotates whether a relation is explicit or non-explicit, the latter type of which has three subtypes: Implicit, AltLex, and EntRel. In this paper, we focus on implicit relations, where a connective that indicates a discourse relation can be inserted between the two arguments. These relations are considered extremely challenging for discourse parsers to identify automatically.

There is a need to examine the performance of the proposed discourse parsers, their representational choices, their generalizability, and their interpretability across domains, distributions, and frameworks. One recently developed resource is the PDTB-3. Since its release in 2019, several papers have evaluated the performance of implicit sense classifiers on this new corpus, which includes newly annotated intra-sentential implicit discourse relations. In addition to proposing a new evaluation framework for the PDTB, Kim et al. (2020) evaluate the performance of pretrained encoders for implicit sense classification on the PDTB-2 and the PDTB-3. Liang et al. (2020) identify locating the position of relations as a new challenge in the PDTB-3, due to the significantly increased number of intra-sentential implicit relations annotated.

Techniques of discourse parsing range from supervised (Liu et al., 2019; Mabona et al., 2019; Lin et al., 2019; Zhang et al., 2020; Kobayashi et al., 2020) to weakly supervised and unsupervised approaches (Lee et al., 2020; Nishida and Nakayama, 2020; Kurfalı and Östling, 2019); recent developments such as word/contextual embeddings have improved parser performance, although not as significantly as for other tasks (Shi and Demberg, 2019; Chen et al., 2019). Yet most works have made simplifying assumptions concerning the linguistic annotations for practical purposes, and these assumptions affect their evaluation and generality. For instance, most shallow discourse parsers use only the argument pairs to determine the discourse sense, without considering further context. Additionally, in RST parsing, standard practice involves classifying only the 18 top-level RST classes (Hernault et al., 2010; Feng and Hirst, 2014; Morey et al., 2017). Thus, all Elaboration relations are lumped together, making it a huge class. We reveal findings about these assumptions in Section 4.

Other works evaluating discourse parsers include DiscoEval (Chen et al., 2019), a test suite of evaluation tasks that test the effectiveness of different sentence encoders for discourse parsers, and an improved evaluation protocol for the PDTB-2 (Kim et al., 2020). In contrast, our work aims to analyze and evaluate existing discourse parsers via gradual domain shift. We provide a comparative genre-based analysis on distributionally shifted text data and present a qualitative analysis of the impact of the practical choices that these models make while doing discourse parsing across frameworks.

3 Where are we in discourse parsing?

3.1 Experiments

Data. We start by focusing on possible distributional shifts in a shallow parser's application by considering different linguistic types of implicit discourse relations (inter- vs. intra-sentential) (Liang et al., 2020). To do this, we evaluate performance on the PDTB-2 and the PDTB-3, as well as on the intra-sentential relations in the PDTB-3 specifically.

We then evaluate the performance of three widely used or state-of-the-art models under gradual shift of the domain of texts, noting that users who want to use a parser will be applying it to data that varies linguistically, to different degrees, from the parser's training data (a fixed 3-year window of WSJ articles). The data we examine consists of WSJ texts outside of the Penn Treebank, other news texts, and the GUM corpus (Zeldes, 2017). Note that none of these texts contain gold PDTB annotations, and only the GUM corpus contains gold RST annotations.

Setup. To examine the impact of changing the linguistic distribution by introducing intra-sentential discourse relations, we run the model developed by Chen et al. (2019) using the same train-test split as the authors, training and testing on discourse senses which have 10 or more examples. To get results for the PDTB-2, we train and test the model on the PDTB-2; to get results for the PDTB-3 and for the intra-sentential relations in the PDTB-3, we train the model on the PDTB-3 and evaluate its performance on both of these sets.
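To make this preprocessing concrete, the sketch below filters relations to frequent senses and separates intra- from inter-sentential instances. It is a minimal illustration; the record format and field names (e.g., `sense`, the sentence indices) are assumptions for exposition, not the exact data structures of the corpora or of our pipeline.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class ImplicitRelation:
    arg1_sent_idx: int   # index of the sentence containing Arg1
    arg2_sent_idx: int   # index of the sentence containing Arg2
    sense: str           # e.g. "Contingency.Cause"

def filter_frequent_senses(relations, min_count=10):
    """Keep only relations whose sense has at least min_count examples."""
    counts = Counter(r.sense for r in relations)
    return [r for r in relations if counts[r.sense] >= min_count]

def split_by_position(relations):
    """Separate intra-sentential relations (both arguments in the same
    sentence) from inter-sentential ones (arguments in different sentences)."""
    intra = [r for r in relations if r.arg1_sent_idx == r.arg2_sent_idx]
    inter = [r for r in relations if r.arg1_sent_idx != r.arg2_sent_idx]
    return intra, inter
```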
To parse plain-text documents for PDTB relations, we use the Wang and Lan (2015) parser as our end-to-end parser and the Chen et al. (2019) DiscoEval parser as our implicit sense classifier. The former is needed in order to parse unlabeled text, and the latter is a more accurate BERT-based implicit sense classifier (implicit sense classification is the most difficult PDTB parsing task).
To evaluate these parsers, we look at quantitative aspects of their output (e.g., the distributions) and qualitative aspects (manual annotation and inspection of parser output).

For our RST experiments, we use the state-of-the-art Wang et al. (2017) parser. We evaluate the performance of this parser on the standard RST Discourse Treebank test set with a 90-10 split (347 training documents and 38 test documents). We also evaluate it on the gold labels from the GUM corpus (while still training on the RST Discourse Treebank). Because GUM is annotated with 20 different discourse relations which do not precisely map to the conventional 18 types used in the Wang et al. (2017) parser, we map the ones that do not match these types, or that are more fine-grained, in the following manner, following Braud et al. (2017): preparation to BACKGROUND, justify and motivation to EXPLANATION, and solutionhood to TOPIC-COMMENT.
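This label harmonization amounts to a simple lookup applied before scoring. A minimal sketch is shown below; the exact label spellings on both sides are assumptions and may need adjusting to the corpus and parser versions used.

```python
# Map GUM relation labels onto the 18 coarse RST classes used by the parser,
# following the correspondences of Braud et al. (2017) described above.
GUM_TO_RST = {
    "preparation": "BACKGROUND",
    "justify": "EXPLANATION",
    "motivation": "EXPLANATION",
    "solutionhood": "TOPIC-COMMENT",
}

def harmonize(gum_label: str) -> str:
    """Return the coarse RST class used by the parser for a GUM label."""
    return GUM_TO_RST.get(gum_label.lower(), gum_label.upper())
```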
For the plain-text news articles from outside of the PDTB corpus, we mirror the PDTB experiments on these documents by parsing them with the Wang et al. (2017) parser, then examining the resulting distributions and manually inspecting the parser output.

3.2 Findings

Transformer-based models perform better on linguistically different intra-sentential relations than they do on inter-sentential relations. As mentioned above, we aim to examine the results of distributional shifts in both vocabulary and linguistic structure. Here, we look at shifts in linguistic structure, namely, inter- vs. intra-sentence implicit discourse relations (Hobbs, 1985). The latter were introduced in the PDTB-3 (Liang et al., 2020), from which we show the following example:

    (3) ...Exxon Corp. built the plant but (Implicit=then) closed it in 1985

Unlike the inter-sentence relations that were annotated across adjacent sentences, implicit intra-sentence relations do not occur at well-defined positions, but rather between varied types of syntactic constituents. Additionally, they often co-occur with explicit relations.

              PDTB-2    PDTB-3    PDTB-3 Intra-Sent
    Base      0.4236    0.4897    0.6251
    Large     0.4358    0.5094    0.6251

Table 1: Accuracy of the BERT-based model described in Chen et al. (2019) on implicit relations in the PDTB.

Table 1 shows the accuracies of the base and large BERT models (Chen et al., 2019) on the implicit relations in the two versions of the PDTB. The results on the PDTB-3 are significantly better than those on the PDTB-2, and the model tested on the PDTB-3 intra-sentential relations significantly outperformed both. This mirrors the results found from running the baseline model in Liang et al. (2020) on the PDTB-2, the PDTB-3, and the PDTB-3 intra-sentential relations.

Figure 1 shows the accuracy of the Wang et al. (2017) parser on the inter-sentential and intra-sentential relations in the RST Discourse Treebank, respectively. For the inter-sentential relations, we sampled only the relations between two sentences to have a "fairer" comparison (it is well known that performance suffers at higher levels of the RST tree). As with the PDTB, these results show a significant improvement in performance when the parser is run on only the intra-sentential relations compared to only the inter-sentential relations.

These results drive home the influence of the linguistic and structural differences between intra- and inter-sentence implicit relations on the performance of the parsers. We initially found this surprising, since intra-sentence relations contain arguments with less information than their (full-sentence) inter-sentence counterparts. However, one explanation for this is that, while looking for relations within sentence boundaries is a problem that has been thoroughly explored, and to some extent solved, in various NLP tasks (e.g., syntactic parsing), there are not as many rules regarding relations that occur across sentence boundaries. Regardless of the cause, these results illustrate that future shallow discourse parsers may benefit from accounting for such linguistic differences explicitly.

Parsers struggle to identify implicit relations from less frequent classes. The second distributional shift we examine is a shift in vocabulary. In order to capture this, we measure the performance across several domain shifts from the PDTB-2 using three datasets: WSJ articles from the COHA corpus (Davies, 2012), other news articles from COHA, and the GUM corpus (Zeldes, 2017). The WSJ articles are completely within the domain of the PDTB, but are further shifted in time than the PDTB test set.

Figure 1: F-1 scores for running the Wang et al. (2017) RST parser on the RST Discourse Treebank for inter-sentential (yellow) and intra-sentential (blue) relations (* denotes that this relation was not included in the set of inter-sentential relations). The performance of the parser is better for the intra-sentential relations than for the inter-sentential relations.

The other news articles are in-domain as well, but not from the same source publication, and thus may be linguistically different. The GUM corpus, our out-of-domain dataset, contains data from eight domains: Academic, Bio, Fiction, Interview, News, Travel, How-to guides, and Forum Discussions. It contains gold RST annotations but no PDTB annotations.

To quantitatively evaluate the performance of these parsing models, we examine the distribution of the parser predictions and how frequently different senses are predicted. From this, we noticed that only 5 out of the 16 PDTB-2 level-2 senses were predicted at all by the Wang and Lan (2015) parser, and only 7 out of 16 were predicted by the DiscoEval parser. Of these classes, several were predicted less than 2% of the time (Table 6).
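This distributional analysis amounts to counting the predicted senses and normalizing. A minimal sketch is given below; it assumes the parser output has already been reduced to a flat list of level-2 sense strings, which is an illustrative simplification.

```python
from collections import Counter

def sense_distribution(predicted_senses):
    """Return the percentage of predictions assigned to each level-2 sense,
    e.g. {"Contingency.Cause": 51.3, "Expansion.Restatement": 30.9, ...}."""
    counts = Counter(predicted_senses)
    total = sum(counts.values())
    return {sense: 100.0 * n / total for sense, n in counts.most_common()}

# Senses that a parser never predicts simply do not appear in the counter;
# this is how the "predicted 0% of the time" rows of Table 6 arise.
```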
We can also see in Tables 2 and 3 that the Wang and Lan (2015) parser predicted at least 38.7% Contingency.Cause for all datasets and the DiscoEval parser predicted at least 44% Contingency.Cause, although these percentages were often much higher. Because only 24.9% of the total relations contained in the PDTB are Contingency, this over-representation of Contingency.Cause in the predictions indicates a strong bias towards Contingency. Indeed, many of the errors found during annotation occurred when the parser predicted Contingency.Cause, the most common level-2 sense, over a less represented class such as Comparison.Contrast; the precision for Contingency.Cause was 0.33, 0.14, and 0.33 for WSJ articles, non-WSJ news articles, and the GUM corpus respectively. This likely contributed to the low accuracy for these documents.

These results show us that if PDTB parsers are run on plain-text documents, whether in-domain or slightly shifted, the results are likely to be overconfident with majority classes and unlikely to predict minority classes.

    Wang and Lan (2015) Level-2 Predictions
    Sense                       WSJ     Other news    GUM
                                        articles
    Expansion.Conjunction       15.2    22.7          12.1
    Expansion.Instantiation     2.4     1.5           0.7
    Expansion.Restatement       30.9    36.1          29.5
    Comparison.Contrast         0.3     0.9           0.9
    Contingency.Cause           51.3    38.7          56.7

Table 2: Percentages of PDTB level-2 senses predicted by the Wang and Lan (2015) parser on Wall Street Journal articles, other news articles, and the GUM corpus. The other 11 senses not included in this table were not predicted by the parser at all.

We also obtained the predicted distributions of the RST relations (Table 4) on the COHA news articles; we examined these results for the set of WSJ articles as well as for the other news articles. We found that relations that are highly represented in the RST Discourse Treebank, such as Elaboration, Attribution, and Same-Unit, were predicted much more frequently than they appear in the RST Discourse Treebank.

However, more minority classes were represented in these predictions than in the PDTB parser's predictions.

    BERT Level-2 Predictions
    Sense                       WSJ     Other news    GUM
                                        articles
    Temporal.Asynchronous       1.3     1.6           4.2
    Expansion.Conjunction       16.4    20.9          19.6
    Expansion.Instantiation     2.1     2.3           1.0
    Expansion.List              0.7     0.4           2.8
    Expansion.Restatement       22.9    27.2          21.8
    Comparison.Contrast         2.1     3.1           1.0
    Comparison.Concession       0       0.02          0
    Contingency.Cause           54.3    44.4          49.1

Table 3: Percentages of PDTB level-2 senses predicted by the BERT-based model described in Chen et al. (2019) on Wall Street Journal articles, other news articles, and the GUM corpus. The other 9 senses not included in this table were not predicted by the parser at all, and thus were predicted 0% of the time.

    Predicted RST Relation Percentages
    Relation          WSJ Articles    Other News Texts
    Attribution       22.02           21.38
    Background        2.66            2.98
    Cause             0.94            0.79
    Comparison        0.90            0.49
    Condition         2.96            1.93
    Contrast          4.69            3.86
    Elaboration       31.47           32.92
    Enablement        4.58            4.20
    Evaluation        0.04            0.01
    Explanation       0.56            0.71
    Joint             9.49            9.21
    Manner-Means      1.13            1.04
    Same-Unit         17.2            19.31
    Temporal          1.31            1.18

Table 4: Distribution of relations predicted by running the Wang et al. (2017) parser on COHA news articles. The 4 relations not listed here were not predicted at all by the parser.

Models fail to generalize to both in-domain and out-of-domain data, and different errors are seen for different domains. We continue to analyze the effects of a change in the distribution of vocabulary by qualitatively analyzing the results of our discourse parsers through manual inspection. To qualitatively evaluate the results of the PDTB parsers across domains, we randomly selected 64 implicit relations predicted by the parsers and asked two expert linguists (a faculty member and a graduate student in a linguistics department) to annotate them. These annotations allow us to evaluate the accuracy of the parsers, since none of the documents we are looking at (Wall Street Journal articles in the COHA dataset, other news articles, and the GUM corpus) have PDTB annotations. More details about our annotation protocol are provided at the beginning of Section 4.

                   WSJ Articles      Other News        GUM Corpus
    Level          Wang    Disco-    Wang    Disco-    Wang    Disco-
    Correct        & Lan   Eval      & Lan   Eval      & Lan   Eval
    None           46.7    60.0      35.3    41.2      43.8    23.8
    Level 1        20.0    6.7       29.4    44.4      21.9    28.1
    Level 2        33.3    33.3      35.3    29.4      34.4    28.1

Table 5: Resulting accuracies from annotating a sample of implicit PDTB relations and comparing these annotations to the output of the Wang and Lan (2015) and DiscoEval parsers.

The annotation results are in Table 5, where the results of the parsers are compared to the ground-truth labels produced by the annotators.
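The breakdown in Table 5 can be computed by comparing the annotator's sense to the parser's sense at each level of the PDTB sense hierarchy. A rough sketch is below; it assumes senses are dot-separated strings such as "Contingency.Cause", which is how we refer to them in this paper, and is meant as an illustration rather than our exact scoring script.

```python
from collections import Counter

def agreement_level(gold: str, predicted: str) -> str:
    """Classify a prediction as matching the gold sense at level 2,
    only at level 1, or not at all."""
    gold_l1, pred_l1 = gold.split(".")[0], predicted.split(".")[0]
    if gold == predicted:
        return "Level2"
    if gold_l1 == pred_l1:
        return "Level1"
    return "None"

def agreement_table(pairs):
    """pairs: iterable of (gold, predicted) sense strings for one corpus."""
    counts = Counter(agreement_level(g, p) for g, p in pairs)
    total = sum(counts.values())
    return {lvl: 100.0 * counts[lvl] / total for lvl in ("None", "Level1", "Level2")}
```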
Across the three corpora, the annotators noticed that in many cases the relation type was labeled as EntRel or NoRel when it should not have been, or vice versa. This led to discourse senses being predicted for relations that did not have a discourse sense, and vice versa. The parsers also often had issues with argument segmentation. For the GUM corpus, segmentation was especially an issue in the travel genre, where headers or captions would be labeled as part of an argument.

As is shown in Table 5, the percentage of implicit relations that the parsers got right at the second level appeared to decrease on average as the domain shifted. However, this was a very slight decrease; the parsers had roughly the same level of accuracy across all datasets, and that accuracy was very low. In fact, for all parsing models and datasets, a larger percentage of relations was predicted completely incorrectly.

Figure 2: F-1 scores for running the Wang et al. (2017) RST parser on the RST Discourse Treebank test set and on the GUM corpus (* indicates that a relation was not annotated in the GUM corpus). For most relations, the parser performed much better on the RST test set than on the GUM articles.

                 Wang and Lan              BERT
                 WSJ   Other   GUM    WSJ   Other   GUM
    Max          51.3  38.7    56.7   54.3  44.4    49.1
    Std. dev.    14.1  13.0    15.4   14.0  12.6    12.9
    0%           11    11      11     9     8       8
    0-2%         1     2       2      2     3       3
    2-5%         1     0       0      2     2       2
    >5%          3     3       3      3     3       3

Table 6: Summary statistics for running the Wang and Lan (2015) parser and the BERT-based parser on WSJ articles, other news articles, and GUM. We study the percentage of predicted level-2 PDTB relations, reporting the maximum, the standard deviation, and the number of sense types that were predicted 0% of the time, 0-2% of the time, and so on.

The results of running the state-of-the-art Wang et al. (2017) parser on the gold labels of the RST Discourse Treebank and the GUM corpus are shown in Figure 2. These results make it clear that the RST parser performs much worse on out-of-domain data than it does on RST corpus data. This is expected; it unsurprisingly does not generalize as well to text outside of its domain as to the news text contained within the corpus test set, due to a change in vocabulary. However, in order for discourse parsers to be useful for applications outside of the news domain, models that can more easily adapt to the target domain must be developed.

4 Insights for model development

While inspecting the results of the annotations, we found several phenomena helpful for developing future models, including observations regarding the role of context in shallow discourse parsing and errors that current RST parsers are making.

4.1 Annotation Details

For the qualitative analysis, we ask two annotators (a faculty member and a graduate student from linguistics departments) to provide annotations for the data, as none of the texts contain gold PDTB labels and only the GUM corpus contains gold RST labels. The annotators were trained on, and provided with, the PDTB 2.0 annotation manual (Prasad et al., 2007).

For the annotators to annotate this corpus, discourse relations were randomly chosen from Wall Street Journal articles, other news articles, and the GUM corpus. 64 of these discourse relations were implicit, and they are the only ones reported in this paper. The annotators were given the sentence(s) containing both arguments, with the arguments labeled, and they also had access to the article text if they needed to refer back to it. To assess inter-rater agreement, we determine Cohen's κ value (Cohen, 1960). We randomly selected 25 samples from the PDTB and assigned each to the annotators. We obtained a Cohen's κ of 0.88, which indicates almost perfect agreement.
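For reference, this agreement statistic can be computed directly from the two annotators' sense labels; a minimal sketch using scikit-learn is shown below. The label lists are placeholders, not our actual annotations.

```python
from sklearn.metrics import cohen_kappa_score

# Level-2 sense labels assigned by each annotator to the same relations
# (placeholder values; in our setting these are PDTB senses such as
# "Contingency.Cause" or "Comparison.Contrast").
annotator_a = ["Contingency.Cause", "Comparison.Contrast", "Expansion.Conjunction"]
annotator_b = ["Contingency.Cause", "Comparison.Contrast", "Expansion.Restatement"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")
```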
4.2 Findings

More context than the two arguments is needed to determine the correct discourse relation in many cases. One potential way to mitigate the impact of domain shift on the performance of shallow discourse parsers is to incorporate context. With a few exceptions (Dai and Huang, 2018; Shi and Demberg, 2019; Zhang et al., 2021), existing models for shallow discourse parsing mostly do not use input beyond the two adjacent sentences that comprise the arguments of the relation (Kishimoto et al., 2020; Chen et al., 2019).

Figure 3: RST parse tree containing a segment of the relations that were examined in the qualitative analysis. The discourse sense labels on this tree that were examined in our analysis are marked in red and green, where green is correct and red is incorrect.

We found that only considering these two sentences is not sufficient even for our expert linguist annotators. Specifically, while annotating the PDTB, the annotators found several examples where, when they looked at the larger context behind the arguments and the sentences in which the arguments were contained, their annotations changed. Below, we describe a few examples that demonstrate the mistakes that can be made without the full context, and their implications:

    (4) In this northern latitude it does n't get dark in summer until about 10:30 p.m. so lighting is operate except at some crazy time like 11:45 at night , whenever there is power , unless they have stand-by diesel generators. There 's a year 's supply of diesel oil here.

This example is from the Wall Street Journal. At first glimpse, one would think to annotate this as Contingency.Factual present condition, but this does not capture the full context, which is shown below:

    (5) One housewife says : " With an electric kitchen I have to do my whole day 's cooking the day before – and that during a couple of hours , not knowing from one minute to the next what time the power is coming on. " In this northern latitude it does n't get dark in summer until about 10:30 p.m. so lighting is operate except at some crazy time like 11:45 at night , whenever there is power , unless they have stand-by diesel generators. There 's a year 's supply of diesel oil here.

The additional context, that the people in the country described are dealing with electricity issues despite there being a year's worth of diesel supply, is made clear in this passage. Thus we can conclude that the correct relation here is Comparison.Contrast. Without this context, seeing only the two sentences in which the arguments are contained, it is difficult for an annotator to discern this. This shows that with exposure to only the two arguments, without additional context, the sense may be marked incorrectly. The Wang and Lan (2015) parser and the DiscoEval parser both predicted this relation incorrectly, with the Wang and Lan (2015) parser predicting it as Contingency.Cause and the BERT parser predicting it as Expansion.Conjunction.

Similarly, the following example, also contained in this passage, has a different true annotation than one would think from seeing only the arguments:

    (6) One housewife says : " With an electric kitchen I have to do my whole day 's cooking the day before – and that during a couple of hours , not knowing from one minute to the next what time the power is coming on . " In this northern latitude it does n't get dark in summer until about 10:30 p.m. so lighting is operate except at some crazy time like 11:45 at night , whenever there is power , unless they have stand-by diesel generators .

The relation may be deemed Expansion.Instantiation. However, by reading the full text, it is clear that it should be labeled as Contingency.Cause. Like the last example, a clearer view of the full text is needed to determine the proper annotation, not simply the two arguments.

These observations provide insights as to why contextual embeddings with document context, such as those trained with the next-sentence prediction task, help with implicit discourse relation classification (Shi and Demberg, 2019). More generally, we believe future work on discourse parsing should look beyond only the arguments of a relation because of the different interpretations one would give when taking the relation in vs. out of context. We believe that argument pairs with low specificity and one or more pronouns may be especially in need of this extra context, but more experimentation will have to be done to confirm this hypothesis.
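As a rough illustration of the kind of context-aware input we have in mind, the sketch below widens the classifier input from the two argument sentences to a window of surrounding sentences. The window size and the simple string-concatenation format are illustrative assumptions, not a recipe we experimented with.

```python
def build_context_input(sentences, arg1_idx, arg2_idx, window=2):
    """Return the argument sentences together with up to `window` sentences
    of surrounding document context, in document order."""
    start = max(0, min(arg1_idx, arg2_idx) - window)
    end = min(len(sentences), max(arg1_idx, arg2_idx) + window + 1)
    return " ".join(sentences[start:end])
```

A sense classifier could then consume this widened string in place of the bare argument pair; how much context actually helps, and for which argument pairs, is exactly the open question raised above.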
Attachment issues tend to occur throughout the RST parse tree, and relations are often misclassified as Same-Unit and Elaboration. Regarding insights for the RST Discourse Treebank, a piece of the RST tree for this paragraph can be seen in Figure 3. Here, the EDU "One housewife says" should attach to the EDU after it, "With an electric kitchen I have to do my whole day's cooking the day before". However, it instead attaches to EDUs from the preceding sentences, which is incorrect, as these two sentences do not contain what the housewife says. We saw several other attachment issues in the text, including a couple where the attachment should go up or down by several levels. We also saw several instances of the relation being incorrectly tagged as Same-Unit or Elaboration, some of which can be seen in the diagram.

Attachment issues are a particular problem for RST parsing due to its hierarchical nature; one attachment issue can lead to error propagation, where the accuracy of the attachments further in the tree is impacted by that of the current one. Reducing this error is of the utmost importance for future parsers.

5 Conclusion and future work

Discourse parsing for text has seen a recent surge of experimental approaches. In this work we presented a detailed analysis of the performance of state-of-the-art discourse parsers and analysed their weaknesses and strengths. The conclusions drawn from these experiments make it clear that discourse parsing, though it has come a long way in the past decade or so, still has a long way to go, particularly with respect to parsing out-of-domain texts and addressing issues of class imbalance, although the BERT-based model has made some improvements in this area. Additionally, we investigated how and when the PDTB-3 can help in improving the prediction of intra-sentential implicit relations.

There are several promising future directions for the area of discourse parsing. A model that detects intra-sentential implicit relations is necessary in order to be able to parse on the PDTB-3. Exploring new neural parsing strategies is also a must. We observed that neural parsers are ignorant about what they do not know and overconfident when they make uninformed predictions. Quantifying prediction uncertainty directly, by training the model to output high uncertainty for data samples close to class boundaries, can result in parsers that make better decisions. One takeaway of our empirical analysis was the importance of the role of context in identifying the correct discourse relations. This observation suggests the need for new computational experiments that can identify the right context window required for the model to accurately predict relations.
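The uncertainty-aware decision rule mentioned above can be made concrete with a simple entropy threshold over the classifier's softmax output; the sketch below is only illustrative (the threshold value and the abstention behavior are assumptions, not something we trained or evaluated).

```python
import math

def predictive_entropy(probs):
    """Entropy (in nats) of a softmax distribution over relation senses."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def predict_with_abstention(probs, senses, max_entropy=1.0):
    """Return the argmax sense, or None if the model is too uncertain."""
    if predictive_entropy(probs) > max_entropy:
        return None  # defer: the prediction is close to a class boundary
    return senses[max(range(len(probs)), key=probs.__getitem__)]
```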
Another useful direction is designing models that can learn discourse relations on their own, without the help of annotated corpora. There are several unsupervised models (Kobayashi et al., 2019; Nishida and Nakayama, 2020) that are used for determining the structure of discourse parse trees, but few that infer the relations themselves.

Acknowledgements

We would like to thank the reviewers, Diane Litman, and Matthew Stone for providing helpful feedback on this work.

References

Nicholas Asher, Nicholas Michael Asher, and Alex Lascarides. 2003. Logics of Conversation. Cambridge University Press.

David Bamman, Sejal Popat, and Sheng Shen. 2019. An annotated dataset of literary entities. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2138–2144.

Parminder Bhatia, Yangfeng Ji, and Jacob Eisenstein. 2015. Better document-level sentiment analysis from RST discourse parsing. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2212–2218, Lisbon, Portugal. Association for Computational Linguistics.

Chloé Braud, Maximin Coavoux, and Anders Søgaard. 2017. Cross-lingual RST discourse parsing. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 292–304.

Lynn Carlson and Daniel Marcu. 2001. Discourse tagging reference manual. ISI Technical Report ISI-TR-545, 54:56.

Lynn Carlson, Daniel Marcu, and Mary Ellen Okurowski. 2003. Building a discourse-tagged corpus in the framework of Rhetorical Structure Theory. In Current and New Directions in Discourse and Dialogue, pages 85–112. Springer.

Muthu Kumar Chandrasekaran, Carrie Epp, Min-Yen Kan, and Diane Litman. 2017. Using discourse signals for robust instructor intervention prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31.

Eugene Charniak. 1973. Jack and Janet in search of a theory of knowledge. In IJCAI, pages 337–343.

Mingda Chen, Zewei Chu, and Kevin Gimpel. 2019. Evaluation benchmarks and learning criteria for discourse-aware sentence representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 649–662.

Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, and Nazli Goharian. 2018. A discourse-aware attention model for abstractive summarization of long documents. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 615–621, New Orleans, Louisiana. Association for Computational Linguistics.

Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.

Zeyu Dai and Ruihong Huang. 2018. Improving implicit discourse relation classification by modeling inter-dependencies of discourse units in a paragraph. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers).

Laurence Danlos, Diégo Antolinos-Basso, Chloé Braud, and Charlotte Roze. 2012. Vers le FDTB: French Discourse Tree Bank. In TALN 2012: 19ème conférence sur le Traitement Automatique des Langues Naturelles, volume 2, pages 471–478. ATALA/AFCP.

Mark Davies. 2012. Expanding horizons in historical linguistics with the 400-million word Corpus of Historical American English. Corpora, 7(2):121–157.

Eleni Miltsakaki, Rashmi Prasad, Aravind Joshi, and Bonnie Webber. 2004. The Penn Discourse Treebank. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC'04).

Vanessa Wei Feng and Graeme Hirst. 2014. A linear-time bottom-up discourse parser with constraints and post-editing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 511–521.

Hugo Hernault, Helmut Prendinger, Mitsuru Ishizuka, et al. 2010. HILDA: A discourse parser using support vector machine classification. Dialogue & Discourse, 1(3).

Jerry R. Hobbs. 1985. On the coherence and structure of discourse. Technical report, Center for the Study of Language and Information, Stanford University.

Yuta Kikuchi, Tsutomu Hirao, Hiroya Takamura, Manabu Okumura, and Masaaki Nagata. 2014. Single document summarization based on nested tree structure. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 315–320.

Najoung Kim, Song Feng, Chulaka Gunasekara, and Luis Lastras. 2020. Implicit discourse relation classification: We need to talk about evaluation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5404–5414.

Yudai Kishimoto, Yugo Murawaki, and Sadao Kurohashi. 2020. Adapting BERT to implicit discourse relation classification with a focus on discourse connectives. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 1152–1158.

Naoki Kobayashi, Tsutomu Hirao, Hidetaka Kamigaito, Manabu Okumura, and Masaaki Nagata. 2020. Top-down RST parsing utilizing granularity levels in documents. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8099–8106.

Naoki Kobayashi, Tsutomu Hirao, Kengo Nakamura, Hidetaka Kamigaito, Manabu Okumura, and Masaaki Nagata. 2019. Split or merge: Which is better for unsupervised RST parsing? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5801–5806.

Murathan Kurfalı and Robert Östling. 2019. Zero-shot transfer for implicit discourse relation classification. In Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue, pages 226–231.

Haejun Lee, Drew A. Hudson, Kangwook Lee, and Christopher D. Manning. 2020. SLM: Learning a discourse language representation with sentence unshuffling. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1551–1562.

Li Liang, Zheng Zhao, and Bonnie Webber. 2020. Extending implicit discourse relation recognition to the PDTB-3. In Proceedings of the First Workshop on Computational Approaches to Discourse, pages 135–147.

Xiang Lin, Shafiq Joty, Prathyusha Jwalapuram, and M Saiful Bari. 2019. A unified linear-time framework for sentence-level discourse parsing. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4190–4200.

Ziheng Lin, Min-Yen Kan, and Hwee Tou Ng. 2009. Recognizing implicit discourse relations in the Penn Discourse Treebank. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 343–351.

Linlin Liu, Xiang Lin, Shafiq Joty, Simeng Han, and Lidong Bing. 2019. Hierarchical pointer net parsing. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1006–1016.

Yang Liu, Sujian Li, Xiaodong Zhang, and Zhifang Sui. 2016. Implicit discourse relation classification via multi-task neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30.

Amandla Mabona, Laura Rimell, Stephen Clark, and Andreas Vlachos. 2019. Neural generative rhetorical structure parsing. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2284–2295.

William C. Mann and Sandra A. Thompson. 1987. Rhetorical Structure Theory: A theory of text organization. Technical Report RS-87-190, USC/Information Sciences Institute. Reprint series.

Daniel Marcu. 1999. Discourse trees are good indicators of importance in text. Advances in Automatic Text Summarization, 293:123–136.

Daniel Marcu. 2000. The Theory and Practice of Discourse Parsing and Summarization. MIT Press.

Mitchell Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank.

Thomas Meyer, Andrei Popescu-Belis, Sandrine Zufferey, and Bruno Cartoni. 2011. Multilingual annotation and disambiguation of discourse connectives for machine translation. In Proceedings of the 12th SIGdial Meeting on Discourse and Dialogue.

Mathieu Morey, Philippe Muller, and Nicholas Asher. 2017. How much progress have we made on RST discourse parsing? A replication study of recent results on the RST-DT.

Karthik Narasimhan and Regina Barzilay. 2015. Machine comprehension with discourse relations. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1253–1262.

Noriki Nishida and Hideki Nakayama. 2020. Unsupervised discourse constituency parsing using Viterbi EM. Transactions of the Association for Computational Linguistics, 8:215–230.

Emily Pitler and Ani Nenkova. 2008. Revisiting readability: A unified framework for predicting text quality. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 186–195.

Rashmi Prasad, Eleni Miltsakaki, Nikhil Dinesh, Alan Lee, and Aravind Joshi. 2007. The Penn Discourse Treebank 2.0 annotation manual.

Rashmi Prasad, Bonnie Webber, and Alan Lee. 2018. Discourse annotation in the PDTB: The next generation. In Proceedings of the 14th Joint ACL-ISO Workshop on Interoperable Semantic Annotation, pages 87–97, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind Joshi, and Bonnie Webber. 2008. The Penn Discourse Treebank 2.0. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08).

RC Schank and RP Abelson. 1977. Scripts, plans, goals and understanding: An inquiry into human knowledge structures.

Wei Shi and Vera Demberg. 2019. Next sentence prediction helps implicit discourse relation classification within and across domains. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5794–5800.

Maite Taboada. 2016. Sentiment analysis: An overview from linguistics.

Jianxiang Wang and Man Lan. 2015. A refined end-to-end discourse parser. In Proceedings of the Nineteenth Conference on Computational Natural Language Learning - Shared Task, pages 17–24.

Yizhong Wang, Sujian Li, and Houfeng Wang. 2017. A two-stage parsing method for text-level discourse analysis. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 184–188.

Florian Wolf and Edward Gibson. 2005. Representing discourse coherence: A corpus-based study. Computational Linguistics, 31(2):249–287.

Jiacheng Xu, Zhe Gan, Yu Cheng, and Jingjing Liu. 2019. Discourse-aware neural extractive text summarization. arXiv preprint arXiv:1910.14142.

Nianwen Xue, Hwee Tou Ng, Sameer Pradhan, Rashmi Prasad, Christopher Bryant, and Attapol Rutherford. 2015. The CoNLL-2015 shared task on shallow discourse parsing. In Proceedings of the Nineteenth Conference on Computational Natural Language Learning - Shared Task, pages 1–16, Beijing, China. Association for Computational Linguistics.

Nianwen Xue, Hwee Tou Ng, Sameer Pradhan, Attapol Rutherford, Bonnie Webber, Chuan Wang, and Hongmin Wang. 2016. CoNLL 2016 shared task on multilingual shallow discourse parsing. In Proceedings of the CoNLL-16 Shared Task, pages 1–19.

Amir Zeldes. 2017. The GUM corpus: Creating multilayer resources in the classroom. Language Resources and Evaluation, 51(3):581–612.

Deniz Zeyrek, Amália Mendes, Yulia Grishina, Murathan Kurfalı, Samuel Gibbon, and Maciej Ogrodniczuk. 2020. TED Multilingual Discourse Bank (TED-MDB): A parallel corpus annotated in the PDTB style. Language Resources and Evaluation, 54:587–613.

Deniz Zeyrek and Bonnie Webber. 2008. A discourse resource for Turkish: Annotating discourse connectives in the METU corpus. In Proceedings of the 6th Workshop on Asian Language Resources.

Longyin Zhang, Yuqing Xing, Fang Kong, Peifeng Li, and Guodong Zhou. 2020. A top-down neural architecture towards text-level parsing of discourse rhetorical structure. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6386–6395.

Yingxue Zhang, Fandong Meng, Peng Li, Ping Jian, and Jie Zhou. 2021. Context tracking network: Graph-based context modeling for implicit discourse relation recognition. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1592–1599.

Yang Zhong, Chao Jiang, Wei Xu, and Junyi Jessy Li. 2020. Discourse level factors for sentence deletion in text simplification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 9709–9716.

Yuping Zhou, Jill Lu, Jennifer Zhang, and Nianwen Xue. 2014. Chinese Discourse Treebank 0.5 LDC2014T21. Web Download. Philadelphia: Linguistic Data Consortium.
