Hierarchical Graph Convolutional Networks for Jointly Resolving Cross-document Coreference of Entity and Event Mentions

Duy Phung¹, Tuan Ngo Nguyen² and Thien Huu Nguyen²
¹ VinAI Research, Vietnam
² Department of Computer and Information Science, University of Oregon, Eugene, Oregon, USA
v.duypv1@vinai.io, {tnguyen,thien}@cs.uoregon.edu

Abstract

This paper studies the problem of cross-document event coreference resolution (CDECR) that seeks to determine if event mentions across multiple documents refer to the same real-world events. Prior work has demonstrated the benefits of predicate-argument information and document context for resolving the coreference of event mentions. However, such information has not been captured effectively in prior work for CDECR. To address these limitations, we propose a novel deep learning model for CDECR that introduces hierarchical graph convolutional neural networks (GCN) to jointly resolve entity and event mentions. As such, sentence-level GCNs enable the encoding of important context words for event mentions and their arguments while the document-level GCN leverages the interaction structures of event mentions and arguments to compute document representations to perform CDECR. Extensive experiments are conducted to demonstrate the effectiveness of the proposed model.

1 Introduction

Event coreference resolution (ECR) aims to cluster event-triggering expressions in text such that all event mentions in a group refer to the same unique event in the real world. We are interested in cross-document ECR (CDECR) where event mentions might appear in the same or different documents. For instance, consider the following sentences (event mentions) S1 and S2 that involve "leaving" and "left" (respectively) as event trigger words (i.e., the predicates):

S1: O'Brien was forced into the drastic step of leaving the 76ers.
S2: Jim O'Brien left the 76ers after one season as coach.

A CDECR system in this case would need to recognize that both event mentions in S1 and S2 refer to the same event.

A major challenge in CDECR involves the necessity to model entity mentions (e.g., "Jim O'Brien") that participate in events and reveal their spatio-temporal information (Yang et al., 2015) (called event arguments). In particular, as event mentions might be presented in different sentences/documents, an important piece of evidence for predicting the coreference of two event mentions is to realize that the two event mentions have the same participants in the real world and/or occur at the same location and time (i.e., same arguments).

Motivated by this intuition, prior work for CDECR has attempted to jointly resolve cross-document coreference for entities and events so the two tasks can mutually benefit from each other (iterative clustering) (Lee et al., 2012). In fact, this iterative and joint modeling approach has recently led to the state-of-the-art performance for CDECR (Barhom et al., 2019; Meged et al., 2020). Our model for CDECR follows this joint coreference resolution method; however, we advance it by introducing novel techniques to address two major limitations of previous work (Yang et al., 2015; Kenyon-Dean et al., 2018; Barhom et al., 2019), i.e., the inadequate mechanisms to capture the argument-related information for representing event mentions and the use of only lexical features to represent input documents.

As to the first limitation with the event argument-related evidence, existing methods for CDECR have mainly captured the direct information of event arguments for event mention representations, thus failing to explicitly encode other important context words in the sentences that reveal the fine-grained nature of the relations between arguments and triggers for ECR (Yang et al., 2015; Barhom et al., 2019). For instance, consider the coreference prediction between the event mentions in S1 and the following sentence S3 (with "leave" as the event trigger word):

S3: The baseball coach Jim O'Brien decided to leave the club on Thursday.

The arguments for the event mention in S3 involve "The baseball coach Jim O'Brien", "the club", and "Thursday". If an entity coreference resolution system considers the entity mention pairs ("O'Brien" in S1, "The baseball coach Jim O'Brien" in S3) and ("the 76ers" in S1 and "the club" in S3) as being coreferred, a CDECR system that only considers event arguments and their coreference information would incorrectly predict the coreference between the event mentions in S1 and S3 in this case. However, if the CDECR system further models the important words for the relations between event triggers and arguments (i.e., the words "was forced" in S1 and "decided" in S3), it can realize the unwillingness of the subject for the position-ending event in S1 and the subject's own intent to leave the position in S3. As such, this difference can help the system to reject the event coreference for S1 and S3.

To this end, we propose to explicitly identify and capture important context words for event triggers and arguments in representation learning for CDECR. In particular, our motivation is based on the shortest dependency paths between event triggers and arguments that have been used to reveal important context words for their relations (Li et al., 2013; Sha et al., 2018; Veyseh et al., 2020a, 2021). As an example, Figure 1 shows the dependency tree of S1 where the shortest dependency path between "O'Brien" and "leaving" can successfully include the important context word "forced". As such, for each event mention, we leverage the shortest dependency paths to build a pruned and argument-customized dependency tree to simultaneously contain event triggers, arguments and the important words in a single structure. Afterward, this structure will be exploited to learn richer representation vectors for CDECR.

[Figure 1: The pruned dependency tree for the event mention in S1. The trigger is red while argument heads are blue.]

Second, for document representations, previous work on CDECR has shown that input documents also provide useful context information for event mentions (e.g., document topics) to enhance the clustering performance (Kenyon-Dean et al., 2018). However, the document information is only captured via lexical features in prior work, e.g., TF-IDF vectors (Kenyon-Dean et al., 2018; Barhom et al., 2019), leading to poor generalization to unseen words/tokens and an inability to encapsulate latent semantic information for CDECR. To this end, we propose to learn distributed representation vectors for input documents to enrich event mention representations and improve the generalization of the models for CDECR. In particular, as entity and event mentions are the main objects of interest for CDECR, our motivation is to focus on the context information from these objects to induce document representations for the models. To implement this idea, we propose to represent input documents via interaction graphs between their entity and event mentions, serving as the structures to generate document representation vectors.

Based on those motivations, we introduce a novel hierarchical graph convolutional neural network (GCN) that involves two levels of GCN models to learn representation vectors for the iterative and joint model for CDECR. In particular, sentence-level GCNs consume the pruned dependency trees to obtain context-enriched representation vectors for event and entity mentions while a document-level GCN is run over the entity-event interaction graphs, leveraging the mention representations from the sentence-level GCNs as the inputs to generate document representations for CDECR. Extensive experiments show that the proposed model achieves state-of-the-art resolution performance for both entities and events on the ECB+ dataset. To our knowledge, this is the first work that utilizes GCNs and entity-event interaction graphs for coreference resolution.

2 Related Work

ECR is considered a more challenging task than entity coreference resolution due to the more complex structures of event mentions that require argument reasoning (Yang et al., 2015). Previous work for within-document event resolution includes pairwise classifiers (Ahn, 2006; Chen et al., 2009), spectral graph clustering methods (Chen and Ji, 2009), information propagation (Liu et al., 2014), Markov logic networks (Lu et al., 2016), and deep learning (Nguyen et al., 2016).
For only cross-document event resolution, prior work has considered mention-pair classifiers for coreference that use granularities of event slots and lexical features of event mentions as the features (Cybulska and Vossen, 2015b,a). Within- and cross-document event coreference have also been solved simultaneously in previous work (Lee et al., 2012; Bejan and Harabagiu, 2010; Adrian Bejan and Harabagiu, 2014; Yang et al., 2015; Choubey and Huang, 2017; Kenyon-Dean et al., 2018). The most related works to ours involve the joint models for entity and event coreference resolution that use contextualized word embeddings to capture the dependencies between the two tasks and lead to the state-of-the-art performance for CDECR (Lee et al., 2012; Barhom et al., 2019; Meged et al., 2020).

Finally, regarding the modeling perspective, our work is related to the models that use GCNs to learn representation vectors for different NLP tasks, e.g., event detection (Lai et al., 2020; Veyseh et al., 2019) and target opinion word extraction (Veyseh et al., 2020b), applying both sentence- and document-level graphs (Sahu et al., 2019; Tran et al., 2020; Nan et al., 2020; Tran and Nguyen, 2021; Nguyen et al., 2021). However, to our knowledge, none of the prior work has employed GCNs for ECR.

3 Model

Given a set of input documents D, the goal of CDECR is to cluster event mentions in the documents of D according to their coreference. Our model for CDECR follows (Barhom et al., 2019) in simultaneously clustering entity mentions in D to benefit from the inter-dependencies between entities and events for coreference resolution. In this section, we first describe the overall framework of our iterative method for joint entity and event coreference resolution based on (Barhom et al., 2019) (a summary of the framework is given in Algorithm 1). The novel hierarchical GCN model for inducing mention representation vectors (we use "mentions" to refer to both event and entity mentions) will be discussed afterward.

Iterative Clustering for CDECR: Following (Barhom et al., 2019), we first cluster the input document set D into different topics to improve the coreference performance (the set of document topics is called T). As we use the ECB+ dataset (Cybulska and Vossen, 2014) to evaluate the models in this work, the training phase directly utilizes the golden topics of the documents while the test phase applies the K-means algorithm for document clustering as in (Barhom et al., 2019). Afterward, given a topic t ∈ T with the corresponding document subset D_t ⊂ D, our CDECR model initializes the entity and event cluster configurations E_t^0 and V_t^0 (respectively) where: E_t^0 involves within-document clusters of the entity mentions in the documents in D_t, and V_t^0 simply puts each event mention presented in D_t into its own cluster (lines 2 and 3 in Algorithm 1). In the training phase, E_t^0 is obtained from the golden within-document coreference information of the entity mentions (to reduce noise) while the within-document entity mention clusters returned by Stanford CoreNLP (Manning et al., 2014) are used for E_t^0 in the test phase, following (Barhom et al., 2019). For convenience, the sets of entity and event mentions in D_t are called M_t^E and M_t^V respectively.

Algorithm 1 Training algorithm
 1: for t ∈ T do
 2:   E_t^0 ← within-document clusters of entity mentions
 3:   V_t^0 ← singleton event mentions in M_t^V
 4:   k ← 1
 5:   while ∃ meaningful cluster-pair merge do
 6:     // Entities
 7:     Generate entity mention representations R_E(m_i^e, V_t^{k-1}) for all m_i^e ∈ M_t^E
 8:     Compute entity mention-pair coreference scores S_E(m_i^e, m_j^e)
 9:     Train R_E and S_E using the gold entity mention clusters
10:     E_t^k ← agglomeratively cluster M_t^E based on S_E(m_i^e, m_j^e)
11:     // Events
12:     Generate event mention representations R_V(m_i^v, E_t^k) for all m_i^v ∈ M_t^V
13:     Compute event mention-pair coreference scores S_V(m_i^v, m_j^v)
14:     Train R_V and S_V using the gold event mention clusters
15:     V_t^k ← agglomeratively cluster M_t^V based on S_V(m_i^v, m_j^v)
16:     k ← k + 1
17:   end while
18: end for
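
For concreteness, the following minimal Python sketch mirrors the alternating loop of Algorithm 1. All callables (R_E, S_E, R_V, S_V, train_step, agglomerative_merge) and the topic container are hypothetical stand-ins for the components described in the rest of this section, not the released implementation.

from itertools import combinations

def train_topic(topic, R_E, S_E, R_V, S_V, train_step, agglomerative_merge):
    # Initial configurations (lines 2-4 of Algorithm 1).
    E = topic.within_doc_entity_clusters        # E_t^0
    V = [{m} for m in topic.event_mentions]     # V_t^0: singleton event clusters
    merged = True
    while merged:                                # line 5: stop when no merge helps
        # Entity pass: representations are conditioned on the event clusters V.
        reps = {m: R_E(m, V) for m in topic.entity_mentions}
        scores = {(a, b): S_E(reps[a], reps[b])
                  for a, b in combinations(topic.entity_mentions, 2)}
        train_step(scores, topic.gold_entity_pairs)           # line 9
        E, merged_e = agglomerative_merge(E, scores)           # line 10
        # Event pass: representations are conditioned on the new entity clusters E.
        reps = {m: R_V(m, E) for m in topic.event_mentions}
        scores = {(a, b): S_V(reps[a], reps[b])
                  for a, b in combinations(topic.event_mentions, 2)}
        train_step(scores, topic.gold_event_pairs)             # line 14
        V, merged_v = agglomerative_merge(V, scores)            # line 15
        merged = merged_e or merged_v
    return E, V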

Given the initial configurations, our iterative algorithm involves a sequence of clustering iterations, generating new cluster configurations E_t^k and V_t^k for entities and events (respectively) after each iteration. As such, each iteration k performs two independent clustering steps where entity mentions are clustered first to produce E_t^k, followed by event mention clustering to obtain V_t^k (alternating the clustering). Starting with the entity clustering at iteration k, each entity mention m_i^e is first transformed into a representation vector R_E(m_i^e, V_t^{k-1}) (line 7 in Algorithm 1) that is conditioned on not only the specific context of m_i^e but also the current event cluster configuration V_t^{k-1} (i.e., to capture the event-entity inter-dependencies). Afterward, a scoring function S_E(m_i^e, m_j^e) is used to compute the coreference probability/score for each pair of entity mentions, leveraging their mention representations R_E(m_i^e, V_t^{k-1}) and R_E(m_j^e, V_t^{k-1}) at the current iteration (R_E and S_E will be discussed in the next section) (line 8). An agglomerative clustering algorithm then utilizes these coreference scores to cluster the entity mentions, leading to a new configuration E_t^k for entity mention clusters (line 10). Given E_t^k, the same procedure is applied to cluster the event mentions for iteration k, including: (i) obtaining an event representation vector R_V(m_i^v, E_t^k) for each event mention m_i^v based on the current entity cluster configuration E_t^k, (ii) computing coreference scores for the event mention pairs via the scoring function S_V(m_i^v, m_j^v), and (iii) performing agglomerative clustering for the event mentions to produce the new configuration V_t^k for event clusters (lines 12, 13, and 15).

Note that during the training process, the parameters of the representation and scoring functions for entities R_E and S_E (or for events with R_V and S_V) are updated/optimized after the coreference scores are computed for all the entity (or event) mention pairs. This corresponds to lines 9 and 14 in Algorithm 1 (not used in the test phase). In particular, the loss function to optimize R_E and S_E in line 9 is based on the cross-entropy over every pair of entity mentions (m_i^e, m_j^e) in M_t^E:

L_t^{ent,coref} = -Σ_{m_i^e ≠ m_j^e} [ y_ij^{ent} log S_E(m_i^e, m_j^e) + (1 - y_ij^{ent}) log(1 - S_E(m_i^e, m_j^e)) ]

where y_ij^{ent} is a golden binary variable indicating whether m_i^e and m_j^e corefer or not (symmetrically for L_t^{env,coref} to train R_V and S_V in line 14). Also, we use the predicted configurations E_t^k and V_t^{k-1} in the mention representation computation (instead of the golden clusters as in y_ij^{ent} for the loss functions) during both the training and test phases to achieve consistency (Barhom et al., 2019).
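
For illustration, this pair-level loss amounts to a standard binary cross-entropy in PyTorch; the score and label tensors below are hypothetical inputs, not actual model outputs.

import torch
import torch.nn.functional as F

def pairwise_coref_loss(pair_scores, gold_labels):
    # L_t^{ent,coref} (or its event counterpart): pair_scores holds the
    # S_E outputs in (0, 1) for every mention pair and gold_labels holds
    # the y_ij indicators (1 if the two mentions corefer).
    return F.binary_cross_entropy(pair_scores, gold_labels.float())

# Example with three entity mention pairs:
loss = pairwise_coref_loss(torch.tensor([0.9, 0.2, 0.7]),
                           torch.tensor([1.0, 0.0, 1.0]))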

Finally, we note that the agglomerative clustering in each iteration of our model also starts from the initial clusters as in E_t^0 and V_t^0, and greedily merges the cluster pairs with the highest cluster-pair scores until all the scores are below a predefined threshold. In this way, the algorithm first focuses on high-precision merging operations and postpones less precise ones until more information is available. The cluster-pair score S_C(c_i, c_j) for two clusters c_i and c_j (mention sets) at some algorithm step is based on averaging mention-linkage coreference scores: S_C(c_i, c_j) = 1/(|c_i||c_j|) Σ_{m_i ∈ c_i, m_j ∈ c_j} S_*(m_i, m_j) (Barhom et al., 2019), where * can be E or V depending on whether c_i and c_j are entity or event clusters (respectively).
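
A simplified sketch of this greedy merging procedure is given below; it merges the single best cluster pair per step, which is one common way to realize the description above, and pair_score and the threshold value are placeholders rather than the actual configuration.

from itertools import combinations

def cluster_pair_score(ci, cj, pair_score):
    # S_C(c_i, c_j): average of mention-pair coreference scores across clusters.
    total = sum(pair_score(mi, mj) for mi in ci for mj in cj)
    return total / (len(ci) * len(cj))

def agglomerative_merge(clusters, pair_score, threshold=0.5):
    # Greedily merge the highest-scoring cluster pair until no pair exceeds
    # the predefined threshold (0.5 here is only an illustrative value).
    clusters = [set(c) for c in clusters]
    while True:
        best, best_score = None, threshold
        for i, j in combinations(range(len(clusters)), 2):
            s = cluster_pair_score(clusters[i], clusters[j], pair_score)
            if s > best_score:
                best, best_score = (i, j), s
        if best is None:
            return clusters
        i, j = best
        clusters[i] |= clusters[j]
        del clusters[j]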

Mention Representations: Let m be a mention (event or entity) in a sentence W = w_1, w_2, ..., w_n of n words (w_i is the i-th word) where w_a is the head word of m. To prepare W for the mention representation computation and achieve a fair comparison with (Barhom et al., 2019), we first convert each word w_i ∈ W into a vector x_i using the ELMo embeddings (Peters et al., 2018). Here, x_i is obtained by running the pre-trained ELMo model over W and averaging the hidden vectors for w_i at the three layers of ELMo. This transforms W into a sequence of vectors X = x_1, x_2, ..., x_n for the next steps. The mention representations in our work are based on two major elements, i.e., the modeling of important context words for event triggers and arguments, and the induction of document representations.
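
For illustration, the layer-averaged vectors X can be produced as below; elmo_layers stands in for any embedder that returns the three ELMo layers for a sentence and is an assumption, not part of the paper.

import numpy as np

def sentence_vectors(tokens, elmo_layers):
    # elmo_layers(tokens) is assumed to return the hidden states of the
    # pre-trained ELMo model with shape (3, len(tokens), dim).
    layers = np.asarray(elmo_layers(tokens))
    return layers.mean(axis=0)      # X = x_1, ..., x_n with shape (n, dim)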

(i) Modeling Important Context Words: A motivation for representation learning in our model is to capture event arguments and important context words (for the relations between event triggers and arguments) to enrich the event mention representations. In this work, we employ a symmetric intuition to compute a representation vector for an entity mention m, aiming to encode associated predicates (i.e., event triggers that accept m as an argument in W) and important context words (for the relations between the entity mention and associated event triggers). As such, following (Barhom et al., 2019), we first identify the attached arguments (if m is an event mention) or predicates (if m is an entity mention) in W for m using a semantic role labeling (SRL) system. In particular, we focus on four semantic roles of interest: Arg0, Arg1, Location, and Time. For convenience, let A_m = {w_{i_1}, ..., w_{i_o}} ⊂ W be the set of head words of the attached event arguments or event triggers for m in W based on the SRL system and the four roles (o is the number of head words). In particular, if m is an event mention, A_m would involve the head words of the entity mentions that fill the four semantic roles for m in W. In contrast, if m is an entity mention, A_m would capture the head words of the event triggers that take m as an argument with one of the four roles in W. Afterward, to encode the important context words for the relations between m and the words in A_m, we employ the shortest dependency paths P_{i_j} between the head word w_a of m and the words w_{i_j} ∈ A_m. As such, starting with the dependency tree G^m = {N^m, E^m} of W (N^m involves the words in W), we build a pruned tree Ĝ^m = {N̂^m, Ê^m} of G^m to explicitly focus on the mention m and its important context words in the paths P_{i_j}. Concretely, the node set N̂^m of Ĝ^m contains all the words in the paths P_{i_j} (N̂^m = ∪_{j=1..o} P_{i_j}) while the edge set Ê^m preserves the dependency connections in G^m of the words in N̂^m (Ê^m = {(a, b) | a, b ∈ N̂^m, (a, b) ∈ E^m}). As such, the pruned tree Ĝ^m helps to gather the relevant context words for the coreference resolution of m and organize them into a single dependency graph, serving as a rich structure to learn a representation vector for m (e.g., in Figure 1). If A_m is empty, the pruned tree Ĝ^m only contains the head word of m.
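
The pruned, argument-customized tree can be assembled from the parse and the SRL head words roughly as follows; this is a sketch using networkx, and the input formats are assumptions rather than the authors' code.

import networkx as nx

def pruned_tree(dep_edges, mention_head, arg_heads):
    # dep_edges: (head_index, dependent_index) pairs of the dependency parse of W;
    # mention_head: index of the head word w_a of m; arg_heads: indices of the
    # head words in A_m returned by the SRL system.
    g = nx.Graph()                    # treat the dependency tree as undirected
    g.add_edges_from(dep_edges)
    keep = {mention_head}             # if A_m is empty, only the head survives
    for a in arg_heads:
        keep.update(nx.shortest_path(g, source=mention_head, target=a))
    # Keep only the dependency edges whose two endpoints both survive the pruning.
    pruned_edges = [(u, v) for u, v in dep_edges if u in keep and v in keep]
    return keep, pruned_edges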

(ii) Graph Convolutional Networks: The context words and structure in Ĝ^m suggest the use of Graph Convolutional Networks (GCN) (Kipf and Welling, 2017; Nguyen and Grishman, 2018) to learn the representation vector for m (called the sentence-level GCN). In particular, the GCN model in this work involves several layers (i.e., L layers in our case) to compute the representation vectors for the nodes in the graph Ĝ^m at different abstraction levels. The input vector h_v^0 for a node v ∈ N̂^m of the GCN is set to the corresponding ELMo-based vector in X. After L layers, we obtain the hidden vectors h_v^L (in the last layer) for the nodes v ∈ N̂^m. We call sent(m) = [h_{v_m}^L, max_pool(h_v^L | v ∈ N̂^m)] the sentence-level GCN vector that will be used later to represent m (v_m is the node corresponding to the head word of m in N̂^m).
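
A generic PyTorch sketch of such a sentence-level GCN is shown below; the layer parameterization (a single linear transform with degree normalization) is an assumption in the spirit of Kipf and Welling (2017), not the exact architecture used in the experiments.

import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, h, adj):
        # h: (n, dim) node vectors; adj: (n, n) adjacency matrix of the pruned
        # tree with self-loops added.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        return torch.relu(self.linear((adj @ h) / deg))

def sent_representation(x_nodes, adj, head_node, layers):
    # sent(m) = [h^L_{v_m}; max-pool over the pruned-tree nodes].
    h = x_nodes                       # ELMo vectors of the pruned-tree nodes
    for layer in layers:              # L stacked GCNLayer modules
        h = layer(h, adj)
    return torch.cat([h[head_node], h.max(dim=0).values])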

(iii) GCN Interaction: Our discussion of the sentence-level GCN so far has been agnostic to whether m is an event or entity mention, and the straightforward approach is to apply the same GCN model for both entity and event mentions. However, this approach might limit the flexibility of the GCN model to focus on the aspects of information that are specific to each coreference resolution task (i.e., events or entities). For example, event coreference might need to weight the information from event arguments and important context words more heavily than entity coreference does (Yang et al., 2015). To this end, we propose to apply two separate sentence-level GCN models for event and entity mentions, which share the architecture but differ in their parameters, to enhance representation learning (called G_ent and G_env for entities and events respectively). In addition, to introduce a new source of training signals and promote knowledge transfer between the two GCN networks, we propose to regularize the GCN models so they produce similar representation vectors for the same input sentence W. In particular, we apply both GCN models G_ent and G_env over the full dependency tree G^m, using the ELMo-based vectors X for W as the input (we tried the pruned dependency tree Ĝ^m in this regularization, but the full dependency tree G^m led to better results). This produces the hidden vectors h_1^ent, ..., h_n^ent and h_1^env, ..., h_n^env in the last layers of G_ent and G_env for W. Afterward, we compute two versions of representation vectors for W based on max-pooling: h^ent = max_pool(h_1^ent, ..., h_n^ent) and h^env = max_pool(h_1^env, ..., h_n^env). Finally, the mean squared difference between h^ent and h^env is introduced into the overall loss function to regularize the models: L_m^reg = ||h^ent - h^env||_2^2. As this loss is specific to a mention m, it is computed independently for event and entity mentions and added into the corresponding training loss (i.e., lines 9 or 14 in Algorithm 1). In particular, the overall training loss for entity mention coreference resolution in line 9 of Algorithm 1 is L_t^ent = α_ent L_t^{ent,coref} + (1 - α_ent) Σ_{m^e ∈ M_t^E} L_{m^e}^reg while that for event mention coreference resolution is L_t^env = α_env L_t^{env,coref} + (1 - α_env) Σ_{m^v ∈ M_t^V} L_{m^v}^reg (line 14). Here, α_ent and α_env are trade-off parameters.
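
The regularizer can be expressed directly from the pooled node vectors of the two GCNs; the snippet below is a sketch of that computation, with the mixing of the coreference loss and the regularizers left as a comment since the α values are tuned hyper-parameters.

import torch

def interaction_regularizer(h_ent_nodes, h_env_nodes):
    # L_m^reg = || max_pool(G_ent states) - max_pool(G_env states) ||_2^2,
    # computed over the full dependency tree of the sentence containing m.
    h_ent = h_ent_nodes.max(dim=0).values
    h_env = h_env_nodes.max(dim=0).values
    return torch.sum((h_ent - h_env) ** 2)

# Overall loss (entities): alpha_ent * coref_loss + (1 - alpha_ent) * sum of
# interaction_regularizer over the entity mentions in M_t^E (and symmetrically
# for events with alpha_env).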

(iv) Document Representation: As motivated in the introduction, we propose to learn representation vectors for input documents to enrich mention representations and improve the generalization of the models (over the lexical features for documents). As such, our principle is to employ entity and event mentions (the main objects of interest in CDECR) and their interactions/structures to represent input documents (i.e., documents as interaction graphs of entity and event mentions). Given the mention m of interest and its corresponding document d ∈ D_t, we start by building an interaction graph G^doc = {N^doc, E^doc} where the node set N^doc involves all the entity and event mentions in d. For the edge set E^doc, we leverage two types of information to connect the nodes in N^doc: (i) predicate-argument information: we establish a link between the nodes for an event mention x and an entity mention y in N^doc if y is an argument of x for one of the four semantic roles, i.e., Arg0, Arg1, Location, and Time (identified by the SRL system), and (ii) entity mention coreference: we connect any pair of nodes in N^doc that correspond to two coreferring entity mentions in d (using the gold within-document entity coreference in training and the predicted one in testing, as in E_t^0). In this way, G^doc helps to emphasize the important objects of d and enables intra- and inter-sentence interactions between event mentions (via entity mentions/arguments) to produce effective document representations for CDECR.

In the next step, we feed the interaction graph G^doc into a GCN model G_doc (called the document-level GCN), using the sentence-level GCN-based representations of the entity and event mentions in N^doc (i.e., sent(m)) as the initial vectors for the nodes (hence a hierarchical GCN model). The hidden vectors produced by the last layer of G_doc for the nodes in G^doc are called {h_u^d | u ∈ N^doc}. Finally, we obtain the representation vector doc(m) for d based on max-pooling: doc(m) = max_pool(h_u^d | u ∈ N^doc).
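
A sketch of the construction of G^doc is given below; the containers for mentions, SRL links, and entity clusters are hypothetical, and the document-level GCN is then run over these nodes and edges with sent(m) as the initial node vectors.

def document_graph(mentions, srl_links, entity_clusters):
    # mentions: all entity and event mentions in the document d;
    # srl_links: (event_mention, entity_mention) pairs for Arg0/Arg1/Location/Time;
    # entity_clusters: within-document entity coreference clusters
    # (gold in training, predicted in testing).
    nodes = list(mentions)
    edges = set()
    # (i) predicate-argument edges from the SRL system.
    for event_m, entity_m in srl_links:
        edges.add((event_m, entity_m))
    # (ii) edges between coreferring entity mentions.
    for cluster in entity_clusters:
        members = list(cluster)
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                edges.add((members[i], members[j]))
    return nodes, edges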

(v) Final Representation: Given the representation vectors learned so far, we form the final representation vector for m (R_E(m, V_t^{k-1}) or R_V(m, E_t^k) in lines 7 or 12 of Algorithm 1) by concatenating the following vectors:

(1) The sentence- and document-level GCN-based representation vectors for m (i.e., sent(m) and doc(m)).

(2) The cluster-based representation cluster(m) = [Arg0_m, Arg1_m, Location_m, Time_m]. Taking Arg0_m as an example, it is computed by considering the mention m' that is associated with m via the semantic role Arg0 in W using the SRL system. Here, m' is an event mention if m is an entity mention and vice versa (Arg0_m = 0 if m' does not exist). As such, let c be the cluster in the current configuration (i.e., V_t^{k-1} or E_t^k) that contains m'. We then obtain Arg0_m by averaging the ELMo-based vectors (i.e., X = x_1, ..., x_n) of the head words of the mentions in c: Arg0_m = 1/|c| Σ_{q ∈ c} x_{head(q)} (head(q) is the index of the head word of mention q in W). Note that as the current entity cluster configuration E_t^k is used to generate the cluster-based representations if m is an event mention (and vice versa), it serves as the main mechanism that enables the two coreference tasks to interact and benefit from each other. These vectors are inherited from (Barhom et al., 2019) for a fair comparison.

Finally, given two mentions m_1 and m_2, and their corresponding representation vectors R(m_1) and R(m_2) (as computed above), the coreference score functions S_E and S_V send the concatenated vector [R(m_1), R(m_2), R(m_1) ⊙ R(m_2)] to two-layer feed-forward networks (separate ones for S_E and S_V) that involve a sigmoid function at the end to produce the coreference score for m_1 and m_2. Here, ⊙ is the element-wise product. This completes the description of our CDECR model.
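
For concreteness, the scoring functions and the final concatenation can be sketched as follows; the hidden size and other details are illustrative assumptions rather than the reported configuration.

import torch
import torch.nn as nn

class PairScorer(nn.Module):
    # Two-layer feed-forward scorer (S_E or S_V) over the mention-pair vector
    # [R(m1); R(m2); R(m1) * R(m2)]; the hidden size 512 is only an example.
    def __init__(self, rep_dim, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 * rep_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, r1, r2):
        pair = torch.cat([r1, r2, r1 * r2], dim=-1)
        return self.net(pair).squeeze(-1)

def final_representation(sent_m, doc_m, cluster_m):
    # R(m): concatenation of the sentence-level, document-level, and
    # cluster-based vectors described above.
    return torch.cat([sent_m, doc_m, cluster_m], dim=-1)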

4 Experiments

Dataset: We use the ECB+ dataset (Cybulska and Vossen, 2014) to evaluate the CDECR models in this work. Note that ECB+ is the largest dataset with both within- and cross-document annotation for the coreference of entity and event mentions so far. We follow the setup and split for this dataset in prior work to ensure a fair comparison (Cybulska and Vossen, 2014; Kenyon-Dean et al., 2018; Barhom et al., 2019; Meged et al., 2020). In particular, this setup employs the annotation subset that has been validated for correctness by (Cybulska and Vossen, 2014) and involves a larger portion of the dataset for training. In ECB+, only a part of the mentions are annotated. This setup thus utilizes gold-standard event and entity mentions in the evaluation and does not require special treatment for unannotated mentions (Barhom et al., 2019).

Note that there is a different setup for ECB+ that is applied in (Yang et al., 2015) and (Choubey and Huang, 2017). In this setup, the full ECB+ dataset is employed, including the portions with known annotation errors. At test time, such prior work utilizes the predicted mentions from a mention extraction tool (Yang et al., 2015). To handle the partial annotation in ECB+, those prior works only evaluate the systems on the predicted mentions that are also annotated as gold mentions. However, as shown by (Upadhyay et al., 2016), this ECB+ setup has several limitations (e.g., the ignorance of clusters with a single mention and the separate evaluation for each sub-topic). Following (Barhom et al., 2019), we thus do not evaluate the systems on this setup, i.e., not comparing our model with the models in (Yang et al., 2015) and (Choubey and Huang, 2017) due to the incompatibility.

Hyper-Parameters: To achieve a fair comparison, we utilize the preprocessed data and extend the implementation of the model in (Barhom et al., 2019) to include our novel hierarchical GCN model. The development dataset of ECB+ is used to tune the hyper-parameters of the proposed model (called HGCN). The selected values and the resources for our model are reported in Appendix A.

Comparison: Following (Barhom et al., 2019), we compare HGCN with the following baselines:

(i) LEMMA (Barhom et al., 2019): This first clusters documents into topics and then groups event mentions that are in the same document clusters and share the head lemmas.

(ii) CV (Cybulska and Vossen, 2015b): This is a supervised learning method for CDECR that leverages discrete features to represent event mentions and documents. We compare with the best reported results for this method as in (Barhom et al., 2019).

(iii) KCP (Kenyon-Dean et al., 2018): This is a neural network model for CDECR. Both event mentions and documents are represented via word embeddings and hand-crafted binary features.

(iv) C-KCP (Barhom et al., 2019): This is the KCP model that is retrained and tuned using the same document clusters in the test phase as in (Barhom et al., 2019) and our model.

(v) BSE (Barhom et al., 2019): This is a joint resolution model for cross-document coreference of entity and event mentions, using ELMo to compute representations for the mentions.

(vi) BSE-DJ (Barhom et al., 2019): This is a variant of BSE that does not use the cluster-based representations cluster(m) in the mention representations, thus performing event and entity coreference resolution separately.

(vii) MCS (Meged et al., 2020): An extension of BSE where some re-ranking features are included.

Note that BSE and MCS are the current state-of-the-art (SOTA) models for CDECR on ECB+. For cross-document entity coreference resolution, we compare our model with the LEMMA and BSE models in (Barhom et al., 2019), the only works that report the performance for entity mentions on ECB+ so far. Following (Barhom et al., 2019), we use the common coreference resolution metrics to evaluate the models in this work, including MUC (Vilain et al., 1995), B3, CEAF-e (Luo, 2005), and CoNLL F1 (the average of the three previous metrics). The official CoNLL scorer in (Pradhan et al., 2014) is employed to compute these metrics.

Tables 1 and 2 show the performance (F1 scores) of the models for cross-document resolution of event and entity mentions (respectively). Note that we also report the performance of a variant (called HGCN-DJ) of the proposed HGCN model where the cluster-based representations cluster(m) are excluded (thus separately doing event and entity resolution as in BSE-DJ).

Model                             MUC    B3     CEAF-e   CoNLL
LEMMA (Barhom et al., 2019)       78.1   77.8   73.6     76.5
CV (Cybulska and Vossen, 2015b)   73.0   74.0   64.0     73.0
KCP (Kenyon-Dean et al., 2018)    69.0   69.0   69.0     69.0
C-KCP (Barhom et al., 2019)       73.4   75.9   71.5     73.6
BSE-DJ (Barhom et al., 2019)      79.4   80.4   75.9     78.5
BSE (Barhom et al., 2019)         80.9   80.3   77.3     79.5
MCS (Meged et al., 2020)          81.6   80.5   77.8     80.0
HGCN-DJ                           81.3   80.8   76.1     79.4
HGCN (proposed)                   83.1   82.1   78.8     81.3

Table 1: The cross-document event coreference resolution performance (F1) on the ECB+ test set.

Model                             MUC    B3     CEAF-e   CoNLL
LEMMA (Barhom et al., 2019)       76.7   65.6   60.0     67.4
BSE-DJ (Barhom et al., 2019)      78.7   69.9   61.6     70.0
BSE (Barhom et al., 2019)         79.7   70.5   63.3     71.2
HGCN-DJ                           80.1   70.7   62.2     71.0
HGCN (proposed)                   82.1   71.7   63.4     72.4

Table 2: The cross-document entity coreference resolution performance (F1) on the ECB+ test set.

As can be seen, HGCN outperforms all the baseline models on both entity and event coreference resolution (over different evaluation metrics). In particular, the CoNLL F1 score of HGCN for event coreference is 1.3% better than that of MCS (the prior SOTA model) while the CoNLL F1 improvement of HGCN over BSE (the prior SOTA model for entity coreference on ECB+) is 1.2%. These performance gaps are significant with p < 0.001, thus demonstrating the effectiveness of the proposed model for CDECR. Importantly, HGCN is significantly better than BSE, the most direct baseline of the proposed model, on both entity and event coreference regardless of whether the cluster-based representations cluster(m) for joint entity and event resolution are used or not. This testifies to the benefits of the proposed hierarchical model for representation learning for CDECR. Finally, we evaluate the full HGCN model when ELMo embeddings are replaced with BERT embeddings (Devlin et al., 2019), leading to CoNLL F1 scores of 79.7% and 72.3% for event and entity coreference (respectively). This performance is either worse (for events) or comparable (for entities) to that of ELMo, thus showing the advantages of ELMo for our tasks on ECB+.

Ablation Study: To demonstrate the benefits of the proposed components for our CDECR model, we evaluate three groups of ablated/varied models for HGCN. First, for the effectiveness of the pruned dependency tree Ĝ^m and the sentence-level GCN models G_ent and G_env, we consider the following baselines: (i) HGCN-Sentence GCNs: this model removes the sentence-level GCN models G_ent and G_env from HGCN and directly feeds the ELMo-based vectors X of the mention heads into the document-level GCN G_doc (the representation vector sent(m) is thus not included), and (ii) HGCN-Pruned Tree: this model replaces the pruned dependency tree Ĝ^m with the full dependency tree G^m in the computation. Second, for the advantage of the GCN interaction between G_ent and G_env, we examine two baselines: (iii) HGCN with One Sent GCN: this baseline only uses one sentence-level GCN model for both entity and event mentions (the regularization loss L_m^reg for G_ent and G_env is thus not used either), and (iv) HGCN-L_m^reg: this baseline still uses two sentence-level GCN models but excludes the regularization term L_m^reg from the training losses. Finally, for the benefits of the document-level GCN G_doc, we study the following variants: (v) HGCN-G_doc: this model removes the document-level GCN model from HGCN, thus excluding the document representations doc(m) from the mention representations, (vi) HGCN-G_doc+TFIDF: this model also excludes G_doc, but it includes the TF-IDF vectors for documents, based on uni-, bi- and tri-grams, in the mention representations (inherited from (Kenyon-Dean et al., 2018)), and (vii) HGCN-G_doc+MP: instead of using the GCN model G_doc, this model aggregates the mention representation vectors produced by the sentence-level GCNs (sent(m)) to obtain the document representations doc(m) using max-pooling. Table 3 presents the performance of the models for event coreference on the ECB+ test set.

Model                     MUC    B3     CEAF-e   CoNLL
HGCN (full)               83.1   82.1   78.8     81.3
HGCN-Sentence GCNs        79.7   80.7   76.4     79.0
HGCN-Pruned Tree          81.8   81.1   77.1     80.0
HGCN with One Sent GCN    82.1   81.0   76.3     79.8
HGCN-L_m^reg              81.6   81.6   77.6     80.3
HGCN-G_doc                82.6   81.8   76.7     80.4
HGCN-G_doc+TFIDF          82.2   81.0   78.4     80.5
HGCN-G_doc+MP             82.0   81.1   77.0     80.0

Table 3: The CDECR F1 scores on the ECB+ test set.

It is clear from the table that all the ablated baselines are significantly worse than the full model HGCN (with p < 0.001), thereby confirming the necessity of the proposed GCN models (the sentence-level GCNs with pruned trees and the document-level GCN) and the GCN interaction mechanism for HGCN and CDECR. In addition, the same trends in model performance are also observed for entity coreference in this ablation study (the results are shown in Appendix B), thus further demonstrating the benefits of the proposed components in this work. Finally, we show the distribution of the error types of our HGCN model in Appendix C for future improvement.

5 Conclusion

We present a model to jointly resolve the cross-document coreference of entity and event mentions. Our model introduces a novel hierarchical GCN that captures both sentence and document context for the representations of entity and event mentions. In particular, we design pruned dependency trees to capture important context words for the sentence-level GCNs while interaction graphs between entity and event mentions are employed for the document-level GCN. In the future, we plan to explore better mechanisms to identify important context words for CDECR.

Acknowledgments

This research has been supported by the Army Research Office (ARO) grant W911NF-21-1-0112. This research is also based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA Contract No. 2019-19051600006 under the Better Extraction from Text Towards Enhanced Retrieval (BETTER) Program. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ARO, ODNI, IARPA, the Department of Defense, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein. This document does not contain technology or technical data controlled under either the U.S. International Traffic in Arms Regulations or the U.S. Export Administration Regulations.

References

Cosmin Adrian Bejan and Sanda Harabagiu. 2014. Unsupervised event coreference resolution. In Computational Linguistics.
David Ahn. 2006. The stages of event extraction. In Proceedings of the Workshop on Annotating and Reasoning about Time and Events.
Shany Barhom, Vered Shwartz, Alon Eirew, Michael Bugert, Nils Reimers, and Ido Dagan. 2019. Revisiting joint modeling of cross-document entity and event coreference resolution. In ACL.
Cosmin Bejan and Sanda Harabagiu. 2010. Unsupervised event coreference resolution with rich linguistic features. In ACL.
Zheng Chen and Heng Ji. 2009. Graph-based event coreference resolution. In Proceedings of the 2009 Workshop on Graph-based Methods for Natural Language Processing (TextGraphs-4).
Zheng Chen, Heng Ji, and Robert Haralick. 2009. A pairwise event coreference model, feature impact and evaluation for event coreference resolution. In Proceedings of the Workshop on Events in Emerging Text Types.
Prafulla Kumar Choubey and Ruihong Huang. 2017. Event coreference resolution by iteratively unfolding inter-dependencies among events. In EMNLP.
Agata Cybulska and Piek Vossen. 2014. Using a sledgehammer to crack a nut? Lexical diversity and event coreference resolution. In LREC.
Agata Cybulska and Piek Vossen. 2015a. Translating granularity of event slots into features for event coreference resolution. In Proceedings of the 3rd Workshop on EVENTS: Definition, Detection, Coreference, and Representation.
Agata Cybulska and Piek T. J. M. Vossen. 2015b. "Bag of events" approach to event coreference resolution. Supervised classification of event templates. In Int. J. Comput. Linguistics Appl.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT.
Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.
Kian Kenyon-Dean, Jackie Chi Kit Cheung, and Doina Precup. 2018. Resolving event coreference with supervised representation learning and clustering-oriented regularization. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics.
Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In ICLR.
Viet Dac Lai, Tuan Ngo Nguyen, and Thien Huu Nguyen. 2020. Event detection: Gate diversity and syntactic importance scores for graph convolution neural networks. In EMNLP.
Heeyoung Lee, Marta Recasens, Angel Chang, Mihai Surdeanu, and Dan Jurafsky. 2012. Joint entity and event coreference resolution across documents. In EMNLP.
Qi Li, Heng Ji, and Liang Huang. 2013. Joint event extraction via structured prediction with global features. In ACL.
Zhengzhong Liu, Jun Araki, Eduard Hovy, and Teruko Mitamura. 2014. Supervised within-document event coreference using information propagation. In LREC.
Jing Lu, Deepak Venugopal, Vibhav Gogate, and Vincent Ng. 2016. Joint inference for event coreference resolution. In COLING.
Xiaoqiang Luo. 2005. On coreference resolution performance metrics. In EMNLP.
Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In ACL System Demonstrations.
Yehudit Meged, Avi Caciularu, Vered Shwartz, and Ido Dagan. 2020. Paraphrasing vs coreferring: Two sides of the same coin. In ArXiv abs/2004.14979.
Guoshun Nan, Zhijiang Guo, Ivan Sekulic, and Wei Lu. 2020. Reasoning with latent structure refinement for document-level relation extraction. In ACL.
Minh Van Nguyen, Viet Dac Lai, and Thien Huu Nguyen. 2021. Cross-task instance representation interactions and label dependencies for joint information extraction with graph convolutional networks. In NAACL-HLT.
Thien Huu Nguyen, Adam Meyers, and Ralph Grishman. 2016. New York University 2016 system for KBP event nugget: A deep learning approach. In TAC.
Thien Huu Nguyen and Ralph Grishman. 2018. Graph convolutional networks with argument-aware pooling for event detection. In AAAI.
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. In Journal of Machine Learning Research.
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In NAACL-HLT.
Sameer Pradhan, Xiaoqiang Luo, Marta Recasens, Eduard Hovy, Vincent Ng, and Michael Strube. 2014. Scoring coreference partitions of predicted mentions: A reference implementation. In ACL.
Sunil Kumar Sahu, Fenia Christopoulou, Makoto Miwa, and Sophia Ananiadou. 2019. Inter-sentence relation extraction with document-level graph convolutional neural network. In ACL.
Lei Sha, Feng Qian, Baobao Chang, and Zhifang Sui. 2018. Jointly extracting event triggers and arguments by dependency-bridge RNN and tensor-based argument interaction. In AAAI.
M. Surdeanu, Lluís Màrquez i Villodre, X. Carreras, and P. Comas. 2007. Combination strategies for semantic role labeling. In Journal of Artificial Intelligence Research.
Hieu Minh Tran, Minh Trung Nguyen, and Thien Huu Nguyen. 2020. The dots have their values: Exploiting the node-edge connections in graph-based neural models for document-level relation extraction. In EMNLP Findings.
Minh Tran and Thien Huu Nguyen. 2021. Graph convolutional networks for event causality identification with rich document-level structures. In NAACL-HLT.
Shyam Upadhyay, Nitish Gupta, Christos Christodoulopoulos, and Dan Roth. 2016. Revisiting the evaluation for cross document event coreference. In COLING.
Amir Pouran Ben Veyseh, Franck Dernoncourt, Quan Hung Tran, Varun Manjunatha, Lidan Wang, Rajiv Jain, Doo Soon Kim, Walter Chang, and Thien Huu Nguyen. 2021. Inducing rich interaction structures between words for document-level event argument extraction. In Proceedings of the 25th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD).
Amir Pouran Ben Veyseh, Thien Huu Nguyen, and Dejing Dou. 2019. Graph based neural networks for event factuality prediction using syntactic and semantic structures. In ACL.
Amir Pouran Ben Veyseh, Tuan Ngo Nguyen, and Thien Huu Nguyen. 2020a. Graph transformer networks with syntactic and semantic structures for event argument extraction. In EMNLP Findings.
Amir Pouran Ben Veyseh, Nasim Nouri, Franck Dernoncourt, Dejing Dou, and Thien Huu Nguyen. 2020b. Introducing syntactic structures into target opinion word extraction with deep learning. In EMNLP.
Marc Vilain, John Burger, John Aberdeen, Dennis Connolly, and Lynette Hirschman. 1995. A model-theoretic coreference scoring scheme. In Sixth Message Understanding Conference (MUC-6).
Bishan Yang, Claire Cardie, and Peter Frazier. 2015. A hierarchical distance-dependent Bayesian model for event coreference resolution. In TACL.