Capturing Event Argument Interaction via A Bi-Directional Entity-Level Recurrent Decoder

Xi Xiangyu1,2, Wei Ye1,†, Shikun Zhang1,†, Quanxiu Wang3, Huixing Jiang2, Wei Wu2
1 National Engineering Research Center for Software Engineering, Peking University, Beijing, China
2 Meituan Group, Beijing, China
3 RICH AI, Beijing, China
{xixy,wye,zhangsk}@pku.edu.cn
† Corresponding authors.

Abstract

Capturing interactions among event arguments is an essential step towards robust event argument extraction (EAE). However, existing efforts in this direction suffer from two limitations: 1) the argument role type information of contextual entities is mainly utilized as training signals, ignoring the potential merits of directly adopting it as semantically rich input features; 2) the argument-level sequential semantics, which implies the overall distribution pattern of argument roles over an event mention, is not well characterized. To tackle these two bottlenecks, we formalize EAE as a Seq2Seq-like learning problem for the first time, where a sentence with a specific event trigger is mapped to a sequence of event argument roles. A neural architecture with a novel Bi-directional Entity-level Recurrent Decoder (BERD) is proposed to generate argument roles by incorporating contextual entities' argument role predictions, like a word-by-word text generation process, thereby distinguishing implicit argument distribution patterns within an event more accurately.

1 Introduction

Event argument extraction (EAE), which aims to identify the entities serving as event arguments and classify the roles they play in an event, is a key step towards event extraction (EE). For example, given that the word "fired" triggers an Attack event in the sentence "In Baghdad, a cameraman died when an American tank fired on the Palestine Hotel", EAE needs to identify that "Baghdad", "cameraman", "American tank", and "Palestine Hotel" are arguments with Place, Target, Instrument, and Target as their roles respectively.

Recently, deep learning models have been widely applied to event argument extraction and have achieved significant progress (Chen et al., 2015; Nguyen et al., 2016; Sha et al., 2018; Yang et al., 2019; Wang et al., 2019b; Zhang et al., 2020; Du and Cardie, 2020). Many efforts have been devoted to improving EAE by better characterizing argument interaction, and they fall into two paradigms. The first, named inter-event argument interaction in this paper, concentrates on mining information about the target entity (candidate argument) in the context of other event instances (Yu et al., 2011; Nguyen et al., 2016), e.g., the evidence that a Victim argument of a Die event is often the Target argument of an Attack event in the same sentence. The second is intra-event argument interaction, which exploits the relationship of the target entity with other entities in the same event instance (Yu et al., 2011; Sha et al., 2016, 2018). We focus on the second paradigm in this paper.

Despite their promising results, existing methods for capturing intra-event argument interaction suffer from two bottlenecks.

(1) The argument role type information of contextual entities is underutilized. As two representative explorations, dBRNN (Sha et al., 2018) uses an intermediate tensor layer to capture latent interaction between candidate arguments, and RBPB (Sha et al., 2016) estimates whether two candidate arguments belong to the same event, which serves as a constraint on a Beam-Search-based prediction algorithm. Generally, these works use the argument role type information of contextual entities as auxiliary supervision signals during training to refine the input representations. However, one intuitive observation is that the argument role types can be utilized straightforwardly as semantically rich input features, like how we use entity type information. To verify this intuition, we conduct an experiment on the ACE 2005 English corpus, in which CNN (Nguyen and Grishman, 2015) is utilized as a baseline. For an entity, we incorporate the ground-truth roles of its contextual arguments into the baseline model's input representation, obtaining the model CNN(w. role type). As expected, CNN(w. role type) outperforms CNN significantly, as shown in Table 1.¹

Model               P     R     F1
CNN                 57.8  55.0  56.3
CNN(w. role type)   59.8  60.3  60.0

Table 1: Experimental results of CNN and its variant on ACE 2005.

¹ In the experiment we skip the event detection phase and directly assume all the triggers are correctly recognized.

The challenge of this method lies in knowing the ground-truth roles of contextual entities in the inference (or testing) phase, which is one possible reason why existing works have not investigated this direction. Here we can simply use predicted argument roles to approximate the corresponding ground truth at inference time. We believe that the noise brought by prediction is tolerable, considering the stimulating effect of using argument roles directly as input.

(2) The distribution pattern of multiple argument roles within an event is not well characterized. For events with many entities, distinguishing the overall appearance patterns of argument roles is essential to making accurate role predictions. In dBRNN (Sha et al., 2018), however, there is no specific design involving constraints or interaction among multiple prediction results, though the argument representation fed into the final classifier is enriched with synthesized information (the tensor layer) from other arguments. RBPB (Sha et al., 2016) explicitly leverages simple correlations inside each argument pair, ignoring more complex interactions in the whole argument sequence. Therefore, we need a more reliable way to learn the sequential semantics of argument roles in an event.

To address the above two challenges, we formalize EAE as a Seq2Seq-like learning problem (Bahdanau et al., 2014) of mapping a sentence with a specific event trigger to a sequence of event argument roles. To fully utilize both left- and right-side argument role information, inspired by the bi-directional decoder for machine translation (Zhang et al., 2018), we propose a neural architecture with a novel Bi-directional Entity-level Recurrent Decoder (BERD) to generate event argument roles entity by entity. The predicted argument role of an entity is fed into the decoding module for the next or previous entity recurrently, like a text generation process. In this way, BERD can identify candidate arguments in a way that is more consistent with the implicit distribution pattern of multiple argument roles within a sentence, similar to text generation models that learn to generate word sequences following certain grammatical rules or text styles.

The contributions of this paper are:

1. We formalize the task of event argument extraction as a Seq2Seq-like learning problem for the first time, where a sentence with a specific event trigger is mapped to a sequence of event argument roles.

2. We propose a novel architecture with a Bi-directional Entity-level Recurrent Decoder (BERD) that is capable of leveraging the argument role predictions of left- and right-side contextual entities and distinguishing the overall distribution pattern of argument roles.

3. Extensive experimental results show that our proposed method outperforms several competitive baselines on the widely-used ACE 2005 dataset. BERD's superiority is more significant given more entities in a sentence.

2 Problem Formulation

Most previous works formalize EAE as either a word-level sequence labeling problem (Nguyen et al., 2016; Zeng et al., 2016; Yang et al., 2019) or an entity-oriented classic classification problem (Chen et al., 2015; Wang et al., 2019b). We formalize EAE as a Seq2Seq-like learning problem as follows. Let S = {w_1, ..., w_n} be a sentence, where n is the sentence length and w_i is the i-th token. Also, let E = {e_1, ..., e_k} be the entity mentions in the sentence, where k is the number of entities. Given that an event triggered by t ∈ S is detected in the event detection (ED) stage, EAE needs to map the sentence with the event to a sequence of argument roles R = {y_1, ..., y_k}, where y_i denotes the argument role that entity e_i plays in the event.

3 The Proposed Approach

We employ an encoder-decoder architecture for the problem defined above, which from a high-level perspective is similar to most Seq2Seq models in machine translation (Vaswani et al., 2017; Zhang et al., 2018), automatic text summarization (Song et al., 2019; Shi et al., 2021), and speech recognition (Tüske et al., 2019; Hannun et al., 2019).
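To make the formulation in Section 2 concrete, the following is a minimal sketch of the input/output structure for one event mention, using the running example from the introduction; the field names and token spans are illustrative and not part of the original paper.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class EAEInstance:
    """One event mention framed as a Seq2Seq-like example (Section 2).

    The input is a sentence S = (w_1, ..., w_n) with a detected trigger t and
    k candidate entity mentions E = (e_1, ..., e_k); the target is the role
    sequence R = (y_1, ..., y_k), one role per entity (N/A for non-arguments).
    """
    tokens: List[str]                    # w_1 ... w_n
    trigger_span: Tuple[int, int]        # token span of t (here: "fired")
    event_type: str                      # e.g. "ATTACK", appended to the encoder input
    entity_spans: List[Tuple[int, int]]  # spans of e_1 ... e_k, left to right
    roles: List[str]                     # y_1 ... y_k, aligned with entity_spans

# The running example from the introduction (spans are illustrative).
example = EAEInstance(
    tokens="In Baghdad , a cameraman died when an American tank "
           "fired on the Palestine Hotel".split(),
    trigger_span=(10, 11),
    event_type="ATTACK",
    entity_spans=[(1, 2), (4, 5), (8, 10), (13, 15)],
    roles=["Place", "Target", "Instrument", "Target"],
)
assert len(example.entity_spans) == len(example.roles)  # one role per entity
```

The role sequence always has the same length as the entity list, which is what makes the fixed-length, entity-level decoding of Section 3 possible.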
 
Figure 1: The detailed architecture of our proposed approach. The figure depicts a concrete case where a sentence contains an Attack event (triggered by "fired") and 4 candidate arguments {e_1, e_2, e_3, e_4}. The encoder on the left converts the sentence into intermediate continuous representations. Then the forward decoder and backward decoder generate the argument role sequences in a left-to-right and a right-to-left manner (denoted by →y_i and ←y_i) respectively. A classifier is finally adopted to make the final prediction y_i. The forward and backward decoders share the instance feature extractor and generate the same instance representation x_i for the i-th entity. The histograms in green and brown denote the probability distributions generated by the forward decoder and backward decoder respectively. The orange histogram denotes the final predictions. Note that the histograms are for illustration only and do not represent the true probability distributions.

In particular, as Figure 1 shows, our architecture consists of an encoder that converts the sentence S with a specific event trigger into intermediate vectorized representations and a decoder that generates a sequence of argument roles entity by entity. The decoder is an entity-level recurrent network whose number of decoding steps is fixed and equal to the number of entities in the corresponding sentence. At each decoding step, we feed the prediction results of the previously-processed entity into the recurrent unit to make a prediction for the current entity. Since the predicted results of both left- and right-side entities can be potentially valuable information, we further incorporate a bidirectional decoding mechanism that integrates a forward decoding process and a backward decoding process effectively.

3.1 Encoder

Given the sentence S = (w_1, ..., w_n) containing a trigger t ∈ S and k candidate arguments E = {e_1, ..., e_k}, an encoder is adopted to encode the word sequence into a sequence of continuous representations as follows²:

H = (h_1, ..., h_n) = F(w_1, ..., w_n)    (1)

where F(·) is the neural network used to encode the sentence. In this paper, we select BERT (Devlin et al., 2019) as the encoder. Since the representation H does not contain event type information, which is essential for predicting argument roles, we append a special phrase denoting the event type of t to each input sequence, such as "# ATTACK #".

² The trigger word can be detected by any event detection model, which is not the scope of this paper.
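As an illustration of Section 3.1, here is a minimal sketch of how the encoder input could be assembled with the Hugging Face transformers BERT interface. Attaching the event-type phrase "# ATTACK #" as a second segment, and all helper names, are assumptions on our part rather than details taken from the paper.

```python
import torch
from transformers import BertTokenizerFast, BertModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")

def encode_sentence(tokens, event_type):
    """Encode S with an event-type phrase appended, giving H = F(w_1, ..., w_n) (Eq. 1).

    The paper appends a special phrase such as "# ATTACK #" so that H carries
    event type information; feeding it as a second segment is an assumption.
    """
    event_phrase = f"# {event_type} #"
    enc = tokenizer(" ".join(tokens), event_phrase, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**enc)
    # One contextual vector per WordPiece (including special tokens and the event phrase).
    H = out.last_hidden_state.squeeze(0)   # shape: (sequence_length, hidden_size)
    return H

H = encode_sentence("an American tank fired on the Palestine Hotel".split(), "ATTACK")
```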
3.2 Decoder

Different from traditional token-level Seq2Seq models, we use a bi-directional entity-level recurrent decoder (BERD) with a classifier to generate a sequence of argument roles entity by entity. BERD consists of a forward and a backward recurrent decoder, which share the same recurrent unit architecture described below.

3.2.1 Recurrent Unit

The recurrent unit is designed to explicitly utilize two kinds of information: (1) the instance information, which contains the sentence, event, and candidate argument (denoted by S, t, e); and (2) the contextual argument information, which consists of the argument roles of other entities (denoted by A). The recurrent unit exploits two corresponding feature extractors.

Instance Feature Extractor. Given the representation H generated by the encoder, dynamic multi-pooling (Chen et al., 2015) is applied to extract the max values of the three parts split by the event trigger and the candidate argument. The three hidden embeddings are aggregated into an instance feature representation x as follows:

[x_{1,p_t}]_i = max{[h_1]_i, ..., [h_{p_t}]_i}
[x_{p_t+1,p_e}]_i = max{[h_{p_t+1}]_i, ..., [h_{p_e}]_i}
[x_{p_e+1,n}]_i = max{[h_{p_e+1}]_i, ..., [h_n]_i}    (2)
x = [x_{1,p_t}; x_{p_t+1,p_e}; x_{p_e+1,n}]

where [·]_i is the i-th value of a vector, and p_t, p_e are the positions of the trigger t and the candidate argument e.³

Argument Feature Extractor. To incorporate previously-generated arguments, we exploit a CNN over the contextual argument information A to produce an argument feature representation x^a. The instance feature x and the argument feature x^a are concatenated as the input feature representation for the argument role classifier, which estimates the role that e plays in the event as follows:

p = f(W[x; x^a] + b)
o = Softmax(p)    (5)

where W and b are weight parameters and o is the probability distribution over the role label space. For the sake of simplicity, in the rest of the paper we use Unit(S, t, e, A) to represent the calculation of the probability distribution o by the recurrent unit with S, t, e, A as inputs.

3.2.2 Forward Decoder

Given the sentence S with k candidate arguments E = {e_1, ..., e_k}, the forward decoder exploits the above recurrent unit and generates the argument role sequence in a left-to-right manner. The conditional probability of the argument role sequence is calculated as follows:

P(R | E, S, t) = ∏_{i=1}^{k} p(y_i | e_i; R_{<i}, S, t)

where R_{<i} denotes the argument roles already predicted for the entities preceding e_i. At the i-th step, the forward decoder computes →y_i = Unit(S, t, e_i, →A_i), where →A_i contains the previously-predicted roles R_{<i}, and the argument feature extracted by its recurrent unit is denoted as →x^a_i.
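The recurrent unit can be sketched roughly as follows in PyTorch. The dynamic multi-pooling step follows Equation 2; since the excerpt only summarizes the argument feature extractor, the 1-D CNN over contextual role embeddings and the omission of the activation f are assumptions. The unit is written to also return its argument feature x^a, which the final classifier of Section 3.2.4 reuses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecurrentUnit(nn.Module):
    """Sketch of one decoding step, Unit(S, t, e, A) -> role distribution o (Eq. 5)."""

    def __init__(self, hidden_size, num_roles, role_emb_size=10,
                 num_filters=300, kernel_size=3):
        super().__init__()
        self.role_emb = nn.Embedding(num_roles, role_emb_size)
        # Assumed argument feature extractor: a 1-D CNN over contextual role embeddings.
        self.arg_cnn = nn.Conv1d(role_emb_size, num_filters, kernel_size,
                                 padding=kernel_size // 2)
        # W, b of Eq. 5, acting on the concatenation [x; x^a].
        self.classifier = nn.Linear(3 * hidden_size + num_filters, num_roles)

    def instance_feature(self, H, p_t, p_e):
        """Dynamic multi-pooling (Eq. 2): max-pool the three parts split at t and e.

        Assumes 0 <= p_t < p_e < len(H) - 1; boundary handling is omitted for brevity.
        """
        parts = (H[: p_t + 1], H[p_t + 1 : p_e + 1], H[p_e + 1 :])
        return torch.cat([part.max(dim=0).values for part in parts], dim=-1)

    def forward(self, H, p_t, p_e, context_roles):
        x = self.instance_feature(H, p_t, p_e)                 # instance feature from (S, t, e)
        emb = self.role_emb(context_roles).t().unsqueeze(0)    # (1, role_emb_size, k)
        x_a = self.arg_cnn(emb).max(dim=-1).values.squeeze(0)  # argument feature x^a from A
        logits = self.classifier(torch.cat([x, x_a], dim=-1))  # Eq. 5 (activation f omitted)
        return F.softmax(logits, dim=-1), x_a
```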
3.2.3 Backward Decoder

Similarly, the backward decoder generates the argument role sequence in a right-to-left manner. The probability distribution over the role label space for the i-th entity e_i is calculated as follows:

←y_i = Unit(S, t, e_i, ←A_i)    (9)

where ←A_i denotes the contextual argument information of the i-th decoding step, which contains the previously-predicted argument roles R_{>i}. We update ←A_{i-1} by labeling e_i as g(←y_i) for the next step i−1. The argument feature extracted by the recurrent units of the backward decoder is denoted as ←x^a_i.

3.2.4 Classifier

To utilize both left- and right-side argument information, a classifier is then adopted to combine the argument features of both decoders and make the final prediction for each entity e_i as follows:

p_i = f(W_c[x_i; →x^a_i; ←x^a_i] + b_c)
y_i = Softmax(p_i)    (10)

where y_i denotes the final probability distribution for e_i, and W_c and b_c are weight parameters.

3.3 Training and Optimization

As seen, the forward decoder and backward decoder in BERD mainly play two important roles. The first is to yield intermediate argument features for the final classifier, and the second is to make the initial predictions fed into the argument feature extractor. Since the initial predictions of the two decoders are crucial to generating accurate argument features, we need to optimize their own classifiers in addition to the final classifier.

We use →p(y_i|e_i) and ←p(y_i|e_i) to represent the probability of e_i playing role y_i estimated by the forward and backward decoder respectively, and p(y_i|e_i) to denote the final estimated probability of e_i playing role y_i given by Equation 10. The optimization objective function is defined as follows:

J(θ) = − Σ_{S∈D} Σ_{t∈S} Σ_{e_i∈E_S} [ α log p(y_i|e_i; R_{≠i}, S, t) + β log →p(y_i|e_i) + γ log ←p(y_i|e_i) ]    (11)

where D denotes the training set and t ∈ S denotes the trigger word detected by the preceding event detection model in sentence S. E_S represents the entity mentions in S. α, β and γ are the weights for the losses of the final classifier, forward decoder and backward decoder respectively.

During training, we apply the teacher forcing mechanism, where gold argument information is fed into BERD's recurrent units, which enables parallel computation and greatly accelerates the training process. Once the model is trained, we first use the forward decoder with a greedy search to sequentially generate a sequence of argument roles in a left-to-right manner. Then, the backward decoder performs decoding in the same way but in a right-to-left manner. Finally, the classifier combines both left- and right-side argument features and makes a prediction for each entity as Equation 10 shows.
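A rough sketch of the inference procedure described above is shown below, assuming recurrent units shaped like the earlier sketch (returning both the role distribution and the argument feature) and an assumed final_classifier layer implementing Equation 10; at training time the two loops would instead be fed gold roles (teacher forcing), weighted by the α, β, γ losses of Equation 11.

```python
import torch
import torch.nn.functional as F

NA_ROLE = 0   # assumed index of the "not an argument" / unfilled role label

def decode_event(H, p_t, entity_positions, fwd_unit, bwd_unit, final_classifier):
    """Greedy bidirectional decoding over the k candidate arguments of one event."""
    k = len(entity_positions)

    # Forward pass: left to right, feeding each prediction back into the next step.
    fwd_roles = torch.full((k,), NA_ROLE, dtype=torch.long)
    fwd_feats = [None] * k
    for i, p_e in enumerate(entity_positions):
        o, x_a = fwd_unit(H, p_t, p_e, fwd_roles)      # Unit(S, t, e_i, A_i)
        fwd_roles[i] = o.argmax()
        fwd_feats[i] = x_a

    # Backward pass: right to left, same unit architecture with separate parameters.
    bwd_roles = torch.full((k,), NA_ROLE, dtype=torch.long)
    bwd_feats = [None] * k
    for i in reversed(range(k)):
        o, x_a = bwd_unit(H, p_t, entity_positions[i], bwd_roles)
        bwd_roles[i] = o.argmax()
        bwd_feats[i] = x_a

    # Final classifier (Eq. 10): combine the instance feature with both argument features.
    predictions = []
    for i, p_e in enumerate(entity_positions):
        x_i = fwd_unit.instance_feature(H, p_t, p_e)
        p_i = final_classifier(torch.cat([x_i, fwd_feats[i], bwd_feats[i]], dim=-1))
        predictions.append(int(F.softmax(p_i, dim=-1).argmax()))
    return predictions
```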
4 Experiments

4.1 Experimental Setup

4.1.1 Dataset

Following most works on EAE (Nguyen et al., 2016; Sha et al., 2018; Yang et al., 2019; Du and Cardie, 2020), we evaluate our models on the most widely-used ACE 2005 dataset, which contains 599 documents annotated with 33 event subtypes and 35 argument roles. We use the same test set containing 40 newswire documents, a development set containing 30 randomly selected documents, and a training set with the remaining 529 documents.

We notice that Wang et al. (2019b) used the TAC KBP dataset, which we cannot access online or acquire from the authors due to copyright. We believe that experimenting with settings consistent with most related works (e.g., 27 out of 37 top papers used only the ACE 2005 dataset in the last four years) should yield convincing empirical results.

4.1.2 Hyperparameters

We adopt BERT (Devlin et al., 2019) as the encoder and the proposed bi-directional entity-level recurrent decoder as the decoder for the experiments. The hyperparameters used in the experiments are listed below.

BERT. The hyperparameters of BERT are the same as those of the BERT-Base model.⁴ We use a dropout probability of 0.1 on all layers.

Argument Feature Extractor. The dimensions of the word embedding, position embedding, event type embedding and argument role embedding for each token are 100, 5, 5, and 10 respectively. We utilize 300 convolution kernels with size 3. The GloVe embeddings (Pennington et al., 2014) are utilized to initialize the word embeddings.⁵

Training. Adam with a learning rate of 6e-05, β1 = 0.9, β2 = 0.999, L2 weight decay of 0.01 and learning rate warmup of 0.1 is used for optimization. We set the training epochs and batch size to 40 and 30 respectively. Besides, we apply dropout with rate 0.5 on the concatenated feature representations. The loss weights α, β and γ are set to 1.0, 0.5 and 0.5 respectively.

⁴ https://github.com/google-research/bert
⁵ https://nlp.stanford.edu/projects/glove/
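For reference, the reported optimization settings could be wired up roughly as follows with PyTorch and the transformers scheduler utilities; the use of AdamW and a linear warmup schedule is an assumed reconstruction, not the authors' training script.

```python
import torch
from transformers import get_linear_schedule_with_warmup

# Reported settings (Section 4.1.2); the wiring below is an assumed reconstruction.
EPOCHS, BATCH_SIZE, WARMUP_PROP = 40, 30, 0.1
LOSS_WEIGHTS = {"final": 1.0, "forward": 0.5, "backward": 0.5}   # alpha, beta, gamma

def build_optimizer(model, num_training_steps):
    optimizer = torch.optim.AdamW(
        model.parameters(), lr=6e-5, betas=(0.9, 0.999), weight_decay=0.01
    )
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(WARMUP_PROP * num_training_steps),
        num_training_steps=num_training_steps,
    )
    return optimizer, scheduler
```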
4.2 Baselines

We compare our method against the following four baselines. The first two are state-of-the-art models that predict each argument separately without considering argument interaction. We also implement two variants of DMBERT that utilize the latest inter-event and intra-event argument interaction methods, named BERT(Inter) and BERT(Intra) respectively.

1. DMBERT, which adopts BERT as the encoder and generates a representation for each entity mention based on the dynamic multi-pooling operation (Wang et al., 2019a). The candidate arguments are predicted separately.

2. HMEAE, which utilizes the concept hierarchy of argument roles and hierarchical modular attention for event argument extraction (Wang et al., 2019b).

3. BERT(Inter), which enhances DMBERT with the inter-event argument interaction adopted by Nguyen et al. (2016). Memory matrices are introduced to store dependencies among event triggers and argument roles.

4. BERT(Intra), which incorporates the intra-event argument interaction adopted by Sha et al. (2018) into DMBERT. The tensor layer and self-matching attention matrix with the same settings are applied in the experiments.

Following previous work (Wang et al., 2019b), we use a pipelined approach for event extraction and implement DMBERT as the event detection model. The same event detection model is used for all the baselines to ensure a fair comparison.

Note that Nguyen et al. (2016) use the last word to represent an entity mention⁶, which may lead to insufficient semantic information and inaccurate evaluation, considering that entity mentions may consist of multiple words and overlap with each other. We instead sum the hidden embeddings of all words when collecting lexical features for each entity mention.

⁶ Sha et al. (2018) doesn't introduce the details.

Model         P     R     F1
DMBERT        56.9  57.4  57.2
HMEAE         62.2  56.6  59.3
BERT(Inter)   58.4  57.1  57.8
BERT(Intra)   56.4  61.2  58.7
BERD          59.1  61.5  60.3

Table 2: Overall performance on ACE 2005 (%).

4.3 Main Results

The performance of BERD and the baselines is shown in Table 2 (improvements are statistically significant with p < 0.05), from which we have several main observations. (1) Compared with the latest best-performing baseline HMEAE, our method BERD achieves an absolute improvement of 1.0 F1, clearly achieving competitive performance. (2) Incorporating argument interactions brings significant improvements over vanilla DMBERT. For example, BERT(Intra) gains a 1.5 F1 improvement over DMBERT, which has the same architecture except for argument interaction. (3) Intra-event argument interaction brings more benefit than inter-event interaction (57.8 for BERT(Inter) v.s. 58.7 for BERT(Intra) v.s. 60.3 for BERD). (4) Compared with BERT(Inter) and BERT(Intra), our proposed BERD achieves the most significant improvements. We attribute this solid enhancement to BERD's novel Seq2Seq-like architecture that effectively exploits the argument roles of contextual entities.

4.4 Effect of Entity Numbers

To further investigate how our method improves performance, we conduct a comparison and analysis of the effect of entity numbers. Specifically, we first divide the event instances of the test set into subsets based on the number of entities in an event. Since events with a specific number of entities may be too few, results on subsets covering a range of entity numbers yield more robust and convincing conclusions. To make the number of events in all subsets as balanced as possible, we finally arrive at a division into four subsets, whose entity numbers fall in the ranges [1,3], [4,6], [7,9], and [10,] and whose event quantities account for 28.4%, 28.2%, 25.9%, and 17.5% respectively.

The performance of all models on the four subsets is shown in Figure 2, from which we can observe a general trend that BERD outperforms the other baselines more significantly if more entities appear in an event.
More entities usually mean more complex contextual information for a candidate argument, which leads to performance degradation. BERD alleviates this degradation better because of its capability of capturing the argument role information of contextual entities. We notice that BERT(Intra) also outperforms DMBERT significantly on Subset-4, which demonstrates the effectiveness of intra-event argument interaction.

Note that the performance on Subset-1 is worse than that on Subset-2, which looks like an outlier. The reason is that the performance of the first-stage event detection model on Subset-1 is much poorer (e.g., an F1 score of 32.8 for events with one entity).

Figure 2: Comparison on four subsets with different ranges of entity numbers ([1,3], [4,6], [7,9], and [10,]). F1-Score (%) is listed.

4.5 Effect of Overlapping Entity Mentions

Though the performance improvement can be easily observed, it is nontrivial to quantitatively verify how BERD captures the distribution pattern of multiple argument roles within an event. In this section, we partly investigate this problem by exploring the effect of overlapping entities. Since usually only one entity among multiple overlapping entities serves as an argument, we believe sophisticated EAE models should identify this pattern. Therefore, we divide the test set into two subsets (Subset-O and Subset-N) based on whether an event contains overlapping entity mentions and check all models' performance on these two subsets. Table 3 shows the results, from which we can find that all baselines perform worse on Subset-O. This is a natural result, since multiple overlapping entities usually have similar representations, making the pattern mentioned above challenging to capture. BERD performs well on both Subset-O and Subset-N, and its superiority on Subset-O over the baselines is more significant. We attribute this to BERD's capability of distinguishing argument distribution patterns.

Model         Subset-O  Subset-N
DMBERT        56.4      59.4
HMEAE         58.8      59.6
BERT(Inter)   57.3      58.8
BERT(Intra)   58.5      59.2
BERD          60.5      60.1

Table 3: Comparison on sentences with and without overlapping entities (Subset-O v.s. Subset-N). F1-Score (%) is listed.

4.6 Effect of the Bidirectional Decoding

To further investigate the effectiveness of the bidirectional decoding process, we exclude the backward decoder or the forward decoder from BERD and obtain two models with only a unidirectional decoder, whose performance is shown in the lines "-w/ Forward Decoder" and "-w/ Backward Decoder" in Table 4. From the results, we can observe that: (1) When decoding with only the forward or the backward decoder, the performance decreases by 1.6 and 1.3 F1 respectively. The results clearly demonstrate the superiority of the bidirectional decoding mechanism. (2) Though the two model variants suffer a performance degradation, they still outperform DMBERT significantly, once again verifying that exploiting contextual argument information, even in only one direction, is beneficial to EAE.

Model                       P     R     F1
DMBERT                      56.9  57.4  57.2
BERD                        59.1  61.5  60.3
-w/ Forward Decoder         58.0  59.4  58.7
-w/ Backward Decoder        58.3  59.8  59.0
-w/ Forward Decoder x2      56.8  61.1  58.9
-w/ Backward Decoder x2     57.2  61.0  59.1
-w/o Recurrent Mechanism    55.3  60.0  57.4

Table 4: Ablation study on ACE 2005 dataset (%).

Considering that the number of model parameters is decreased by excluding the forward/backward decoder, we build another two model variants with two decoders of the same direction (denoted by "-w/ Forward Decoder x2" and "-w/ Backward Decoder x2"), whose parameter numbers are exactly equal to BERD's. Table 4 shows that the two enlarged single-direction models have performance similar to their original versions.
We can conclude that the improvement comes from the complementation of the two decoders with different directions, rather than from the increased number of model parameters.

Besides, we exclude the recurrent mechanism by preventing the argument role predictions of contextual entities from being fed into the decoding module, obtaining another model variant named "-w/o Recurrent Mechanism". The performance degradation clearly shows the value of the recurrent decoding process that incorporates argument role information.

4.7 Case Study and Error Analysis

To promote understanding of our method, we demonstrate three concrete examples in Figure 3. Sentence S1 contains a Transport event triggered by "sailing". DMBERT and BERT(Intra) assign the Destination role to the candidate arguments "the perilous Strait of Gibraltar", "the southern mainland" and "the Canary Islands out in the Atlantic", the first two of which are mislabeled. It is an unusual pattern for a Transport event to contain multiple destinations. DMBERT and BERT(Intra) fail to recognize such patterns, showing that they cannot well capture this type of correlation among prediction results. Our BERD, however, leverages previous predictions to generate argument roles entity by entity in a sentence, successfully avoiding this unusual pattern.

S2 contains a Transport event triggered by "visited", and 4 nested entities exist in the phrase "Ankara police chief Ercument Yilmaz". Since these nested entities share the same sentence context, it is not strange that DMBERT wrongly predicts such entities as the same argument role Artifact. Thanks to the bidirectional entity-level recurrent decoder, our method can recognize the distribution pattern of arguments better and hence correctly identifies these nested entities as false instances. In this case, BERD reduces 3 false-positive predictions compared with DMBERT, confirming the results and analysis of Table 3.

As a qualitative error analysis, the last example S3 demonstrates that incorporating previous predictions may also lead to an error propagation problem. S3 contains a Marry event triggered by "marry". The entity "home" is mislabeled with the Time-Within role by BERD, and this wrong prediction is then used as an argument feature when identifying the entity "later in the afternoon", whose role is Time-Within. As analyzed in the first case, BERD tends to avoid repetitive roles in a sentence, leading to this entity incorrectly being predicted as N/A.

S1: Tens of thousands of destitute Africans try to enter Spain illegally each year by crossing the perilous Strait of Gibraltar to reach the southern mainland or by sailing northwest to the Canary Islands out in the Atlantic
(the perilous Strait of Gibraltar, N/A, Destination(✗), Destination(✗), N/A)
(the southern mainland, N/A, Destination(✗), Destination(✗), N/A)
(the Canary Islands out in the Atlantic, Destination, Destination, Destination, Destination)

S2: Ankara police chief Ercument Yilmaz visited the site of the morning blast but refused to say if a bomb had caused the explosion
(Ankara, N/A, Artifact(✗), N/A, N/A)
(Ankara police, N/A, Artifact(✗), N/A, N/A)
(Ankara police chief, N/A, Artifact(✗), Artifact(✗), N/A)
(Ankara police chief Ercument Yilmaz, Artifact, Artifact, Artifact, Artifact)

S3: Prison authorities have given the nod for Anwar to be taken home later in the afternoon to marry his eldest daughter, Nurul Izzah, to engineer Raja Ahmad Sharir Iskandar in a traditional Malay ceremony, he said
(home, Place, Place, Place, Time-Within(✗))
(later in the afternoon, Time-Within, Time-Within, Time-Within, N/A(✗))

Figure 3: Case study. Entities and triggers are highlighted in green and purple respectively. Each tuple (E, G, P1, P2, P3) denotes the predictions for an entity E with gold label G, where P1, P2 and P3 denote the predictions of DMBERT, BERT(Intra) and BERD respectively. Incorrect predictions are denoted by a mark (✗).

5 Related Work

We have covered research on EAE in Section 1; the related work that inspires our technical design is mainly introduced in the following.

Though our recurrent decoder is entity-level, our bidirectional decoding mechanism is inspired by bidirectional decoders in token-level Seq2Seq models, e.g., for machine translation (Zhou et al., 2019), speech recognition (Chen et al., 2020) and scene text recognition (Gao et al., 2019).

We formalize the task of EAE as a Seq2Seq-like learning problem instead of a classic classification problem or sequence labeling problem. There are also some works performing classification or sequence labeling in a Seq2Seq manner in other fields. For example, Yang et al. (2018) formulate the multi-label classification task as a sequence generation problem to capture the correlations between labels. Daza and Frank (2018) explore an encoder-decoder model for semantic role labeling. We are the first to employ a Seq2Seq-like architecture to solve the EAE task.

6 Conclusion

We have presented BERD, a neural architecture with a Bidirectional Entity-level Recurrent Decoder that achieves competitive performance on the task of event argument extraction (EAE). One main characteristic that distinguishes our techniques
from previous works is that we formalize EAE as a Seq2Seq-like learning problem instead of a classic classification or sequence labeling problem. The novel bidirectional decoding mechanism enables BERD to utilize both the left- and right-side argument predictions effectively and to generate a sequence of argument roles that better follows the overall distribution patterns over a sentence.

As pioneering research that introduces a Seq2Seq-like architecture into the EAE task, BERD also faces some open questions. For example, since we use gold argument roles as prediction results during training, how to alleviate the resulting exposure bias problem is worth investigating. We are also interested in incorporating our techniques into more sophisticated models that jointly extract triggers and arguments.

Acknowledgements

We thank the anonymous reviewers for their valuable comments. This research was supported by the National Key Research and Development Program of China (No. 2019YFB1405802) and the central government guided local science and technology development fund projects (science and technology innovation base projects) No. 206Z0302G.
innovation base projects) No. 206Z0302G. Jeffrey Pennington, Richard Socher, and Christopher D
 Manning. 2014. Glove: Global vectors for word rep-
 resentation. In Proceedings of the 2014 conference
 on empirical methods in natural language process-
References ing (EMNLP), pages 1532–1543.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben-
 Lei Sha, Jing Liu, Chin-Yew Lin, Sujian Li,
 gio. 2014. Neural machine translation by jointly
 Baobao Chang, and Zhifang Sui. 2016. Rbpb:
 learning to align and translate. arXiv preprint
 Regularization-based pattern balancing method for
 arXiv:1409.0473.
 event extraction. In Proceedings of the 54th An-
Xi Chen, Songyang Zhang, Dandan Song, Peng nual Meeting of the Association for Computational
 Ouyang, and Shouyi Yin. 2020. Transformer with Linguistics (Volume 1: Long Papers), pages 1224–
 bidirectional decoder for speech recognition. pages 1234.
 1773–1777. Lei Sha, Feng Qian, Baobao Chang, and Zhifang Sui.
 2018. Jointly extracting event triggers and argu-
Yubo Chen, Liheng Xu, Kang Liu, Daojian Zeng, and ments by dependency-bridge RNN and tensor-based
 Jun Zhao. 2015. Event extraction via dynamic multi- argument interaction. In Proceedings of the 32
 pooling convolutional neural networks. In Proceed- AAAI, New Orleans, Louisiana, USA, pages 5916–
 ings of the 53rd Annual Meeting of the ACL and the 5923.
 7th IJCNLP, pages 167–176.
 Tian Shi, Yaser Keneshloo, Naren Ramakrishnan, and
Angel Daza and Anette Frank. 2018. A sequence-to- Chandan K Reddy. 2021. Neural abstractive text
 sequence model for semantic role labeling. In Pro- summarization with sequence-to-sequence models.
 ceedings of The Third Workshop on Representation ACM Transactions on Data Science, 2(1):1–37.
 Learning for NLP, pages 207–216.
 Shengli Song, Haitao Huang, and Tongxiao Ruan.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and 2019. Abstractive text summarization using lstm-
 Kristina Toutanova. 2019. Bert: Pre-training of cnn based deep learning. Multimedia Tools and Ap-
 deep bidirectional transformers for language under- plications, 78(1):857–875.
 standing. In Proceedings of the 2019 Conference of
 the North American Chapter of the Association for Zoltán Tüske, Kartik Audhkhasi, and George Saon.
 Computational Linguistics: Human Language Tech- 2019. Advancing sequence-to-sequence based
 nologies, Volume 1 (Long and Short Papers), pages speech recognition. In INTERSPEECH, pages
 4171–4186. 3780–3784.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob References
 Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
 Kaiser, and Illia Polosukhin. 2017. Attention is all Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben-
 you need. In Advances in neural information pro- gio. 2014. Neural machine translation by jointly
 cessing systems, pages 5998–6008. learning to align and translate. arXiv preprint
 arXiv:1409.0473.
Xiaozhi Wang, Xu Han, Zhiyuan Liu, Maosong Sun, Xi Chen, Songyang Zhang, Dandan Song, Peng
 and Peng Li. 2019a. Adversarial training for weakly Ouyang, and Shouyi Yin. 2020. Transformer with
 supervised event detection. In Proceedings of the bidirectional decoder for speech recognition. pages
 2019 Conference of the NAACL, pages 998–1008. 1773–1777.

Xiaozhi Wang, Ziqi Wang, Xu Han, Zhiyuan Liu, Yubo Chen, Liheng Xu, Kang Liu, Daojian Zeng, and
 Juanzi Li, Peng Li, Maosong Sun, Jie Zhou, and Jun Zhao. 2015. Event extraction via dynamic multi-
 Xiang Ren. 2019b. Hmeae: Hierarchical modu- pooling convolutional neural networks. In Proceed-
 lar event argument extraction. In Proceedings of ings of the 53rd Annual Meeting of the ACL and the
 the 2019 Conference on Empirical Methods in Nat- 7th IJCNLP, pages 167–176.
 ural Language Processing and the 9th International
 Joint Conference on Natural Language Processing Angel Daza and Anette Frank. 2018. A sequence-to-
 (EMNLP-IJCNLP), pages 5781–5787. sequence model for semantic role labeling. In Pro-
 ceedings of The Third Workshop on Representation
 Learning for NLP, pages 207–216.
Pengcheng Yang, Xu Sun, Wei Li, Shuming Ma, Wei
 Wu, and Houfeng Wang. 2018. Sgm: Sequence Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
 generation model for multi-label classification. In Kristina Toutanova. 2019. Bert: Pre-training of
 Proceedings of the 27th International Conference on deep bidirectional transformers for language under-
 Computational Linguistics, pages 3915–3926. standing. In Proceedings of the 2019 Conference of
 the North American Chapter of the Association for
S Yang, DW Feng, LB aQiao, ZG Kan, and DS Li. Computational Linguistics: Human Language Tech-
 2019. Exploring pre-trained language models for nologies, Volume 1 (Long and Short Papers), pages
 event extraction and generation. In Proceedings of 4171–4186.
 the 57th Annual Meeting of the ACL.
 Xinya Du and Claire Cardie. 2020. Event extraction by
 answering (almost) natural questions. In Proceed-
Hong Yu, Zhang Jianfeng, Ma Bin, Yao Jianmin, Zhou ings of the 2020 Conference on Empirical Methods
 Guodong, and Zhu Qiaoming. 2011. Using cross- in Natural Language Processing (EMNLP), pages
 entity inference to improve event extraction. In Pro- 671–683.
 ceedings of the 49th Annual Meeting of the ACL,
 pages 1127–1136. Association for Computational Yunze Gao, Yingying Chen, Jinqiao Wang, and Han-
 Linguistics. qing Lu. 2019. Gate-based bidirectional interactive
 decoding network for scene text recognition. In Pro-
Ying Zeng, Honghui Yang, Yansong Feng, Zheng ceedings of the 28th ACM International Conference
 Wang, and Dongyan Zhao. 2016. A convolution bil- on Information and Knowledge Management, pages
 stm neural network model for chinese event extrac- 2273–2276.
 tion. In NLUIA, pages 275–287. Springer.
 Awni Hannun, Ann Lee, Qiantong Xu, and Ronan Col-
Xiangwen Zhang, Jinsong Su, Yue Qin, Yang Liu, Ron- lobert. 2019. Sequence-to-sequence speech recogni-
 grong Ji, and Hongji Wang. 2018. Asynchronous tion with time-depth separable convolutions. Proc.
 bidirectional decoding for neural machine transla- Interspeech 2019, pages 3785–3789.
 tion. In Proceedings of the AAAI Conference on Ar- Thien Huu Nguyen, Kyunghyun Cho, and Ralph Gr-
 tificial Intelligence, volume 32. ishman. 2016. Joint event extraction via recurrent
 neural networks. In Proceedings of the 2016 Confer-
Zhisong Zhang, Xiang Kong, Zhengzhong Liu, Xuezhe ence of the NAACL, pages 300–309.
 Ma, and Eduard Hovy. 2020. A two-step approach
 for implicit event argument detection. In Proceed- Thien Huu Nguyen and Ralph Grishman. 2015. Event
 ings of the 58th Annual Meeting of the Association detection and domain adaptation with convolutional
 for Computational Linguistics, pages 7479–7485. neural networks. In Proceedings of the 53rd Annual
 Meeting of the Association for Computational Lin-
Long Zhou, Jiajun Zhang, and Chengqing Zong. 2019. guistics and the 7th International Joint Conference
 Synchronous bidirectional neural machine transla- on Natural Language Processing (Volume 2: Short
 tion. Transactions of the Association for Computa- Papers), pages 365–371.
 tional Linguistics, 7:91–105.
 Jeffrey Pennington, Richard Socher, and Christopher D
 Manning. 2014. Glove: Global vectors for word rep-
 resentation. In Proceedings of the 2014 conference
on empirical methods in natural language process- Hong Yu, Zhang Jianfeng, Ma Bin, Yao Jianmin, Zhou
 ing (EMNLP), pages 1532–1543. Guodong, and Zhu Qiaoming. 2011. Using cross-
 entity inference to improve event extraction. In Pro-
Lei Sha, Jing Liu, Chin-Yew Lin, Sujian Li, ceedings of the 49th Annual Meeting of the ACL,
 Baobao Chang, and Zhifang Sui. 2016. Rbpb: pages 1127–1136. Association for Computational
 Regularization-based pattern balancing method for Linguistics.
 event extraction. In Proceedings of the 54th An-
 nual Meeting of the Association for Computational Ying Zeng, Honghui Yang, Yansong Feng, Zheng
 Linguistics (Volume 1: Long Papers), pages 1224– Wang, and Dongyan Zhao. 2016. A convolution bil-
 1234. stm neural network model for chinese event extrac-
 tion. In NLUIA, pages 275–287. Springer.
Lei Sha, Feng Qian, Baobao Chang, and Zhifang Sui.
 Xiangwen Zhang, Jinsong Su, Yue Qin, Yang Liu, Ron-
 2018. Jointly extracting event triggers and argu-
 grong Ji, and Hongji Wang. 2018. Asynchronous
 ments by dependency-bridge RNN and tensor-based
 bidirectional decoding for neural machine transla-
 argument interaction. In Proceedings of the 32
 tion. In Proceedings of the AAAI Conference on Ar-
 AAAI, New Orleans, Louisiana, USA, pages 5916–
 tificial Intelligence, volume 32.
 5923.
 Zhisong Zhang, Xiang Kong, Zhengzhong Liu, Xuezhe
Tian Shi, Yaser Keneshloo, Naren Ramakrishnan, and Ma, and Eduard Hovy. 2020. A two-step approach
 Chandan K Reddy. 2021. Neural abstractive text for implicit event argument detection. In Proceed-
 summarization with sequence-to-sequence models. ings of the 58th Annual Meeting of the Association
 ACM Transactions on Data Science, 2(1):1–37. for Computational Linguistics, pages 7479–7485.

Shengli Song, Haitao Huang, and Tongxiao Ruan. Long Zhou, Jiajun Zhang, and Chengqing Zong. 2019.
 2019. Abstractive text summarization using lstm- Synchronous bidirectional neural machine transla-
 cnn based deep learning. Multimedia Tools and Ap- tion. Transactions of the Association for Computa-
 plications, 78(1):857–875. tional Linguistics, 7:91–105.

Zoltán Tüske, Kartik Audhkhasi, and George Saon.
 2019. Advancing sequence-to-sequence based
 speech recognition. In INTERSPEECH, pages
 3780–3784.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
 Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
 Kaiser, and Illia Polosukhin. 2017. Attention is all
 you need. In Advances in neural information pro-
 cessing systems, pages 5998–6008.

Xiaozhi Wang, Xu Han, Zhiyuan Liu, Maosong Sun,
 and Peng Li. 2019a. Adversarial training for weakly
 supervised event detection. In Proceedings of the
 2019 Conference of the NAACL, pages 998–1008.

Xiaozhi Wang, Ziqi Wang, Xu Han, Zhiyuan Liu,
 Juanzi Li, Peng Li, Maosong Sun, Jie Zhou, and
 Xiang Ren. 2019b. Hmeae: Hierarchical modu-
 lar event argument extraction. In Proceedings of
 the 2019 Conference on Empirical Methods in Nat-
 ural Language Processing and the 9th International
 Joint Conference on Natural Language Processing
 (EMNLP-IJCNLP), pages 5781–5787.

Pengcheng Yang, Xu Sun, Wei Li, Shuming Ma, Wei
 Wu, and Houfeng Wang. 2018. Sgm: Sequence
 generation model for multi-label classification. In
 Proceedings of the 27th International Conference on
 Computational Linguistics, pages 3915–3926.

S Yang, DW Feng, LB aQiao, ZG Kan, and DS Li.
 2019. Exploring pre-trained language models for
 event extraction and generation. In Proceedings of
 the 57th Annual Meeting of the ACL.
You can also read