EIDER: Evidence-enhanced Document-level Relation Extraction

Yiqing Xie, Jiaming Shen, Sha Li, Yuning Mao, Jiawei Han
University of Illinois at Urbana-Champaign, IL, USA
{xyiqing2, js2, shal2, yuningm2, hanj}@illinois.edu

Abstract

Document-level relation extraction (DocRE) aims at extracting the semantic relations among entity pairs in a document. In DocRE, a subset of the sentences in a document, called the evidence sentences, might be sufficient for predicting the relation between a specific entity pair. To make better use of the evidence sentences, in this paper we propose a three-stage evidence-enhanced DocRE framework consisting of joint relation and evidence extraction, evidence-centered relation extraction (RE), and fusion of extraction results. We first jointly train an RE model with a simple and memory-efficient evidence extraction model. Then, we construct pseudo documents based on the extracted evidence sentences and run the RE model again. Finally, we fuse the extraction results of the first two stages using a blending layer and make a final prediction. Extensive experiments show that our proposed framework achieves state-of-the-art performance on the DocRED dataset, outperforming the second best method by 0.76/0.82 Ign F1/F1. In particular, our method significantly improves the performance on inter-sentence relations by 1.23 Inter F1.

[Figure 1]
h: Hero of the Day    t: the United States    r: country of origin
Ground truth evidence: [1, 10]    Predicted evidence: [1, 10]
Original document as input: [1] Load is the sixth studio album by the American heavy metal band Metallica, released on June 4, 1996 by Elektra Records in the United States and by Vertigo Records internationally. ... [9] It was certified 5×platinum by the Recording Industry Association of America (RIAA) for shipping five million copies in the United States. [10] Four singles—"Until It Sleeps", "Hero of the Day", "Mama Said", and "King Nothing"—were released as part of the marketing campaign for the album.
Prediction results (logits): NA: 17.63, country of origin: 14.79
Predicted evidence as input: [1] Load is ... in the United States and by Vertigo Records internationally. [10] Four singles—"Until It Sleeps", "Hero of the Day", ... for the album.
Prediction results (logits): country of origin: 18.31, NA: 13.45
Final prediction result of our model: country of origin

Figure 1: A test sample in the DocRED dataset (Yao et al., 2019), where the i-th sentence in the document is marked with [i] at the start. Our model correctly predicts [1, 10] as evidence, and if we only use the extracted evidence as input, the model can predict the relation "country of origin" correctly.

1 Introduction

Relation extraction (RE) is the task of extracting semantic relations among entities within a given text, which is a critical step in information extraction and has abundant applications (Yu et al., 2017; Shi et al., 2019; Trisedya et al., 2019). Prior studies mostly focus on sentence-level RE, where the two entities of interest co-occur in the same sentence and it is assumed that their relation can be derived from that sentence (Zeng et al., 2015; Cai et al., 2016). However, this assumption does not always hold, and some relations between entities can only be inferred given multiple sentences as context. As a result, recent studies have been moving towards the more realistic setting of document-level relation extraction (DocRE) (Quirk and Poon, 2017; Peng et al., 2017; Gupta et al., 2019).

Although multiple sentences might be necessary to infer a relation, the sentences in a document are not equally important for each entity pair, and some sentences could be irrelevant to the relation prediction. We refer to the minimal set of sentences required to infer a relation as evidence sentences (Huang et al., 2021). In the example shown in Figure 1, the 1st and 10th sentences serve as evidence sentences for the "country of origin" relation between "Hero of the Day" and "the United States". The 1st sentence indicates that Load originated in the United States, and the 10th indicates that Hero of the Day is a song on Load. Hence, Hero of the Day is also from the United States. On the other hand, although the 9th sentence also mentions "the United States", it is irrelevant to the relation prediction. Including such irrelevant sentences in the input might introduce noise to the model and be more detrimental than beneficial.

In light of the observations above, we propose
two approaches to make better use of evidence sentences. The first is to jointly extract relations and evidence. Intuitively, both tasks should focus on the information relevant to the current entity pair, such as the underlined "Load" and "the album" in the 10th sentence of Figure 1. This suggests that the two tasks have commonality and can provide additional training signals for each other. Huang et al. (2020) trains these two tasks in a multi-task learning manner. However, their model makes one prediction for every (relation, entity, entity, sentence) tuple, which requires massive GPU memory and long training time. Our method adopts a much simpler model structure, which only predicts for each (relation, entity, entity) tuple and can be trained on a single consumer GPU.

The second approach is to conduct evidence-centered relation extraction with the evidence sentences as model input. In the extreme case, if there is only one sentence related to the relation prediction, we can make predictions solely based on this sentence and reduce the problem to sentence-level relation extraction. One concurrent work (Huang et al., 2021) shows the effectiveness of replacing the original documents with sentences extracted by hand-crafted rules. However, the sentences extracted by rules are not perfect. Solely relying on heuristically extracted sentences may result in information loss and harm model performance in certain cases. Instead, our evidence is obtained by a dedicated evidence extraction model. In addition, we fuse the prediction results of both the original document and the extracted evidence to avoid information loss, and demonstrate that the two sources of predictions are complementary to each other.

Specifically, in this paper, we propose an evidence-enhanced RE framework, named EIDER, which automatically extracts evidence and effectively leverages the extracted evidence to improve the performance of DocRE in three stages. In the first stage, we train a relation extraction model and an evidence extraction model in a multi-task learning manner. We adopt localized context pooling (Zhou et al., 2021) in both models, which enhances the entity embedding with additional context relevant to the current entity pair. To reduce memory usage and training time, we use the same sentence representation for each relation and only train the evidence extraction model on entity pairs with at least one relation. In the second stage, we regard the extracted evidence as a pseudo document and make another set of predictions based on the pseudo document. In the last stage, we fuse the predictions based on the original document and the pseudo document using a blending layer (Wolpert, 1992). In this way, EIDER pays more attention to the important sentences extracted in the first stage, while still having access to the whole document to avoid information loss.

Contributions. (1) We propose a memory-efficient multi-task learning DocRE framework for joint relation and evidence extraction, which only requires around 14% additional memory compared to a single RE model alone. (2) We refine the model inference stage by using a blending layer to fuse the prediction results from the original document and the extracted evidence, which allows the model to focus more on the important sentences with no information loss. (3) We conduct extensive experiments to demonstrate the effectiveness of our method, which achieves state-of-the-art performance on the large-scale DocRED dataset.

2 Problem Formulation

Given a document d comprised of N sentences {s_t}_{t=1}^N and a set of entities {e_i} appearing in d, the task of document-level relation extraction (DocRE) is to predict the relations between all entity pairs (e_h, e_t) from a pre-defined relation set R ∪ {NA}. We refer to e_h and e_t as the head entity and tail entity, respectively. An entity e_i may appear multiple times in document d, where we denote its corresponding mentions as {m_j^i}. A relation r ∈ R between (e_h, e_t) exists if it is expressed by any pair of their mentions; the pair is otherwise labeled as NA. For each entity pair (e_h, e_t) that possesses a non-NA relation, we define its evidence sentences^1 V_{h,t} = {s_{v_i}}_{i=1}^K as the subset of sentences in the document that are sufficient for human annotators to infer the relation.

^1 We use "evidence sentence" and "evidence" interchangeably throughout the paper.
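To make the notation concrete, the sketch below shows one way to represent a DocRE example in Python. The field names are hypothetical and chosen only for illustration; they are not the format of any particular dataset loader:

```python
from dataclasses import dataclass

@dataclass
class Mention:
    sent_id: int                 # index of the sentence containing the mention
    start: int                   # token offsets of m_j^i within the document
    end: int

@dataclass
class Entity:
    mentions: list[Mention]      # {m_j^i}: all mentions of entity e_i in d

@dataclass
class DocREExample:
    sentences: list[list[str]]   # {s_t}_{t=1}^N, tokenized
    entities: list[Entity]       # {e_i}
    # (h, t) -> relation labels from R; an empty set means NA
    relations: dict[tuple[int, int], set[str]]
    # (h, t) -> V_{h,t}: evidence sentence indices for non-NA pairs
    evidence: dict[tuple[int, int], set[int]]
```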
3 Methodology

The framework of EIDER consists of three stages: joint relation and evidence extraction (Sec. 3.1), evidence-centered relation extraction (Sec. 3.2) and fusion of extraction results (Sec. 3.3). An illustration of our framework is shown in Figure 2.
[Figure 2 (schematic): In the joint relation and evidence extraction stage (left), a relation classifier and an evidence classifier operate on head, tail, context, and sentence embeddings from a shared pre-trained encoder. For the example document "[1] Paul Desmarais is a Canadian businessman in his hometown of Montreal. [2] He is the eldest son of Paul Desmarais Sr. and Jacqueline ... [4] Desmarais was born in Ontario. [5] He was educated at ...", the relation predicted from the original document is NA and the predicted evidence is [1, 4]. In the evidence-centered relation extraction and fusion stage (right), the trained model re-predicts on the pseudo document built from sentences [1] and [4], and a blending layer combines the prediction scores from the original document (Country: -1.05, Located in: -1.82, Citizenship: -11.53, ...) with those from the pseudo document (Country: 3.76, Located in: 2.70, Citizenship: -5.54, ...), yielding the final predicted relations "Country" and "Located in" under a learned threshold of -0.28.]

Figure 2: The overall architecture of EIDER.

3.1 Joint Relation and Evidence Extraction

In our framework, the relation extraction model and the evidence extraction model share a pre-trained encoder and have their own prediction heads. Intuitively, tokens relevant to the relation are essential in both models, such as "Paul Desmarais" in the 1st sentence and "Desmarais" in the 4th sentence of the example shown in Figure 2. By sharing the base encoder, the two models are able to provide additional training signals to each other and hence mutually enhance each other (Ruder, 2017; Shen et al., 2018; Liu et al., 2019).

Encoder. Given a document d = [s_t]_{t=1}^N, we insert a special token "*" before and after each entity mention. We then encode the document with a pre-trained BERT encoder (Devlin et al., 2019) to obtain the embedding of each token:

    H = [h_1, ..., h_L] = Encoder([s_1, ..., s_N]).    (1)

For each mention of an entity e_i, we first use the embedding of the start symbol "*" as its mention embedding. Then, we adopt LogSumExp pooling, a smooth version of max pooling, to obtain the embedding of entity e_i:

    e_i = log Σ_{j=1}^{N_{e_i}} exp(m_j^i),    (2)

where N_{e_i} is the number of occurrences of entity e_i in the document and m_j^i is the embedding of its j-th mention.
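As an illustration, Eq. (2) is a one-liner in PyTorch, assuming the "*" start-token embeddings of one entity's mentions are already stacked into a matrix (the names are ours):

```python
import torch

def entity_embedding(mention_embs: torch.Tensor) -> torch.Tensor:
    # mention_embs: [N_ei, d], one row per mention embedding m_j^i.
    # LogSumExp pooling: a smooth version of max pooling over mentions.
    return torch.logsumexp(mention_embs, dim=0)  # -> [d]
```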
We compute a context embedding for each entity pair (e_h, e_t) based on the attention matrix A ∈ R^{K×l×l} in the pre-trained encoder, following Zhou et al. (2021), where K is the number of attention heads. Intuitively, tokens with high attention towards both e_h and e_t are important to both entities and hence essential to the relation. We first compute the attention from each token to each mention m_j under the k-th head, denoted A^M_{j,k} ∈ R^l. Then, we compute the attention from each token to each entity e_i by averaging the attention over its mentions m_j^i ∈ e_i, denoted A^E_{i,k} ∈ R^l. The context embedding of (e_h, e_t) is then obtained by:

    c_{h,t} = H a^{(h,t)},  a^{(h,t)} = softmax( Σ_{k=1}^K A^E_{h,k} · A^E_{t,k} ).    (3)
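A minimal sketch of Eq. (3), assuming the token-to-entity attentions A^E_{h,k} and A^E_{t,k} have already been aggregated from the encoder's attention matrix (the shapes are our assumption):

```python
import torch

def context_embedding(H: torch.Tensor, attn_h: torch.Tensor, attn_t: torch.Tensor) -> torch.Tensor:
    # H: [L, d] token embeddings; attn_h, attn_t: [K, L] attention from every
    # token to the head/tail entity under each of the K heads.
    a = torch.softmax((attn_h * attn_t).sum(dim=0), dim=-1)  # a^{(h,t)}: [L]
    return H.t() @ a                                         # c_{h,t} = H a^{(h,t)}: [d]
```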
Relation Prediction Head. We first map the embeddings of e_h and e_t to context-aware representations z_h, z_t by combining their entity embeddings with the context embedding c_{h,t}, and then obtain the probability that relation r ∈ R holds between (e_h, e_t) via a bilinear function:

    z_h = tanh(W_h e_h + c_{h,t}),
    z_t = tanh(W_t e_t + c_{h,t}),    (4)
    P(r|e_h, e_t) = σ(z_h W_r z_t + b_r),

where W_h, W_t, W_r, b_r are learnable parameters.
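Eq. (4) can be sketched as a small PyTorch module; this is an illustrative version with assumed sizes (efficient implementations such as ATLOP use a grouped bilinear layer instead):

```python
import torch
import torch.nn as nn

class RelationHead(nn.Module):
    def __init__(self, hidden: int, num_relations: int):
        super().__init__()
        self.proj_h = nn.Linear(hidden, hidden, bias=False)           # W_h
        self.proj_t = nn.Linear(hidden, hidden, bias=False)           # W_t
        self.bilinear = nn.Bilinear(hidden, hidden, num_relations)    # W_r, b_r

    def forward(self, e_h, e_t, c_ht):
        z_h = torch.tanh(self.proj_h(e_h) + c_ht)
        z_t = torch.tanh(self.proj_t(e_t) + c_ht)
        return self.bilinear(z_h, z_t)  # logits; sigmoid gives P(r | e_h, e_t)
```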
We adopt the adaptive-thresholding loss (Zhou et al., 2021) as the loss function of the RE model. Specifically, we say a relation is in the positive classes P_T if it actually exists between the entity pair, and otherwise it is in the negative classes N_T. We then introduce a dummy relation class TH and encourage the logits of positive classes to be larger than that of TH, and the logits of negative classes to be smaller than that of TH:

    L_RE = − Σ_{r∈P_T} log[ exp(logit_r) / Σ_{r'∈P_T∪{TH}} exp(logit_{r'}) ]
           − log[ exp(logit_TH) / Σ_{r'∈N_T∪{TH}} exp(logit_{r'}) ],    (5)

where logit denotes the hidden representation in the last layer before the sigmoid function.
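For concreteness, here is a minimal PyTorch sketch of the adaptive-thresholding loss in Eq. (5). It assumes multi-hot labels and stores the dummy TH class in column 0; both choices are ours for illustration, not a prescription of the exact implementation:

```python
import torch
import torch.nn.functional as F

NEG = -1e30  # effectively -inf, but avoids NaN from (-inf * 0)

def adaptive_thresholding_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # logits, labels: [num_pairs, 1 + num_relations]; column 0 is the dummy TH
    # class and labels is multi-hot with labels[:, 0] == 0.
    th = torch.zeros_like(labels)
    th[:, 0] = 1.0

    # First term of Eq. (5): softmax over {positive classes} ∪ {TH}.
    logits_pos = logits.masked_fill((labels + th) < 1, NEG)
    loss_pos = -(F.log_softmax(logits_pos, dim=-1) * labels).sum(dim=-1)

    # Second term of Eq. (5): softmax over {negative classes} ∪ {TH};
    # (1 - labels) keeps the negatives and TH (since labels[:, 0] == 0).
    logits_neg = logits.masked_fill((1.0 - labels) < 1, NEG)
    loss_neg = -F.log_softmax(logits_neg, dim=-1)[:, 0]

    return (loss_pos + loss_neg).mean()
```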
Evidence Prediction Head. The evidence extraction model predicts whether each sentence s_i is an evidence sentence of the entity pair (e_h, e_t).^2 To obtain its sentence embedding s_i, we apply mean pooling over all the tokens in the sentence: s_i = (1/|s_i|) Σ_{h_l∈s_i} h_l. Mean pooling shows better performance than LogSumExp pooling in our experiments.

^2 The evidence information is available during training but is not required during inference.

Intuitively, the tokens contributing more to c_{h,t} are more important to both e_h and e_t, and hence may be relevant to the relation prediction. Similarly, if s_i is an evidence sentence of (e_h, e_t), the tokens in s_i would also be relevant to the relation prediction. Hence, a function with both c_{h,t} and s_i as input could be helpful for predicting whether s_i is in the evidence V_{h,t}. Specifically, we use a bilinear function between the context embedding c_{h,t} and the sentence embedding s_i to measure the importance of sentence s_i to the entity pair (e_h, e_t):

    P(s_i|e_h, e_t) = σ(s_i W_v c_{h,t} + b_v),    (6)

where W_v and b_v are learnable parameters.

As an entity pair may have more than one evidence sentence, we use the binary cross entropy as the objective to train the evidence extraction model:

    L_Evi = − Σ_{s_i∈d} [ y_i · log P(s_i|e_h, e_t) + (1 − y_i) · log(1 − P(s_i|e_h, e_t)) ],    (7)

where y_i = 1 when s_i ∈ V_{h,t} and y_i = 0 otherwise.
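Eqs. (6)-(7) amount to a bilinear layer followed by binary cross entropy; a sketch with assumed shapes, handling one entity pair at a time (the names are ours):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EvidenceHead(nn.Module):
    def __init__(self, hidden: int):
        super().__init__()
        self.bilinear = nn.Bilinear(hidden, hidden, 1)  # W_v and b_v of Eq. (6)

    def forward(self, sent_embs: torch.Tensor, c_ht: torch.Tensor) -> torch.Tensor:
        # sent_embs: [num_sents, d] mean-pooled sentence embeddings s_i
        # c_ht: [d] context embedding of the current entity pair (e_h, e_t)
        ctx = c_ht.unsqueeze(0).repeat(sent_embs.size(0), 1)
        return self.bilinear(sent_embs, ctx).squeeze(-1)   # one logit per sentence

def evidence_loss(evi_logits: torch.Tensor, evi_labels: torch.Tensor) -> torch.Tensor:
    # Binary cross entropy of Eq. (7); evi_labels[i] = 1 iff s_i is in V_{h,t}.
    return F.binary_cross_entropy_with_logits(evi_logits, evi_labels)
```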
Optimization. Finally, we optimize our model with the combination of the relation extraction loss L_RE and the evidence extraction loss L_Evi:

    L = L_RE + α · L_Evi,    (8)

where α is a hyper-parameter that balances the two losses.

Inference. After the model is trained, we feed the original documents as input for relation extraction. For each entity pair (e_h, e_t), we obtain the prediction score of each relation r ∈ R by:

    S_{h,t,r} = logit_r − logit_TH,  if logit_r ∈ top_k(logit);
    S_{h,t,r} = −∞,  otherwise,    (9)

where top_k(logit) denotes the k relations with the largest logits, which may also contain the dummy class TH.

We also extract the evidence from the joint model, denoted V'_{h,t}. For simplicity, we predict s_i as an evidence sentence if P(s_i|e_h, e_t) > 0.5.
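A sketch of the scoring rule in Eq. (9) for a single entity pair; the TH class is assumed to sit at index 0, and the value of k is a hyper-parameter:

```python
import torch

def relation_scores(logits: torch.Tensor, k: int = 4) -> torch.Tensor:
    # logits: [1 + num_relations], with the dummy TH class at index 0.
    scores = torch.full_like(logits, float("-inf"))
    top_idx = torch.topk(logits, k).indices        # top-k classes, may include TH
    scores[top_idx] = logits[top_idx] - logits[0]  # margin over the TH logit
    return scores[1:]  # per-relation scores S_{h,t,r}; r is predicted if > 0
```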
3.2 Evidence-centered Relation Extraction

Suppose we were given the ground truth evidence during inference and the evidence were labeled perfectly, that is, it already contained all the information relevant to the relation. Then there would be no need to use the whole document during relation extraction. Instead, we could construct a pseudo document d'_{h,t} for each entity pair (e_h, e_t) by concatenating the evidence sentences V_{h,t} in the order they appear in the original document, and feed the pseudo documents to the trained model.

Since the evidence annotation is only available during training, we construct the pseudo documents from the evidence extracted by our model, denoted V'_{h,t}, and obtain another set of prediction scores by Eq. (9). We denote the prediction scores from the original documents and the pseudo documents as S^{(O)} and S^{(E)}, respectively. Note that the same trained model is used for prediction with both the original document and the extracted evidence, so our method does not require multiple rounds of training and remains comparable to other methods that use only one round of prediction.
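Constructing a pseudo document then reduces to a filter-and-concatenate over the predicted evidence, e.g.:

```python
def build_pseudo_document(sentences: list[str], evidence: set[int]) -> str:
    # Keep only the predicted evidence sentences V'_{h,t}, preserving their
    # original order in the document (Sec. 3.2).
    return " ".join(s for i, s in enumerate(sentences) if i in evidence)
```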
3.3 Fusion of Extraction Results

On one hand, if the extracted evidence is completely accurate, directly using it for prediction may simplify the structure of the original document and hence make it easier for the model to make the correct predictions. On the other hand, the quality of the extracted evidence is not perfect. Besides, the non-evidence sentences in the original document may also provide background information on the entities and can possibly contribute to the prediction. Hence, solely relying on these sentences may result in information
loss and lead to sub-optimal performance. As a result, we combine the prediction results on both the original documents and the extracted evidence.

After obtaining the two sets of relation prediction results from the original documents and the pseudo documents, we fuse the results by aggregating their prediction scores S^{(O)} and S^{(E)} through a blending layer (Wolpert, 1992):

    P_Fuse(r|e_h, e_t) = σ(S^{(O)}_{h,t,r} + S^{(E)}_{h,t,r} − τ),    (10)

where τ is a learnable parameter. We optimize τ on the development set as follows:

    L_Fuse = − Σ_{d∈D} Σ_{h≠t} Σ_{r∈R} [ y_r · log P_Fuse(r|e_h, e_t) + (1 − y_r) · log(1 − P_Fuse(r|e_h, e_t)) ],    (11)

where y_r = 1 if the relation r holds between (e_h, e_t) and y_r = 0 otherwise.
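Since only the single scalar τ is learned in this stage, the blending layer of Eqs. (10)-(11) can be fit cheaply on the development set. A sketch, assuming the score tensors S^{(O)} and S^{(E)} have been precomputed (the fitting procedure shown here is one plausible choice, not necessarily our exact one):

```python
import torch
import torch.nn.functional as F

def fuse(scores_orig, scores_evi, tau):
    # P_Fuse(r | e_h, e_t) of Eq. (10); inputs: [num_pairs, num_relations]
    return torch.sigmoid(scores_orig + scores_evi - tau)

def fit_tau(scores_orig, scores_evi, labels, steps: int = 200, lr: float = 0.1):
    # labels: [num_pairs, num_relations] multi-hot dev-set annotations (Eq. 11).
    z = (scores_orig + scores_evi).clamp(min=-30.0)  # guard the -inf entries of Eq. (9)
    tau = torch.zeros(1, requires_grad=True)
    opt = torch.optim.Adam([tau], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.binary_cross_entropy_with_logits(z - tau, labels)
        loss.backward()
        opt.step()
    return tau.detach()
```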
4 Experiments

4.1 Experiment Setup

Dataset. We evaluate all compared methods on the DocRED benchmark (Yao et al., 2019), a large human-annotated document-level RE dataset that consists of 3,053/1,000/1,000 documents for training/development/testing, respectively. DocRED is constructed from Wikipedia, involving 96 relation types, 132,275 entities, and 56,354 relations. More than 40.7% of the relations can only be extracted by considering multiple sentences. The dataset provides evidence sentences as part of the annotation; however, our model does not have access to this information during the inference stage.

Baseline Methods. We compare EIDER with state-of-the-art DocRE methods from two categories. The first includes graph-based methods that construct document graphs by simple rules such as entity co-occurrence or co-reference and explicitly perform inference on the graph. Examples include LSR (Nan et al., 2020), GAIN (Zeng et al., 2020), GLRE (Wang et al., 2020), DRN (Xu et al., 2021) and SIRE (Zeng et al., 2021). The second category contains transformer-based methods that adopt the Transformer architecture to capture long-distance token dependencies implicitly and model cross-sentence relations. Our model also falls in this category, and we compare it with BERT (Devlin et al., 2019), BERT-Two-Step (Wang et al., 2019), E2GRE (Huang et al., 2020), CorefBERT (Ye et al., 2020), HIN (Tang et al., 2020), ATLOP (Zhou et al., 2021) and DocuNet (Yuan et al., 2021). For fair comparisons, we use BERTbase as the base encoder for all methods.

Evaluation Metrics. Following prior studies (Yao et al., 2019; Zhou et al., 2021), we use F1 and Ign F1 as the main evaluation metrics. Ign F1 measures the F1 score excluding the relations shared by the training and development/test sets. We also report Intra F1 and Inter F1, where the former measures the performance on co-occurring entity pairs (intra-sentence) and the latter measures the performance on inter-sentence relations where none of the entity mention pairs co-occur.

Implementation Details. Our model is implemented based on PyTorch and Huggingface's Transformers (Wolf et al., 2019). We use cased BERT-base (Devlin et al., 2019) as the base encoder and optimize our model using AdamW with learning rate 5e-5 for the base encoder and 1e-4 for other parameters. We adopt a linear warmup for the first 6% of steps. The batch size (number of documents per batch) is set to 4, and the ratio α between the relation extraction loss and the evidence extraction loss is set to 0.1. We perform early stopping based on the F1 score on the development set, with a maximum of 30 epochs. All our models are trained on one GTX 1080 Ti GPU.

4.2 Main Results

Relation Extraction Results. Table 1 presents the relation extraction results of EIDER and baseline models on the DocRED dataset. First, we can observe that our method outperforms all the baseline methods in terms of all compared metrics on both the development set and the test set. Furthermore, compared to ATLOP (Zhou et al., 2021), which uses the same base relation extraction model as our method, EIDER improves the performance significantly, by 1.56/1.40 F1/Ign F1 on the development set and 1.17/1.09 on the test set. This demonstrates the usefulness of joint extraction and of integrating the extracted evidence during inference.
Model                                  |  Dev: Ign F1 /  F1   / Intra F1 / Inter F1 | Test: Ign F1 /  F1
LSR-BERTbase (Nan et al., 2020)        |        52.43 / 59.00 /  65.26   /  52.05   |       56.97 / 59.05
GLRE-BERTbase (Wang et al., 2020)      |          -   /   -   /    -     /    -     |       55.40 / 57.40
GAIN-BERTbase (Zeng et al., 2020)      |        59.14 / 61.22 /  67.10   /  53.90   |       59.00 / 61.24
DRN-BERTbase (Xu et al., 2021)         |        59.33 / 61.39 /    -     /    -     |       59.15 / 61.37
SIRE-BERTbase (Zeng et al., 2021)      |        59.82 / 61.60 /  68.07   /  54.01   |       60.18 / 62.05
BERTbase (Wang et al., 2019)           |          -   / 54.16 /  61.61   /  47.15   |         -   / 53.20
BERT-Two-Step (Wang et al., 2019)      |          -   / 54.42 /  61.80   /  47.28   |         -   / 53.92
HIN-BERTbase (Tang et al., 2020)       |        54.29 / 56.31 /    -     /    -     |       53.70 / 55.60
E2GRE-BERTbase (Huang et al., 2020)    |        55.22 / 58.72 /    -     /    -     |         -   /   -
CorefBERTbase (Ye et al., 2020)        |        55.32 / 57.51 /    -     /    -     |       54.54 / 56.96
ATLOP-BERTbase (Zhou et al., 2021)     |        59.22 / 61.09 /  67.44†  /  53.17†  |       59.31 / 61.30
DocuNet-BERTbase (Yuan et al., 2021)   |        59.86 / 61.83 /    -     /    -     |       59.93 / 61.86
EIDER-BERTbase                         |        60.62 / 62.65 /  68.55   /  55.24   |       60.42 / 62.47

Table 1: Performance comparison on DocRED. Results with † are based on our implementation; others are reported in their original papers. The first group lists graph-based methods and the second group lists transformer-based methods.

Model                | Dev F1 | Test F1
E2GRE-BERTbase       |  47.14 |    -
E2GRE-RoBERTalarge   |    -   |  50.50
EIDER-BERTbase       |  50.71 |  51.27

Table 2: The evidence extraction results on DocRED. The performance of E2GRE (Huang et al., 2020) is reported in their original paper.

Ablation    | Ign F1 |   F1  | Intra F1 | Inter F1
EIDER-Full  |  60.62 | 62.65 |   68.55  |   55.24
NoJoint     |  59.98 | 62.03 |   68.51  |   54.10
NoEvi       |  59.70 | 61.53 |   67.55  |   54.01
NoOrigDoc   |  59.25 | 61.57 |   67.59  |   54.10
NoBlending  |  58.93 | 61.46 |   67.33  |   54.37

Table 3: Ablation study of EIDER on DocRED.

The experiment results also show that EIDER performs better than previous methods by 1.23/0.48 Inter F1/Intra F1 on the development set. We notice that the improvement on Inter F1 is much larger than that on Intra F1, which indicates that using evidence extraction as an auxiliary task and utilizing the extracted evidence in the inference stage can largely improve the inter-sentence prediction ability of the model. GAIN and ATLOP have similar overall F1/Ign F1 scores, but the Inter F1 of GAIN is 0.73 higher and the Intra F1 of ATLOP is 0.34 higher. This indicates that these methods may capture the long-distance dependency between entities by directly connecting them on the graph. Although EIDER does not involve explicit multi-hop reasoning modules, it still significantly outperforms the graph-based models in terms of Inter F1. This demonstrates that evidence-centered relation extraction also helps EIDER capture long-distance dependencies between entities and hence infer complicated relations from multiple sentences.

Evidence Extraction Results. We list the results of evidence prediction in Table 2. EIDER-BERTbase outperforms E2GRE-BERTbase significantly, by 3.57 F1 on the development set, and outperforms E2GRE-RoBERTalarge by 0.77 F1 on the test set. Note that our model adopts a much simpler structure for evidence extraction. To compare the memory consumption of EIDER and E2GRE, we implement the evidence prediction model of E2GRE and jointly train it with the same relation extraction model as ours. Experiments show that training our joint model requires 10,916 MB on a single GTX 1080 Ti GPU, while E2GRE fails to run on the same GPU and requires 36,182 MB on an RTX A6000 GPU, which shows that EIDER is much more memory-efficient. Moreover, we find that training the standalone relation extraction model consumes 9,579 MB of GPU memory, which indicates that our joint model training incurs less than 14% GPU memory overhead.

4.3 Performance Analysis

Ablation Studies. We conduct ablation studies to further analyze the utility of each module in EIDER. The results are shown in Table 3.

We first train the RE model and the evidence extraction model separately, denoted as NoJoint. Intra F1/Inter F1 drop by 0.04/1.14. We observe that the drop in Inter F1 is much more significant, which shows that without evidence extraction as an auxiliary task, the RE model's ability to identify the important context is almost unchanged for intra-sentence pairs but largely decreased for inter-sentence pairs. Consequently, the RE model fails to infer complicated relations from multiple sentences
if it is trained alone.

Then, we remove the extracted evidence and the original document in the inference stage separately. These two ablations are denoted as NoEvi and NoOrigDoc, respectively. We observe that removing either source leads to performance drops, which indicates that both the extracted evidence and the original document are important for relation prediction. The original documents may contain irrelevant and noisy sentences, and the extracted evidence sentences alone may fail to cover all the important information in the original document. Therefore, it is necessary to use both during inference.

Finally, we remove the blending layer by simply taking the union of the relations extracted from the original documents and the pseudo documents, denoted as NoBlending. The performance drops sharply, by 1.19/1.71 F1/Ign F1, which further demonstrates that the blending layer successfully learns a dynamic threshold to combine the two sets of prediction results.
Performance Breakdown. To further analyze the performance of EIDER on different types of entity pairs, we categorize the relations into three categories: (1) Intra, where the two entities co-occur in the same sentence; (2) Coref, where none of their explicit entity mention pairs co-occur, but their co-references co-occur; (3) Bridge, where the first two conditions are not satisfied, but there exists a third entity whose mentions co-occur with both the head entity and the tail entity. The statistics of each category are listed in Table 4, where the co-references of each entity are extracted by HOI (Xu and Choi, 2020). From the statistics, we can see that the three categories cover over 88% of the relations in the development set.

           Intra     Coref    Bridge    Total
Count      6,711     984      3,212     10,907
Percent    54.46%    7.99%    26.07%    88.52%

Table 4: The statistics of categories among the 12,323 relations in the DocRED development set.

The results on each category are shown in Figure 3. We can see that our model outperforms ATLOP in all three categories, and the differences between models vary per category. Under the Intra and Coref categories, our full model and the NoJoint ablation significantly outperform the NoEvi ablation and ATLOP, which indicates that the performance gain for entity pairs whose mentions or co-references co-occur comes mainly from combining the extracted evidence during inference. For the Bridge category, the full model outperforms our two ablations by a large margin, and the two ablations further outperform ATLOP by a large margin. This reveals that both extracting evidence in training and predicting with evidence during inference improve the performance on entity pairs that require multi-hop reasoning over a third entity.

[Figure 3 (bar chart): F1 gains over ATLOP by relation category for Ours-Full/NoJoint/NoEvi: +1.1/+1.1/+0.1 on Intra, +1.6/+1.0/+0.3 on Coref, and +2.0/+0.8/+0.7 on Bridge. ATLOP's F1 is 67.5 on Intra, 61.0 on Coref, and 51.4 on Bridge.]

Figure 3: Performance gains in F1 by relation categories. The gains are relative to ATLOP.
4.4 Case Studies

Table 5 shows a few example output cases of our model EIDER. In the first example, the extracted evidence contains the ground truth evidence, and the prediction on the pseudo document is correct. In the second example, the 6th sentence is missing from the extracted evidence, but fortunately, the prediction on the original document is correct, and so is the final prediction. The last example shows a case where EIDER successfully predicts the evidence, but the prediction with the pseudo document as input is still "NA". This is consistent with our argument that the non-evidence sentences in the original document may also provide background information on the entities and can possibly contribute to the prediction.
Example 1
Ground Truth Relation: Place of birth    Ground Truth Evidence Sentence(s): [3]
Document: [1] Kurt Tucholsky (9 January 1890 – 21 December 1935) was a German-Jewish journalist, satirist, and writer. [2] He also wrote under the pseudonyms Kaspar Hauser (after the historical figure), Peter Panter, Theobald Tiger and Ignaz Wrobel. [3] Born in Berlin-Moabit, he moved to Paris in 1924 and then to Sweden in 1929. [4] Tucholsky was one of the most important journalists of ...
Extracted Evidence Sentence(s): [1, 3]
Prediction based on Orig. Document: NA    Prediction based on Extracted Evidence: Place of Birth
Final Predicted Type: Place of Birth

Example 2
Ground Truth Relation: Inception    Ground Truth Evidence Sentence(s): [5, 6]
Document: [1] Oleg Tinkov (born 25 December 1967) is a Russian entrepreneur and cycling sponsor. ... [5] Tinkoff is the founder and chairman of the Tinkoff Bank board of directors (until 2015 it was called Tinkoff Credit Systems). [6] The bank was founded in 2007 and as of December 1, 2016, it is ranked 45 in terms of assets and 33 for equity among Russian banks. ...
Extracted Evidence Sentence(s): [5]
Prediction based on Orig. Document: Inception    Prediction based on Extracted Evidence: NA
Final Predicted Type: Inception

Example 3
Ground Truth Relation: Original network    Ground Truth Evidence Sentence(s): [1, 2]
Document: [1] "How to Save a Life" is the twenty-first episode of the eleventh season of the American television medical drama Grey's Anatomy, and is the 241st episode overall. [2] It aired on April 23, 2015 on ABC in the United States. [3] The episode was written by showrunner Shonda Rhimes and directed by Rob Hardy, making it the first episode Rhimes has written since the season eight finale "Flight". ...
Extracted Evidence Sentence(s): [1, 2]
Prediction based on Orig. Document: Original network    Prediction based on Extracted Evidence: NA
Final Predicted Type: Original network

Table 5: Case studies of our proposed framework EIDER. We use red, blue and green to color the head entity, tail entity and relation, respectively. The indices of ground truth evidence sentences are highlighted with yellow.

5 Related Work

Relation Extraction. Previous research efforts on relation extraction mainly concentrate on predicting relations within a sentence (Cai et al., 2016; Zeng et al., 2015; Feng et al., 2018; Zheng et al., 2021; Zhang et al., 2018, 2019, 2020). While these approaches tackle the sentence-level RE task effectively, in the real world certain relations can only be inferred from multiple sentences. Consequently, recent studies (Quirk and Poon, 2017; Peng et al., 2017; Yao et al., 2019; Wang et al., 2019; Tang et al., 2020) have proposed to work on document-level relation extraction (DocRE).

Graph-based DocRE. Graph-based DocRE methods generally first construct a graph with mentions, entities, sentences or documents as the nodes, and then infer the relations by reasoning over this graph. Specifically, Nan et al. (2020) constructs a document-level graph and iteratively updates the node representations and refines the graph topological structure. Zeng et al. (2020) first constructs both a mention-level graph and an entity-level graph, and then performs multi-hop reasoning on both graphs. Xu et al. (2020) extracts a reasoning path between each entity pair holding at least one relation, and encourages the model to reconstruct the path during training. These methods simplify the input document by extracting a graph over entities and performing explicit graph reasoning. However, the complicated operations on the graphs lower the efficiency of these methods.

Transformer-based DocRE. Another line of studies relies solely on the Transformer architecture (Devlin et al., 2019) to model cross-sentence relations, since transformers are able to implicitly capture long-distance dependencies. Zhou et al. (2021) uses the attention in the transformer to extract useful context and adopts an adaptive threshold for each entity pair. Yuan et al. (2021) views DocRE as a semantic segmentation task. Huang et al. (2021) designs several hand-crafted rules to extract sentences that are important to the prediction. Similar to our method, Huang et al. (2020) learns a model to perform joint relation extraction and evidence extraction. However, our method uses a much simpler model structure for the evidence extraction model, which reduces memory usage and improves training efficiency. We are also the first to fuse predictions based on the extracted evidence sentences during inference.

6 Conclusion

In this work, we propose a three-stage DocRE framework consisting of joint relation and evidence extraction, evidence-centered relation extraction, and fusion of extraction results. The joint training stage adopts a simple model structure and is memory-efficient. The relation extraction and evidence extraction models provide additional training signals to each other and mutually enhance each other. We combine the prediction results on both the original document and the extracted evidence, which encourages the model to focus on the important sentences while reducing information loss. Experimental results demonstrate that our model significantly outperforms existing methods on DocRED, especially on inter-sentence relations.
References

Rui Cai, Xiaodong Zhang, and Houfeng Wang. 2016. Bidirectional recurrent convolutional neural network for relation classification. In ACL, pages 756–765.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pages 4171–4186.

Jun Feng, Minlie Huang, Li Zhao, Yang Yang, and Xiaoyan Zhu. 2018. Reinforcement learning for relation classification from noisy data. In AAAI, pages 5779–5786.

Pankaj Gupta, Subburam Rajaram, Hinrich Schütze, and Thomas A. Runkler. 2019. Neural relation extraction within and across sentence boundaries. In AAAI, pages 6513–6520.

Kevin Huang, Guangtao Wang, Tengyu Ma, and Jing Huang. 2020. Entity and evidence guided relation extraction for DocRED.

Quzhe Huang, Shengqi Zhu, Yansong Feng, Yuan Ye, Yuxuan Lai, and Dongyan Zhao. 2021. Three sentences are all you need: Local path enhanced document relation extraction.

Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019. Multi-task deep neural networks for natural language understanding. In ACL.

Guoshun Nan, Zhijiang Guo, Ivan Sekulic, and Wei Lu. 2020. Reasoning with latent structure refinement for document-level relation extraction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1546–1557.

Nanyun Peng, Hoifung Poon, Chris Quirk, Kristina Toutanova, and Wen-tau Yih. 2017. Cross-sentence n-ary relation extraction with graph LSTMs. Transactions of the Association for Computational Linguistics, 5:101–115.

Chris Quirk and Hoifung Poon. 2017. Distant supervision for relation extraction beyond the sentence boundary. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1171–1182.

Sebastian Ruder. 2017. An overview of multi-task learning in deep neural networks. ArXiv, abs/1706.05098.

J. Shen, Maryam Karimzadehgan, Michael Bendersky, Zhen Qin, and Donald Metzler. 2018. Multi-task learning for email search ranking with auxiliary query clustering. In CIKM.

Y. Shi, Jiaming Shen, Yuchen Li, N. Zhang, Xinwei He, Zhengzhi Lou, Q. Zhu, M. Walker, Myung-Hwan Kim, and Jiawei Han. 2019. Discovering hypernymy in text-rich heterogeneous information network by exploiting context granularity. In CIKM.

Hengzhu Tang, Yanan Cao, Zhenyu Zhang, Jiangxia Cao, Fang Fang, Shi Wang, and Pengfei Yin. 2020. HIN: Hierarchical inference network for document-level relation extraction. In Advances in Knowledge Discovery and Data Mining - 24th Pacific-Asia Conference, PAKDD 2020, volume 12084 of Lecture Notes in Computer Science, pages 197–209.

Bayu Distiawan Trisedya, Gerhard Weikum, Jianzhong Qi, and Rui Zhang. 2019. Neural relation extraction for knowledge base enrichment. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 229–240.

Difeng Wang, Wei Hu, Ermei Cao, and Weijian Sun. 2020. Global-to-local neural networks for document-level relation extraction. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3711–3721.

Hong Wang, Christfried Focke, Rob Sylvester, Nilesh Mishra, and William Wang. 2019. Fine-tune BERT for DocRED with two-step process. arXiv:1909.11898.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.

David H. Wolpert. 1992. Stacked generalization. Neural Networks, 5:241–259.

Liyan Xu and Jinho D. Choi. 2020. Revealing the myth of higher-order inference in coreference resolution. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8527–8533.

Wang Xu, Kehai Chen, and Tiejun Zhao. 2020. Document-level relation extraction with reconstruction.

Wang Xu, Kehai Chen, and Tiejun Zhao. 2021. Discriminative reasoning for document-level relation extraction.

Yuan Yao, Deming Ye, Peng Li, Xu Han, Yankai Lin, Zhenghao Liu, Zhiyuan Liu, Lixin Huang, Jie Zhou, and Maosong Sun. 2019. DocRED: A large-scale document-level relation extraction dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 764–777.

Deming Ye, Yankai Lin, Jiaju Du, Zhenghao Liu, Peng Li, Maosong Sun, and Zhiyuan Liu. 2020. Coreferential reasoning learning for language representation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7170–7186.

Mo Yu, Wenpeng Yin, Kazi Saidul Hasan, Cicero dos Santos, Bing Xiang, and Bowen Zhou. 2017. Improved neural relation detection for knowledge base question answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 571–581.

Changsen Yuan, Heyan Huang, Chong Feng, Ge Shi, and Xiaochi Wei. 2021. Document-level relation extraction with entity-selection attention. Information Sciences, 568:163–174.

Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao. 2015. Distant supervision for relation extraction via piecewise convolutional neural networks. In EMNLP, pages 1753–1762.

Shuang Zeng, Yuting Wu, and Baobao Chang. 2021. SIRE: Separate intra- and inter-sentential reasoning for document-level relation extraction.

Shuang Zeng, Runxin Xu, Baobao Chang, and Lei Li. 2020. Double graph based reasoning for document-level relation extraction. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1630–1640.

Ningyu Zhang, Shumin Deng, Zhanlin Sun, Jiaoyan Chen, Wei Zhang, and Huajun Chen. 2020. Relation adversarial network for low resource knowledge graph completion. In Proceedings of The Web Conference 2020.

Ningyu Zhang, Shumin Deng, Zhanlin Sun, Xi Chen, Wei Zhang, and Huajun Chen. 2018. Attention-based capsule networks with dynamic routing for relation extraction. In EMNLP.

Ningyu Zhang, Shumin Deng, Zhanlin Sun, Guanying Wang, Xi Chen, Wei Zhang, and Huajun Chen. 2019. Long-tail relation extraction via knowledge graph embeddings and graph convolution networks. In NAACL-HLT.

Hengyi Zheng, Rui Wen, Xi Chen, Yifan Yang, Yunnan Zhang, Ziheng Zhang, Ningyu Zhang, Bin Qin, Xu Ming, and Yefeng Zheng. 2021. PRGC: Potential relation and global correspondence based joint relational triple extraction. In ACL.

Wenxuan Zhou, Kevin Huang, Tengyu Ma, and Jing Huang. 2021. Document-level relation extraction with adaptive thresholding and localized context pooling. In Proceedings of the AAAI Conference on Artificial Intelligence.