End-to-end Biomedical Entity Linking with Span-based Dictionary Matching

Shogo Ujiie♠  Hayate Iso♥∗  Shuntaro Yada♠  Shoko Wakamiya♠  Eiji Aramaki♠

♠ Nara Institute of Science and Technology   ♥ Megagon Labs
{ujiie, yada-s, wakamiya, aramaki}@is.naist.jp   hayate@megagon.ai

Abstract

Disease name recognition and normalization, which is generally called biomedical entity linking, is a fundamental process in biomedical text mining. Recently, neural joint learning of both tasks has been proposed to exploit their mutual benefits. While this approach achieves high performance, disease concepts that do not appear in the training dataset cannot be accurately predicted. This study introduces a novel end-to-end approach that combines span representations with dictionary-matching features to address this problem. Our model handles unseen concepts by referring to a dictionary while maintaining the performance of neural network-based models, in an end-to-end fashion. Experiments using two major datasets demonstrate that our model achieves competitive results against strong baselines, especially for concepts unseen during training.

1 Introduction

Identifying disease names, which is generally called biomedical entity linking, is a fundamental process of biomedical natural language processing, and it can be utilized in applications such as literature search systems (Lee et al., 2016) and biomedical relation extraction (Xu et al., 2016). The usual system for identifying disease names consists of two modules: named entity recognition (NER) and named entity normalization (NEN). NER is the task of recognizing the span of a disease name, from its start position to its end position. NEN is the post-processing of NER, normalizing a disease name into a controlled vocabulary, such as MeSH or Online Mendelian Inheritance in Man (OMIM).

Although most previous studies have developed pipeline systems, in which the NER model first recognizes disease mentions (Lee et al., 2020; Weber et al., 2020) and the NEN model then normalizes the recognized mentions (Leaman et al., 2013; Ferré et al., 2020; Xu et al., 2020; Vashishth et al., 2020), a few approaches employ a joint learning architecture for these tasks (Leaman and Lu, 2016; Lou et al., 2017). These joint approaches simultaneously recognize and normalize disease names, utilizing their mutual benefits. For example, Leaman et al. (2013) demonstrated that dictionary-matching features, which are commonly used for NEN, are also effective for NER. While these joint learning models achieve high performance for both NER and NEN, they predominantly rely on hand-crafted features, which are difficult to construct because of the required domain knowledge.

Recently, a neural network (NN)-based model that does not require any hand-crafted features was applied to the joint learning of NER and NEN (Zhao et al., 2019). NER and NEN were defined as two token-level classification tasks; i.e., their model classified each token into IOB2 tags and concepts, respectively. Although their model achieved state-of-the-art performance for both NER and NEN, a concept that does not appear in the training data (i.e., a zero-shot situation) cannot be predicted properly.

One possible approach to handle this zero-shot situation is to utilize dictionary-matching features. Suppose that the input sentence "Classic polyarteritis nodosa is a systemic vasculitis" is given, where "polyarteritis nodosa" is the target entity. Even if it does not appear in the training data, it can be recognized and normalized by referring to a controlled vocabulary that contains "Polyarteritis Nodosa (MeSH: D010488)." Combining such a look-up mechanism with NN-based models, however, is not a trivial task; dictionary matching must be performed at the entity level, whereas standard NN-based NER and NEN are performed at the token level (for example, Zhao et al., 2019).

To overcome this problem, we propose a novel end-to-end approach for NER and NEN that combines dictionary-matching features with NN-based models. Based on the span-based model introduced by Lee et al. (2017), our model first computes span representations for all possible spans of the input sentence and then combines the dictionary-matching features with the span representations. Using the score obtained from both features, it directly classifies the disease concept. Thus, our model can handle the zero-shot problem by using dictionary-matching features while maintaining the performance of NN-based models.

Our model is also effective in situations other than the zero-shot condition. Consider the following input sentence: "We report the case of a patient who developed acute hepatitis," where "hepatitis" is the target entity that should be normalized to "drug-induced hepatitis." While the longer span "acute hepatitis" also appears plausible for stand-alone NER models, our end-to-end architecture assigns a higher score to the correct shorter span "hepatitis" due to the existence of the normalized term ("drug-induced hepatitis") in the dictionary.

Through experiments using two major NER and NEN corpora, we demonstrate that our model achieves competitive results on both corpora. Further analysis illustrates that the dictionary-matching features improve the performance of NEN in the zero-shot and other situations.

Our main contributions are twofold: (i) we propose a novel end-to-end model for disease name recognition and normalization that utilizes both NN-based features and dictionary-matching features; (ii) we demonstrate that combining dictionary-matching features with an NN-based model is highly effective for normalization, especially in zero-shot situations.

∗ Work done while at Nara Institute of Science and Technology.

Proceedings of the BioNLP 2021 workshop, pages 162–167
June 11, 2021. ©2021 Association for Computational Linguistics

Figure 1: The overview of our model. It combines the dictionary-matching scores with the context score obtained from PubMedBERT. The red boxes are the target spans, and c_i in the figure is the i-th concept in the dictionary.

2 Methods

2.1 Task Definition

Given an input sentence, which is a sequence of words x = {x_1, x_2, ..., x_|X|} in the biomedical literature, let us define S as the set of all possible spans and L as the set of concepts, which contains the special label Null for non-disease spans. Our goal is to predict a set of labeled spans y = {(i_k, j_k, d_k)}_{k=1}^{|Y|}, where (i, j) ∈ S are the word indices of the span in the sentence and d ∈ L is the disease concept.

2.2 Model Architecture

Our model predicts the concept for each span based on a score that is the weighted sum of two factors: the context score score_cont, obtained from the span representations, and the dictionary-matching score score_dict. Figure 1 illustrates the overall architecture of our model. We denote the score of span s as follows:

    score(s, c) = score_cont(s, c) + λ · score_dict(s, c)

where c ∈ L is the candidate concept and λ is a hyperparameter that balances the two scores. For concept prediction, the scores of all possible spans and concepts are calculated, and then the concept with the highest score is selected as the prediction for each span:

    y = argmax_{c ∈ L} score(s, c)
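As a concrete illustration, the scoring and prediction above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the span representation and dictionary scores below are random stand-ins for the BioBERT-based features and TF-IDF similarities, and all dimensions are made up.

```python
import numpy as np

# Toy dimensions; in the paper the span representation comes from BioBERT
# and the dictionary score from TF-IDF cosine similarity (both faked here).
rng = np.random.default_rng(0)
num_concepts, dim = 4, 8       # |L| concepts (one of them being Null), d-dim spans
lam = 0.9                      # λ, tuned on the development set in the paper

g_span = rng.normal(size=dim)                 # stand-in for g'_s of one span s
W = rng.normal(size=(num_concepts, dim))      # one weight vector W_c per concept

score_cont = W @ g_span                       # score_cont(s, c) = g'_s · W_c
score_dict = rng.uniform(size=num_concepts)   # stand-in for max-cosine dictionary scores

score = score_cont + lam * score_dict         # weighted sum of the two factors
predicted = int(np.argmax(score))             # y = argmax_c score(s, c)
```

In the full model this argmax is taken for every candidate span, so a span whose best concept is Null is simply discarded.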
Context score   The context score is computed similarly to that of Lee et al. (2017) and is based on the span representations. To compute the representation of each span, the input tokens are first encoded into token embeddings. We used BioBERT (Lee et al., 2020) as the encoder, a variant of bidirectional encoder representations from transformers (BERT) trained on a large amount of biomedical text. Given an input sentence containing T words, we obtain the contextualized embedding of each token using BioBERT as follows:

    h_{1:T} = BERT(x_1, x_2, ..., x_T)

where h_{1:T} are the embeddings of the input tokens.

Span representations are obtained by concatenating several features derived from the token embeddings:

    g_s = [h_start(s), h_end(s), ĥ_s, φ(s)]
    g′_s = GELU(FFNN(g_s))

where h_start(s) and h_end(s) are the embeddings of the start and end tokens of the span, respectively; ĥ_s is the weighted sum of the token embeddings in the span, obtained using an attention mechanism (Bahdanau et al., 2015); and φ(s) is the size of span s. These representations g_s are then fed into a simple feed-forward NN, FFNN, followed by a nonlinear function, GELU (Hendrycks and Gimpel, 2016).

Given a particular span representation and a candidate concept as inputs, we formulate the context score as follows:

    score_cont(s, c) = g′_s · W_c

where W ∈ R^{|L| × d_g} is the weight matrix associated with the concepts, and W_c is the weight vector for concept c.

Dictionary-matching score   We used the cosine similarity of TF-IDF vectors as the dictionary-matching feature. Because a concept has several synonyms, we calculated the cosine similarity for all synonyms of the concept and used the maximum as the score for that concept. The TF-IDF vectors are calculated using character-level n-gram statistics computed over all diseases appearing in the training dataset and the controlled vocabulary. For example, given the span "breast cancer," synonyms with high cosine similarity are "breast cancer (1.0)" and "male breast cancer (0.829)."

3 Experiment

3.1 Datasets

To evaluate our model, we chose two major datasets used for disease name recognition and normalization against a popular controlled vocabulary, MEDIC (Davis et al., 2012). Both datasets, the National Center for Biotechnology Information Disease (NCBID) corpus (Doğan et al., 2014) and the BioCreative V Chemical Disease Relation (BC5CDR) task corpus (Li et al., 2016), comprise PubMed titles and abstracts annotated with disease names and their corresponding normalized term IDs (CUIs). NCBID provides 593 training, 100 development, and 100 test documents, while BC5CDR evenly divides 1,500 documents into the three sets. We adopted the same version of MEDIC as TaggerOne (Leaman and Lu, 2016), and we dismissed the non-disease entity annotations contained in BC5CDR.

3.2 Baseline Models

We compared our model against several baselines. DNorm (Leaman et al., 2013) and NormCo (Wright et al., 2019) were used as pipeline models because of their high performance. In addition, we used a pipeline consisting of state-of-the-art models: BioBERT (Lee et al., 2020) for NER and BioSyn (Sung et al., 2020) for NEN.

TaggerOne (Leaman and Lu, 2016) and the transition-based model (Lou et al., 2017) were used as joint-learning models. These models outperformed the pipeline models on NCBID and BC5CDR. For the model introduced by Zhao et al. (2019), we could not reproduce the reported performance. Instead, we report the performance of a simple token-level joint learning model based on BioBERT, which is referred to as "joint (token)."

3.3 Implementation

We performed several preprocessing steps: splitting the text into sentences using the NLTK toolkit (Bird et al., 2009), removing punctuation, and resolving abbreviations using Ab3P (Sohn et al., 2008), a common abbreviation resolution module. We also merged the disease names in each training set into the controlled vocabulary, following the method of Lou et al. (2017).

For training, we set the learning rate to 5e-5 and the mini-batch size to 32. λ was set to 0.9 using the development sets. For BC5CDR, we trained the model using both the training and development sets, following Leaman and Lu (2016). For computational efficiency, we only consider spans of up to 10 words.
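The dictionary-matching score described in Section 2.2 can be sketched with plain character n-gram counts in place of the full TF-IDF weighting. This is a deliberate simplification, and the MeSH IDs and synonym lists in the toy dictionary below are illustrative, not the actual MEDIC entries:

```python
from collections import Counter
from math import sqrt

def char_ngrams(text, n=3):
    """Character n-gram counts (a count-based stand-in for the TF-IDF vectors)."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def dict_score(span, synonyms_by_concept):
    """score_dict(s, c): maximum similarity over all synonyms of each concept c."""
    span_vec = char_ngrams(span)
    return {cui: max(cosine(span_vec, char_ngrams(syn)) for syn in syns)
            for cui, syns in synonyms_by_concept.items()}

# Toy dictionary keyed by MeSH-style IDs; synonym lists are illustrative.
medic = {"D001943": ["breast cancer", "male breast cancer"],
         "D010488": ["polyarteritis nodosa"]}
scores = dict_score("breast cancer", medic)
# An exact synonym match yields similarity 1.0, mirroring the paper's example.
```

Because the score is a maximum over synonyms, adding more synonyms to a concept can only raise that concept's score, which is what lets unseen concepts benefit from rich vocabularies such as MEDIC.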
    Models                     NCBID NER   NCBID NEN   BC5CDR NER   BC5CDR NEN
    TaggerOne                    0.829       0.807       0.826        0.837
    Transition-based model       0.821       0.826       0.862        0.876
    NormCo                       0.829       0.840       0.826        0.830
    pipeline                     0.874       0.841       0.865        0.818
    joint (token)                0.864       0.765       0.855        0.817
    Ours without dictionary      0.884       0.781       0.864        0.808
    Ours                         0.891       0.854       0.867        0.851

Table 1: F1 scores of NER and NEN on NCBID and BC5CDR. Bold font represents the highest score.

                    standard              zero-shot
    dataset    mention   concept     mention   concept
    NCBID        781       135         179        56
    BC5CDR      4031       461         391       179

Table 2: Number of mentions and concepts in the standard and zero-shot situations.

                 Methods                    NCBID    BC5CDR
    zero-shot    Ours without dictionary      0         0
                 Ours                       0.704     0.597
    standard     Ours without dictionary    0.854     0.846
                 Ours                       0.905     0.877

Table 3: F1 scores for NEN on the NCBID and BC5CDR subsets for the zero-shot situation, where disease concepts do not appear in the training data, and the standard situation, where they do.

3.4 Evaluation Metrics

We evaluated the recognition performance of our model using micro-F1 at the entity level: a predicted span counts as a true positive when it is identical to a gold span. Following previous work (Wright et al., 2019; Leaman and Lu, 2016), the performance of NEN was evaluated using micro-F1 at the abstract level: if a predicted concept is found within the gold-standard concepts of the abstract, regardless of its location, it is considered a true positive.
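The abstract-level NEN metric can be sketched as follows. This is our reading of the protocol (per-abstract set intersection of predicted and gold concepts), so the function and its inputs are illustrative rather than the evaluation scripts actually used:

```python
def abstract_level_f1(gold, pred):
    """Micro-F1 over abstracts: a predicted concept is a true positive if it
    appears anywhere in that abstract's gold-standard concept set.
    `gold` and `pred` map abstract IDs to sets of concept IDs."""
    tp = fp = fn = 0
    for doc_id in gold:
        g, p = gold[doc_id], pred.get(doc_id, set())
        tp += len(g & p)   # predicted concepts present in the gold set
        fp += len(p - g)   # predicted concepts absent from the gold set
        fn += len(g - p)   # gold concepts the system missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# One gold concept found, one missed: precision 1.0, recall 0.5, F1 = 2/3.
f1 = abstract_level_f1({"doc1": {"D010488", "D003643"}}, {"doc1": {"D010488"}})
```

Note that because concepts are matched per abstract rather than per mention, a correct concept predicted at the wrong location still counts, which makes this metric more lenient than entity-level F1.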
4 Results & Discussions

Table 1 shows that our model achieved the highest F1 scores for both NER and NEN in most settings, except for NEN on BC5CDR, where the transition-based model displays its strength as a baseline. The proposed model outperformed the pipeline of state-of-the-art models on both tasks, which demonstrates that the improvement is attributed not to the strength of BioBERT but to the model architecture, including the end-to-end approach and the combination with dictionary-matching features.

Comparing the model variations, adding dictionary-matching features improved the performance of NEN. The results clearly suggest that dictionary-matching features are effective for NN-based NEN models.

4.1 Contribution of Dictionary-Matching

To analyze the behavior of our model in the zero-shot situation, we investigated the NEN performance on two subsets of both corpora: disease names whose concepts appear in the training data (i.e., the standard situation) and disease names whose concepts do not appear in the training data (i.e., the zero-shot situation). Table 2 shows the number of mentions and concepts in each situation. Table 3 displays the results for the zero-shot and standard situations. The proposed model with dictionary-matching features can classify disease concepts in the zero-shot situation, whereas the purely NN-based classification model cannot normalize these disease names.

The results for the standard situation demonstrate that combining dictionary-matching features also improves the performance even when the target concepts appear in the training data. This finding implies that an NN-based model can benefit from dictionary-matching features even when it can learn from abundant training data.

4.2 Case study

We examined 100 randomly sampled sentences to determine the contributions of the dictionary-matching features. There were 32 samples in which the model predicted concepts correctly thanks to the dictionary-matching features. Most of these samples involve disease concepts that do not appear in the training set but appear in the dictionary. For example, "pure red cell aplasia (MeSH: D012010)" is not in the BC5CDR training set, while MEDIC contains "Pure Red-Cell Aplasias" for "D012010". In this case, a high dictionary-matching score clearly leads to a correct prediction in the zero-shot situation.

In contrast, there were 32 samples in which the dictionary-matching features caused errors.
The sources of this error type are typically general disease names in MEDIC. For example, "Death (MeSH: D003643)" is incorrectly predicted as a disease concept in NER. Because such words are also used in general contexts, our model overestimated their dictionary-matching scores.

Furthermore, in the remaining samples, our model predicted the concept properly but the span incorrectly. For example, although "thoracic hematomyelia" is labeled as "MeSH: D020758" in the BC5CDR test set, our model recognized the span as "hematomyelia." In this case, our model relied mostly on the dictionary-matching features and misclassified the span because "hematomyelia" is in MEDIC but not in the training data.

4.3 Limitations

Our model is inferior to the transition-based model on BC5CDR. One possible reason is that the transition-based model utilizes normalized terms that co-occur within a sentence, whereas our model does not. Certain disease names that co-occur within a sentence are strongly useful for normalizing other disease names. Although BERT implicitly considers the interactions between disease names via the attention mechanism, a more explicit method would be preferable for normalizing diseases.

Another limitation is that our model treats all dictionary entries equally. Because certain terms in the dictionary may also be used for non-disease concepts, such as gene names, we must consider the relative importance of each concept.

5 Conclusion

We proposed an end-to-end model for disease name recognition and normalization that combines an NN-based model with dictionary-matching features. Our model achieved highly competitive results on the NCBI disease corpus and the BC5CDR corpus, demonstrating that incorporating dictionary-matching features into an NN-based model can improve its performance. Further experiments exhibited that dictionary-matching features enable our model to accurately predict concepts in the zero-shot situation and that they are also beneficial in the standard situation. While the results illustrate the effectiveness of our model, we found several areas for improvement, such as general terms in the dictionary and the interactions between disease names within a sentence. A possible future direction for dealing with general terms is to jointly train parameters representing the importance of each synonym.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural language processing with Python: Analyzing text with the natural language toolkit. O'Reilly Media, Inc.

Allan Peter Davis, Thomas C Wiegers, Michael C Rosenstein, and Carolyn J Mattingly. 2012. MEDIC: A practical disease vocabulary used at the Comparative Toxicogenomics Database. Database, 2012:bar065.

Rezarta Islamaj Doğan, Robert Leaman, and Zhiyong Lu. 2014. NCBI disease corpus: A resource for disease name recognition and concept normalization. J. Biomed. Inform., 47:1–10.

Arnaud Ferré, Louise Deléger, Robert Bossy, Pierre Zweigenbaum, and Claire Nédellec. 2020. C-Norm: a neural approach to few-shot entity normalization. BMC Bioinformatics, 21(Suppl 23):579.

Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415.

Robert Leaman, Rezarta Islamaj Doğan, and Zhiyong Lu. 2013. DNorm: Disease name normalization with pairwise learning to rank. Bioinformatics, 29(22):2909–2917.

Robert Leaman and Zhiyong Lu. 2016. TaggerOne: Joint named entity recognition and normalization with semi-Markov models. Bioinformatics, 32(18):2839–2846.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240.

Kenton Lee, Luheng He, Mike Lewis, and Luke Zettlemoyer. 2017. End-to-end neural coreference resolution. In EMNLP, pages 188–197.

Sunwon Lee, Donghyeon Kim, Kyubum Lee, Jaehoon Choi, Seongsoon Kim, Minji Jeon, Sangrak Lim, Donghee Choi, Sunkyu Kim, Aik-Choon Tan, and Jaewoo Kang. 2016. BEST: Next-generation biomedical entity search tool for knowledge discovery from biomedical literature. PLoS One, 11(10):e0164680.
Jiao Li, Yueping Sun, Robin J Johnson, Daniela Sciaky, Chih-Hsuan Wei, Robert Leaman, Allan Peter Davis, Carolyn J Mattingly, Thomas C Wiegers, and Zhiyong Lu. 2016. BioCreative V CDR task corpus: A resource for chemical disease relation extraction. Database, 2016:baw068.

Yinxia Lou, Yue Zhang, Tao Qian, Fei Li, Shufeng Xiong, and Donghong Ji. 2017. A transition-based joint model for disease named entity recognition and normalization. Bioinformatics, 33(15):2363–2371.

Sunghwan Sohn, Donald C Comeau, Won Kim, and W John Wilbur. 2008. Abbreviation definition identification based on automatic precision estimates. BMC Bioinformatics, 9:402.

Mujeen Sung, Hwisang Jeon, Jinhyuk Lee, and Jaewoo Kang. 2020. Biomedical entity representations with synonym marginalization. In ACL, pages 3641–3650.

Shikhar Vashishth, Rishabh Joshi, Denis Newman-Griffis, Ritam Dutt, and Carolyn Rose. 2020. MedType: Improving medical entity linking with semantic type prediction. arXiv preprint arXiv:2005.00460.

Leon Weber, Jannes Münchmeyer, Tim Rocktäschel, Maryam Habibi, and Ulf Leser. 2020. HUNER: Improving biomedical NER with pretraining. Bioinformatics, 36(1):295–302.

Dustin Wright, Yannis Katsis, Raghav Mehta, and Chun-Nan Hsu. 2019. NormCo: Deep disease normalization for biomedical knowledge base construction. In AKBC.

Dongfang Xu, Zeyu Zhang, and Steven Bethard. 2020. A Generate-and-Rank framework with semantic type regularization for biomedical concept normalization. In ACL, pages 8452–8464.

Jun Xu, Yonghui Wu, Yaoyun Zhang, Jingqi Wang, Hee-Jin Lee, and Hua Xu. 2016. CD-REST: A system for extracting chemical-induced disease relations in literature. Database, 2016:baw036.

Sendong Zhao, Ting Liu, Sicheng Zhao, and Fei Wang. 2019. A neural multi-task learning framework to jointly model medical named entity recognition and normalization. In AAAI, pages 817–824.
