A Generate-and-Rank Framework with Semantic Type Regularization for Biomedical Concept Normalization


                   Dongfang Xu and Zeyu Zhang and Steven Bethard
                                School of Information
                                University of Arizona
                                    Tucson, AZ
             {dongfangxu9,zeyuzhang,bethard}@email.arizona.edu

Abstract

Concept normalization, the task of linking textual mentions of concepts to concepts in an
ontology, is challenging because ontologies are large. In most cases, annotated datasets cover
only a small sample of the concepts, yet concept normalizers are expected to predict all
concepts in the ontology. In this paper, we propose an architecture consisting of a candidate
generator and a list-wise ranker based on BERT. The ranker considers pairings of concept
mentions and candidate concepts, allowing it to make predictions for any concept, not just
those seen during training. We further enhance this list-wise approach with a semantic type
regularizer that allows the model to incorporate semantic type information from the ontology
during training. Our proposed concept normalization framework achieves state-of-the-art
performance on multiple datasets.

1   Introduction

Mining and analyzing the constantly-growing unstructured text in the bio-medical domain offers
great opportunities to advance scientific discovery (Gonzalez et al., 2015; Fleuren and Alkema,
2015) and improve the clinical care (Rumshisky et al., 2016; Liu et al., 2019). However, lexical
and grammatical variations are pervasive in such text, posing key challenges for data
interoperability and the development of natural language processing (NLP) techniques. For
instance, heart attack, MI, myocardial infarction, and cardiovascular stroke all refer to the
same concept. It is critical to disambiguate these terms by linking them with their
corresponding concepts in an ontology or knowledge base. Such linking allows downstream tasks
(relation extraction, information retrieval, text classification, etc.) to access the
ontology’s rich knowledge about biomedical entities, their synonyms, semantic types and mutual
relationships.

Concept normalization is a task that maps concept mentions, the in-text natural-language
mentions of ontological concepts, to concept entries in a standardized ontology or knowledge
base. Techniques for concept normalization have been advancing, thanks in part to recent shared
tasks including clinical disorder normalization in 2013 ShARe/CLEF (Suominen et al., 2013) and
2014 SemEval Task 7 Analysis of Clinical Text (Pradhan et al., 2014), and adverse drug event
normalization in Social Media Mining for Health (SMM4H) (Sarker et al., 2018; Weissenbacher et
al., 2019). Most existing systems use a string-matching or dictionary look-up approach (Leal et
al., 2015; D’Souza and Ng, 2015; Lee et al., 2016), which are limited to matching
morphologically similar terms, or supervised multi-class classifiers (Belousov et al., 2017;
Tutubalina et al., 2018; Niu et al., 2019; Luo et al., 2019a), which may not generalize well
when there are many concepts in the ontology and the concept types that must be predicted do
not all appear in the training data.

We propose an architecture (shown in Figure 1) that is able to consider both morphological and
semantic information. We first apply a candidate generator to generate a list of candidate
concepts, and then use a BERT-based list-wise classifier to rank the candidate concepts. This
two-step architecture allows unlikely concept candidates to be filtered out prior to the final
classification, a necessary step when dealing with ontologies with millions of concepts. In
contrast to previous list-wise classifiers (Murty et al., 2018) which only take the concept
mention as input, our BERT-based list-wise classifier takes both the concept mention and the
candidate concept name as input, and is thus able to handle concepts that never appear in the
training data. We further enhance this list-wise approach with a semantic type regularizer that
allows our ranker to leverage semantic type information from

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8452–8464
July 5 - 10, 2020. ©2020 Association for Computational Linguistics
Candidate Generator                                                        Ranker
(multi-class BERT classifier or Lucene)                                   (list-wise BERT classifier)

                           C0220870       [CLS] head spinning a little [SEP] Lightheadedness [SEP] Light-headed feeling . . .   0.4
 head spinning a little    C0012833       [CLS] head spinning a little [SEP] Dizzyness [SEP] Dizziness symptom . . .            0.5
                           C0018681       [CLS] head spinning a little [SEP] headache [SEP] head pains . . .                    0.1
                           C0393760
                           ...

            Figure 1: Proposed architecture for concept normalization: candidate generation and ranking.
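The data flow in Figure 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`top_k_candidates`, `build_ranker_input`, `pick_best`) and the generator scores are hypothetical, the synonym lists are truncated as in the figure, and the ranker scores are copied from the figure rather than computed by BERT.

```python
from typing import Dict, List


def top_k_candidates(generator_scores: Dict[str, float], k: int) -> List[str]:
    """Keep the k concept IDs with the highest generator score."""
    ranked = sorted(generator_scores.items(), key=lambda item: item[1], reverse=True)
    return [cui for cui, _ in ranked[:k]]


def build_ranker_input(mention: str, synonyms: List[str]) -> str:
    """Format one (mention, candidate) pair as [CLS] m [SEP] syn_1 [SEP] ... [SEP]."""
    return "[CLS] " + mention + " [SEP] " + " [SEP] ".join(synonyms) + " [SEP]"


def pick_best(ranker_scores: Dict[str, float]) -> str:
    """Return the candidate concept with the highest ranker score."""
    return max(ranker_scores, key=ranker_scores.get)


mention = "head spinning a little"

# Hypothetical generator scores; Figure 1 lists the concept IDs but not their scores.
generator_scores = {"C0220870": 0.30, "C0012833": 0.28, "C0018681": 0.22, "C0393760": 0.05}

# Concept names as they appear in Figure 1 (the figure truncates the synonym lists too).
synonyms = {
    "C0220870": ["Lightheadedness", "Light-headed feeling"],
    "C0012833": ["Dizzyness", "Dizziness symptom"],
    "C0018681": ["headache", "head pains"],
}

candidates = top_k_candidates(generator_scores, k=3)
inputs = {cui: build_ranker_input(mention, synonyms[cui]) for cui in candidates}

# Scores copied from Figure 1; in the described system a list-wise BERT classifier produces them.
figure_scores = {"C0220870": 0.4, "C0012833": 0.5, "C0018681": 0.1}
best = pick_best({cui: figure_scores[cui] for cui in candidates})

print(inputs[best])
print(best)
```

Because the ranker sees the candidate's names as text, nothing in this flow requires the chosen concept to have appeared in the training data.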

the ontology during training.
  Our work makes the following contributions:

    • Our proposed concept normalization framework achieves state-of-the-art performance on
      multiple datasets.

    • We propose a concept normalization framework consisting of a candidate generator and
      a list-wise classifier. Our framework is easier to train and the list-wise classifier
      is able to predict concepts never seen during training.

    • We introduce a semantic type regularizer which encourages the model to consider the
      semantic type information of the candidate concepts. This semantic type regularizer
      improves performance over the BERT-based list-wise classifier on multiple datasets.

The code for our proposed generate-and-rank framework is available at
https://github.com/dongfang91/Generate-and-Rank-ConNorm.

2    Related work

Traditional approaches for concept normalization involve string match and dictionary look-up.
These approaches differ in how they construct dictionaries, such as collecting concept mentions
from the labeled data as extra synonyms (Leal et al., 2015; Lee et al., 2016), and in different
string matching techniques, such as string overlap and edit distance (Kate, 2016). Two of the
most commonly used knowledge-intensive concept normalization tools, MetaMap (Aronson, 2001) and
cTAKES (Savova et al., 2010) both employ rules to first generate lexical variants for each noun
phrase and then conduct dictionary look-up for each variant. Several systems (D’Souza and Ng,
2015; Jonnagaddala et al., 2016) have demonstrated that rule-based concept normalization
systems achieve performance competitive with other approaches in a sieve-based approach that
carefully selects combinations and orders of dictionaries, exact and partial matching, and
heuristic rules. However, such rule-based approaches struggle when there are great variations
between concept mention and concept, which is common, for example, when comparing social media
text to medical ontologies.

Due to the availability of shared tasks and annotated data, the field has shifted toward
machine learning techniques. We divide the machine learning approaches into two categories,
classification (Savova et al., 2008; Stevenson et al., 2009; Limsopatham and Collier, 2016;
Yepes, 2017; Festag and Spreckelsen, 2017; Lee et al., 2017; Tutubalina et al., 2018; Niu et
al., 2019) and learning to rank (Leaman et al., 2013; Liu and Xu, 2017; Li et al., 2017; Nguyen
et al., 2018; Murty et al., 2018).

Most classification-based approaches using deep neural networks have shown strong performance.
They differ in using different architectures, such as Gated Recurrent Units (GRU) with
attention mechanisms (Tutubalina et al., 2018), multi-task learning with auxiliary tasks to
generate attention weights (Niu et al., 2019), or pre-trained transformer networks (Li et al.,
2019; Miftahutdinov and Tutubalina, 2019); different sources for training word embeddings, such
as Google News (Limsopatham and Collier, 2016) or concept definitions from the Unified Medical
Language System (UMLS) Metathesaurus (Festag and Spreckelsen, 2017); and different input
representations, such as using character embeddings (Niu et al., 2019). All classification
approaches share the disadvantage that the output space must be the same size as the number of
concepts to be predicted, and thus the output space tends to be small such as 2,200 concepts in
(Limsopatham and Collier, 2016) and around 22,500 concepts in (Weissenbacher et al., 2019).
Classification approaches also struggle with concepts that have only a few example mentions in
the training data.

Researchers have applied point-wise learning to rank (Liu and Xu, 2017; Li et al., 2017),
pair-wise learning to rank (Leaman et al., 2013; Nguyen

et al., 2018), and list-wise learning to rank (Murty et al., 2018; Ji et al., 2019) on concept
normalization. Generally, the learning-to-rank approach has the advantage of reducing the
output space by first obtaining a smaller list of possible candidate concepts via a candidate
generator and then ranking them. DNorm (Leaman et al., 2013), based on a pair-wise
learning-to-rank model where both mentions and concept names were represented as TF-IDF
vectors, was the first to use learning-to-rank for concept normalization and achieved the best
performance in the ShARe/CLEF eHealth 2013 shared task. List-wise learning-to-rank approaches
are both computationally more efficient than pair-wise learning-to-rank (Cao et al., 2007) and
empirically outperform both point-wise and pair-wise approaches (Xia et al., 2008). There are
two implementations of list-wise classifiers using neural networks for concept normalization:
Murty et al. (2018) treat the selection of the best candidate concept as a flat classification
problem, losing the ability to handle concepts not seen during training; Ji et al. (2019) take
a generate-and-rank approach similar to ours, but they do not leverage resources such as
synonyms or semantic type information from UMLS in their BERT-based ranker.

3     Proposed methods

3.1    Concept normalization framework

We define a concept mention $m$ as an abbreviation such as “MI”, a noun phrase such as “heart
attack”, or even a short text such as “an obstruction of the blood supply to the heart”. The
goal is then to assign $m$ with a concept $c$. Formally, given a list of pre-identified concept
mentions $M = \{m_1, m_2, \ldots, m_n\}$ in the text and an ontology or knowledge base with a
set of concepts $C = \{c_1, c_2, \ldots, c_t\}$, the goal of concept normalization is to find a
mapping function $c_j = f(m_i)$ that maps each textual mention to its correct concept.

We approach concept normalization in two steps: we first use a candidate generator
$G(m, C) \to C_m$ to generate a list of candidate concepts $C_m$ for each mention $m$, where
$C_m \subseteq C$ and $|C_m| \ll |C|$. We then use a candidate ranker
$R(m, C_m) \to \hat{C}_m$, where $\hat{C}_m$ is a re-ranked list of candidate concepts sorted
by their relevance, preference, or importance. But unlike information retrieval tasks where the
order of candidate concepts in the sorted list $\hat{C}_m$ is crucial, in concept normalization
we care only that the one true concept is at the top of the list.

The main idea of the two-step approach is that we first use a simple and fast system with high
recall to generate candidates, and then a more precise system with more discriminative input to
rank the candidates.

3.2     Candidate generator

We implement two kinds of candidate generators: a BERT-based multi-class classifier when the
number of concepts in the ontology is small, and a Lucene-based¹ dictionary look-up when there
are hundreds of thousands of concepts in the ontology.

3.2.1     BERT-based multi-class classifier

BERT (Devlin et al., 2019) is a contextualized word representation model that has shown great
performance in many NLP tasks. Here, we use BERT in a multi-class text-classification
configuration as our candidate concept generator. We use the final hidden vector
$V_m \in \mathbb{R}^H$ corresponding to the first input token ([CLS]) generated from BERT($m$)
and a classification layer with weights $W \in \mathbb{R}^{|C| \times H}$, and train the model
using a standard classification loss:

    $L_G = y * \log(\mathrm{softmax}(V_m W^T))$       (1)

where $y$ is a one-hot vector, and $|y| = |C|$. The score for all concepts is calculated as:

    $p(C) = \mathrm{softmax}(V_m W^T)$       (2)

We select the top $k$ most probable concepts in $p(C)$ and feed that list $C_m$ to the ranker.

3.2.2     Lucene-based dictionary look-up system

Multi-pass sieve rule based systems (D’Souza and Ng, 2015; Jonnagaddala et al., 2016; Luo et
al., 2019b) achieve competitive performance when used with the right combinations and orders of
different dictionaries, exact and partial matching, and heuristic rules. Such systems relying
on basic lexical matching algorithms are simple and fast to implement, but they are only able
to generate candidate concepts which are morphologically similar to a given mention.

Inspired by the work of Luo et al. (2019b), we implement a Lucene-based sieve normalization
system which consists of the following components (see Appendix A.1 for details):

   ¹ https://lucene.apache.org/

  a. Lucene index over the training data finds all mentions that exactly match $m$.

  b. Lucene index over ontology finds concepts whose preferred name exactly matches $m$.

  c. Lucene index over ontology finds concepts where at least one synonym of the concept
     exactly matches $m$.

  d. Lucene index over ontology finds concepts where at least one synonym of the concept has
     high character overlap with $m$.

The ranked list $C_m$ generated by this system is fed as input to the candidate ranker.

3.3     Candidate ranker

After the candidate generator produces a list of concepts, we use a BERT-based list-wise
classifier to select the most likely candidate. BERT allows us to match morphologically
dissimilar (but semantically similar) mentions and concepts, and the list-wise classifier takes
both mention and candidate concepts as input, allowing us to handle concepts that appear
infrequently (or never) in the training data.

Here, we use BERT similar to a question answering configuration, where given a concept mention
$m$, the task is to choose the most likely candidate concept $c_m$ from all candidate concepts
$C_m$. As shown in Figure 1, our classifier input includes the text of the mention $m$ and all
synonyms of the candidate concept $c_m$, and takes the form
[CLS] $m$ [SEP] $syn_1(c_m)$ [SEP] ... [SEP] $syn_s(c_m)$ [SEP], where $syn_i(c_m)$ is the
$i$th synonym of concept $c_m$.² We calculate the final hidden vector
$V_{(m,c_m)} \in \mathbb{R}^H$ corresponding to the first input token ([CLS]) generated from
BERT for each such input, and then concatenate the hidden vectors of all candidate concepts to
form a matrix $V_{(m,C_m)} \in \mathbb{R}^{|C_m| \times H}$. We use this matrix and
classification layer weights $W \in \mathbb{R}^H$, and compute a standard classification loss:

    $L_R = y * \log(\mathrm{softmax}(V_{(m,C_m)} W^T))$       (3)

where $y$ is a one-hot vector, and $|y| = |C_m|$.

3.4     Semantic type regularizer

To encourage the list-wise classifier towards a more informative ranking than just getting the
correct concept at the top of the list, we propose a semantic type regularizer that is
optimized when candidate concepts with the correct semantic type are ranked above candidate
concepts with incorrect types. The semantic type of the candidate concept is assumed correct
only if it exactly matches the semantic type of the gold truth concept. If the concept has
multiple semantic types, all must match. Our semantic type regularizer consists of two
components:

    $R_p(\hat{y}_t, \hat{y}_p) = \sum_{p \in P(y)} (m_1 + \hat{y}_p - \hat{y}_t)$       (4)

    $R_n(\hat{y}_p, \hat{y}_n) = \sum_{p \in P(y)} \max_{n \in N(y)} (m_2 + \hat{y}_n - \hat{y}_p)$       (5)

where $\hat{y} = V_{(m,c_m)} W^T$, $N(y)$ is the set of indexes of candidate concepts with
incorrect semantic types (negative candidates), $P(y)$ (positive candidates) is the complement
of $N(y)$, $\hat{y}_t$ is the score of the gold truth candidate concept, and thus
$t \in P(y)$. The margins $m_1$ and $m_2$ are hyper-parameters for controlling the minimal
distances between $\hat{y}_t$ and $\hat{y}_p$ and between $\hat{y}_p$ and $\hat{y}_n$,
respectively. Intuitively, $R_p$ tries to push the score of the gold truth concept above all
positive candidates at least by $m_1$, and $R_n$ tries to push the best scored negative
candidate below all positive candidates by $m_2$.

The final loss function we optimize for the BERT-based list-wise classifier is:

    $L = L_R + \lambda R_p(\hat{y}_t, \hat{y}_p) + \mu R_n(\hat{y}_p, \hat{y}_n)$       (6)

where $\lambda$ and $\mu$ are hyper-parameters to control the tradeoff between standard
classification loss and the semantic type regularizer.

4     Experiments

4.1     Datasets

Our experiments are conducted on three social media datasets, AskAPatient (Limsopatham and
Collier, 2016), TwADR-L (Limsopatham and Collier, 2016), and SMM4H-17 (Sarker et al., 2018),
and one clinical notes dataset, MCN (Luo et al., 2019b). We summarize dataset characteristics
in Table 1.

AskAPatient The AskAPatient dataset³ contains 17,324 adverse drug reaction (ADR) annotations
  collected from blog posts. The mentions are mapped to 1,036 medical concepts with

   ² In preliminary experiments, we tried only the concept’s preferred term and several other
     ways of separating synonyms, but none of these resulted in better performance.
   ³ http://dx.doi.org/10.5281/zenodo.55013

Dataset                         AskAPatient       TwADR-L   SMM4H-17      MCN
Ontology                        SNOMED-CT & AMT   MedDRA    MedDRA (PT)   SNOMED-CT & RxNorm
Subset                          Y                 Y         N             N
|C_ontology|                    1,036             2,220     22,500        434,056
|ST_ontology|                   22                18        61            125
|C_dataset|                     1,036             2,220     513           3,792
|M|                             17,324            5,074     9,149         13,609
|M_train|                       15,665.2          4,805.7   5,319         5,334
|M_test|                        866.2             142.7     2,500         6,925
|M| / |C_dataset|               16.72             2.29      17.83         3.59
|C_test − C_train|              0                 0         43            2,256
|M_test − M_train| / |M_test|   39.7%             39.5%     34.7%         53.9%
|M_ambiguous| / |M|             1.2%              12.8%     0.8%          4.5%

Table 1: Dataset statistics, where C is a set of concepts, ST is a set of semantic types, and M is a set of mentions.
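Several rows of Table 1 are simple counts over the train and test splits. The sketch below shows one way such counts could be computed. It is an assumption-laden illustration: the function name `split_statistics`, the toy mention strings and the concept IDs are made up, and the counting conventions (for example, counting test mention instances rather than unique strings) are a guess, not taken from the paper.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (mention text, gold concept ID)


def split_statistics(train: List[Pair], test: List[Pair]) -> Dict[str, float]:
    """Compute Table 1 style statistics for one train/test split."""
    train_mentions = {mention for mention, _ in train}
    train_concepts = {concept for _, concept in train}
    test_concepts = {concept for _, concept in test}
    all_concepts = train_concepts | test_concepts

    # Test mention instances whose surface string never occurs in training.
    unseen_mentions = sum(1 for mention, _ in test if mention not in train_mentions)
    # Test mention instances whose gold concept does occur in training.
    seen_concept_mentions = sum(1 for _, concept in test if concept in train_concepts)

    return {
        "mentions_per_concept": (len(train) + len(test)) / len(all_concepts),
        "unseen_test_concepts": len(test_concepts - train_concepts),
        "unseen_test_mention_rate": unseen_mentions / len(test),
        # Best accuracy reachable by a system that can only predict training concepts.
        "seen_concept_ceiling": seen_concept_mentions / len(test),
    }


# Made-up toy split; the IDs C1 to C3 are placeholders, not real ontology entries.
train = [("heart attack", "C1"), ("mi", "C1"), ("dizzy", "C2")]
test = [("heart attack", "C1"), ("head spinning", "C2"),
        ("zombie feeling", "C3"), ("tired", "C3")]

stats = split_statistics(train, test)
for name, value in stats.items():
    print(name, value)
```

The last statistic is the kind of ceiling that limits a classifier restricted to concepts seen in training, which is why datasets with many unseen test concepts favor rankers that read the candidate's name.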

  22 semantic types from the subset of Systematized Nomenclature Of Medicine-Clinical Terms
  (SNOMED-CT) and the Australian Medicines Terminology (AMT). We follow the 10-fold cross
  validation (CV) configuration in Limsopatham and Collier (2016) which provides 10 sets of
  train/dev/test splits.

TwADR-L The TwADR-L dataset³ contains 5,074 ADR expressions from social media. The mentions
  are mapped to 2,220 Medical Dictionary for Regulatory Activities (MedDRA) concepts with 18
  semantic types. We again follow the 10-fold cross validation configuration defined by
  Limsopatham and Collier (2016).

SMM4H-17 The SMM4H-17 dataset⁴ consists of 9,149 manually curated ADR expressions from
  tweets. The mentions are mapped to 22,500 concepts with 61 semantic types from MedDRA
  Preferred Terms (PTs). We use the 5,319 mentions from the released set as our training data,
  and keep the 2,500 mentions from the original test set as evaluation.

MCN The MCN dataset consists of 13,609 concept mentions drawn from 100 discharge summaries
  from the fourth i2b2/VA shared task (Uzuner et al., 2011). The mentions are mapped to 3,792
  unique concepts out of 434,056 possible concepts with 125 semantic types in SNOMED-CT and
  RxNorm. We take 40 clinical notes from the released data as training, consisting of 5,334
  mentions, and the standard evaluation data with 6,925 mentions as our test set. Around 2.7%
  of mentions in MCN could not be mapped to any concepts in the terminology, and are assigned
  the CUI-less label.

A major difference between the datasets is the space of concepts that systems must consider.
For AskAPatient and TwADR-L, all concepts in the test data are also in the training data, and
in both cases only a couple thousand concepts have to be considered. Both SMM4H-17 and MCN
define a much larger concept space: SMM4H-17 considers 22,500 concepts (though only 513 appear
in the data) and MCN considers 434,056 (though only 3,792 appear in the data). AskAPatient and
TwADR-L have no unseen concepts in their test data, SMM4H-17 has a few (43), while MCN has a
huge number (2,256). Even a classifier that perfectly learned all concepts in the training data
could achieve only 70.15% accuracy on MCN. MCN also has more unseen mentions: 53.9%, where the
other datasets have less than 40%. The MCN dataset is thus harder to memorize, as systems must
consider many mentions and concepts never seen in training.

Unlike the clinical MCN dataset, in the three social media datasets (AskAPatient, TwADR-L, and
SMM4H-17) it is common for the ADR expressions to share no words with their target medical
concepts. For instance, the ADR expression “makes me like a zombie” is assigned the concept
“C1443060” with preferred term “feeling abnormal”. The social media datasets do not include
context, only the mentions themselves, while the MCN dataset provides the entire note
surrounding each mention. Since only 4.5% of mentions in the MCN dataset are ambiguous, for the
current experiments we ignore this additional context information.

4.2     Unified Medical Language System

The UMLS Metathesaurus (Bodenreider, 2004) links similar names for the same concept

   ⁴ http://dx.doi.org/10.17632/rxwfb3tysd.1

from nearly 200 different vocabularies such as SNOMED-CT, MedDRA, RxNorm, etc. There
are over 3.5 million concepts in UMLS, and for each concept, UMLS also provides the
definition, preferred term, synonyms, semantic type, relationships with other concepts,
etc.
   In our experiments, we make use of synonyms and semantic type information from UMLS.
We restrict our concepts to three vocabularies, MedDRA, SNOMED-CT, and RxNorm, in the
UMLS version 2017AB. For each concept in the ontologies of the four datasets, we first
find its concept unique identifier (CUI) in UMLS. We then extract synonyms and semantic
type information according to the CUI. Synonyms (English only) are collected from
level 0 terminologies containing vocabulary sources for which no additional license
agreements are necessary.

4.3   Evaluation metrics

For all four datasets, the standard evaluation of concept normalization systems is
accuracy. For the AskAPatient and TwADR-L datasets, which use 10-fold cross validation,
the accuracy metrics are averaged over 10 folds.

4.4   Implementation details

We use the BERT-based multi-class classifier as the candidate generator on the three
social media datasets AskAPatient, TwADR-L, and SMM4H-17, and the Lucene-based candidate
generator for the MCN dataset. In the social media datasets, the number of concepts in
the data is small, few test concepts are unseen in the training data, and there is a
greater need to match expressions that are morphologically dissimilar from medical
concepts. In the clinical MCN dataset, the opposite is true.
   For all experiments, we use BioBERT-base (Lee et al., 2019), which further pre-trains
BERT on PubMed abstracts (PubMed) and PubMed Central full-text articles (PMC). We use
Hugging Face’s PyTorch implementation of BERT.⁵ We select the best hyper-parameters based
on the performance on the dev set. See Appendix A.2 for hyperparameter settings.

⁵ https://github.com/huggingface/transformers

4.5   Comparisons with related methods

We compare our proposed architecture with the following state-of-the-art systems.

WordCNN: Limsopatham and Collier (2016) use convolutional neural networks over
  pre-trained word embeddings to generate a vector representation for each mention, and
  then feed these into a softmax layer for multi-class classification.

WordGRU+Attend+TF-IDF: Tutubalina et al. (2018) use a bidirectional GRU with attention
  over pre-trained word embeddings to generate a vector representation for each mention,
  concatenate such vector representations with the cosine similarities of the TF-IDF
  vectors between the mention and all other concept names, and then feed the concatenated
  vector to a softmax layer for multi-class classification.

BERT+TF-IDF: Miftahutdinov and Tutubalina (2019) take a similar approach to Tutubalina
  et al. (2018), but use BERT to generate a vector representation for each mention. They
  concatenate the vector representations with the cosine similarities of the TF-IDF
  vectors between the mention and all other concept names, and then feed the concatenated
  vector to a softmax layer for multi-class classification.

CharCNN+Attend+MT: Niu et al. (2019) use a multi-task attentional character-level
  convolutional neural network. They first convert the mention into a character embedding
  matrix. The auxiliary task network takes the embedding matrix as input for a CNN to
  learn to generate character-level domain-related importance weights. Such learned
  importance weights are concatenated with the character embedding matrix and fed as
  input to another CNN model with a softmax layer for multi-class classification.

CharLSTM+WordLSTM: Han et al. (2017) first use a forward LSTM over each character of the
  mention and its corresponding character class such as lowercase or uppercase to
  generate a character-level vector representation, then use another bi-directional LSTM
  over each word of the mention to generate a word-level representation. They concatenate
  character-level and word-level representations and feed them as input to a softmax
  layer for multi-class classification.

LR+MeanEmbedding: Belousov et al. (2017) calculate the mean of three different weighted
  word embeddings pre-trained on GoogleNews, Twitter and DrugTwitter as vector
  representations for
  the mention, where word weights are calculated as inverse document frequency. Such
  vector representations are fed as input to a multinomial logistic regression (LR) model
  for multi-class classification.

Sieve-based: Luo et al. (2019b) build a sieve-based normalization model which contains
  exact-match and MetaMap (Aronson, 2001) modules. Given a mention as input, the
  exact-match module first looks for mentions in the training data that exactly match the
  input, and then looks for concepts from the ontology whose synonyms exactly match the
  input. If no concepts are found, the mention is fed into MetaMap. They run this
  sieve-based normalization model twice. In the first round, the model lower-cases the
  mentions and includes acronym/abbreviation tokens during dictionary lookup. In the
  second round, the model lower-cases the mention spans and also removes special tokens
  such as “'s”, “"”, etc.

Since our focus is individual systems, not ensembles, we compare only to other
non-ensembles.⁶

⁶ An ensemble of three systems (including CharLSTM+WordLSTM and LR+MeanEmbedding)
achieved 88.7% accuracy on the SMM4H-17 dataset (Sarker et al., 2018).

4.6    Models

We separate out the different contributions from the following components of our
architecture.

BERT: The BERT-based multi-class classifier. When used alone, we select the most probable
  concept as the prediction.

Lucene: The Lucene-based dictionary look-up. When used alone, we take the top-ranked
  candidate concept as the prediction.

+BERT-rank: The BERT-based list-wise classifier, always used in combination with either
  BERT or Lucene as a candidate generator.

+ST-reg: The semantic type regularizer, always used in combination with BERT-ranker.

We also consider the case (+gold) where we artificially inject the correct concept into
the candidate generator’s list if it was not already there.

                                                     TwADR-L      AskAPatient     SMM4H-17
Approach                                              Dev   Test    Dev   Test    Dev   Test
WordCNN (Limsopatham and Collier, 2016)                 -  44.78      -  81.41      -      -
WordGRU+Attend+TF-IDF (Tutubalina et al., 2018)         -      -      -  85.71      -      -
BERT+TF-IDF (Miftahutdinov and Tutubalina, 2019)        -      -      -      -      -  89.64
CharCNN+Attend+MT (Niu et al., 2019)                    -  46.46      -  84.65      -      -
CharLSTM+WordLSTM (Han et al., 2017)                    -      -      -      -      -  87.20
LR+MeanEmbedding (Belousov et al., 2017)                -      -      -      -      -  87.70
BERT                                                47.08  44.05  88.63  87.52  84.74  87.36
BERT + BERT-rank                                    48.07  46.32  88.14  87.10  84.44  87.66
BERT + BERT-rank + ST-reg                           47.98  47.02  88.26  87.46  84.66  88.24
BERT + gold + BERT-rank                             52.70  49.69  89.06  87.92  88.57  90.16
BERT + gold + BERT-rank + ST-reg                    52.84  50.81  89.68  88.51  88.87  91.08

Table 2: Comparisons of our proposed concept normalization architecture against the
current state-of-the-art performances on TwADR-L, AskAPatient, and SMM4H-17 datasets.

                                          MCN
Approach                              Dev     Test
Sieve-based (Luo et al., 2019b)         -    76.35
Lucene                                       79.25
Lucene+BERT-rank                    83.56    82.75
Lucene+BERT-rank+ST-reg             84.44    83.56
Lucene+gold+BERT-rank               86.89    84.77
Lucene+gold+BERT-rank+ST-reg        88.59    86.56

Table 3: Accuracy of our proposed concept normalization architecture on MCN dataset.

5    Results

Table 2 shows that our complete model, BERT + BERT-rank + ST-reg, achieves a new
state-of-the-art on two of the social media test sets, and Table 3 shows that Lucene +
BERT-rank + ST-reg achieves a new state-of-the-art on the clinical MCN test set. The
TwADR-L dataset is the most difficult, with our complete model achieving 47.02% accuracy.
In the other datasets, performance of our complete
model is much higher: 87.46% for AskAPatient, 88.24% for SMM4H-17.⁷

⁷ Miftahutdinov and Tutubalina (2019) use the same architecture as our BERT-based
multi-class classifier (row 7), but they achieve 89.28% accuracy on SMM4H-17. We were
unable to replicate this result as their code and parameter settings were unavailable.

   On the TwADR-L, SMM4H-17, and MCN test sets, adding the BERT-based ranker improves
performance over the candidate generator alone, and adding the semantic type
regularization further improves performance. For example, Lucene alone achieves 79.25%
accuracy on the MCN data, adding the BERT ranker increases this to 82.75%, and adding the
semantic type regularizer increases this to 83.56%. On AskAPatient, performance of the
full model is similar to just the BERT multi-class classifier, perhaps because in this
case BERT alone already successfully improves the state-of-the-art from 85.71% to 87.52%.
The +gold setting allows us to answer how well our ranker would perform if our candidate
generator made no mistakes. First, we can see that if the correct concept is always in
the candidate list, our list-based ranker (+BERT-rank) outperforms the multi-class
classifier (BERT) on all test sets. We also see in this setting that the benefits of the
semantic type regularizer are amplified, with test sets of TwADR-L and MCN showing more
than 1.00% gain in accuracy from using the regularizer. These findings suggest that
improving the quality of the candidate generator should be a fruitful future direction.
   Overall, we see the biggest performance gains from our proposed generate-and-rank
architecture in the MCN dataset. This is the most realistic setting, where the number of
candidate concepts is large and many test concepts were never seen during training. In
such cases, we cannot use a multi-class classifier as a candidate generator since it
would never generate unseen concepts. Thus, our ranker shines in its ability to sort
through the long list of possible concepts.

6       Qualitative analysis

Table 4 shows an example that is impossible for the multi-class classifier approach to
concept normalization. The concept mention “an abdominal wall hernia” in the clinical MCN
dataset needs to be mapped to the concept with the preferred name “Hernia of abdominal
wall”, but that concept never appeared in the training data. The Lucene-based candidate
generator finds this concept, but only through character overlap (step d.) and several
other concepts have high overlap as well. Thus Lucene ranks the correct concept 4th in
its list. The BERT ranker is able to compare “an abdominal wall hernia” to “Hernia of
abdominal wall” and recognize that as a better match than the other options, re-assigning
it to rank 1.

Candidates                                        L    BR
Repair of abdominal wall hernia                   1     3
Repair of anterior abdominal wall hernia          2     4
Obstructed hernia of anterior abdominal wall      3     5
Hernia of abdominal wall                          4     1
Abdominal wall hernia procedure                   5     2

Table 4: Predicted candidate concepts for mention An abdominal wall hernia and their
rankings among the outputs of Lucene (L) and BERT-Ranker (BR). Gold concept is Hernia of
abdominal wall.

   Table 5 shows an example that illustrates why the semantic type regularizer helps. The
mention “felt like I was coming down with flu” in the social media AskAPatient dataset
needs to be mapped to the concept with the preferred name “influenza-like symptoms”,
which has the semantic type of a sign or symptom. The BERT ranker ranks two disease or
syndromes higher, placing the correct concept at rank 3. After the semantic type
regularizer is added, the system recognizes that the mention should be mapped to a sign
or symptom, and correctly ranks it above the disease or syndromes. Note that this happens
even though the ranker does not get to see the semantic type of the input mention at
prediction time.

Candidates                   BR    STR     ST
Influenza-like illness        1      2     DS
Influenza                     2      4     DS
Influenza-like symptoms       3      1     SS
Feeling tired                 4      5      F
Muscle cramps in feet         5      3     SS

Table 5: Predicted candidate concepts for mention felt like I was coming down with flu
and their rankings among the outputs of BERT-Ranker (BR) and BERT-Ranker + semantic type
regularizer (STR). Gold concept is flu-like symptoms. Semantic types (ST) of the
candidates include: disease or syndrome (DS), sign or symptom (SS), finding (F).

7     Limitations and future research

The available concept normalization datasets are somewhat limited. Lee et al. (2017) note
that AskAPatient and TwADR-L have issues including
duplicate instances, which can lead to bias in the system; many phrases have multiple
valid mappings to concepts but the context necessary to disambiguate is not part of the
dataset; and the 10-fold cross-validation makes training complex models unnecessarily
expensive. These datasets are also unrealistic in that all concepts in the test data are
seen during training. Future research should focus on more realistic datasets that follow
the approach of MCN in annotating mentions of concepts from a large ontology and
including the full context.
   Our ability to explore the size of the candidate list was limited by our available
computational resources. As the size of the candidate list increases, the true concept is
more likely to be included, but the number of training instances also increases, making
the computational cost larger, especially for the datasets using 10-fold
cross-validation. We chose candidate list sizes as large as we could afford, but there
are likely further gains possible with larger candidate lists.
   Our semantic type regularizer is limited to exact matching: it checks only whether the
semantic type of a candidate exactly matches the semantic type of the true concept. The
UMLS ontology includes many other relations, such as is-a and part-of relations, and
extending our regularizer to encode such rich semantic knowledge may yield further
improvements in the BERT-based ranker.

8   Conclusion

We propose a concept normalization framework consisting of a candidate generator and a
list-wise classifier based on BERT.
   Because the candidate ranker makes predictions over pairs of concept mentions and
candidate concepts, it is able to predict concepts never seen during training. Our
proposed semantic type regularizer allows the ranker to incorporate semantic type
information into its predictions without requiring semantic types at prediction time.
This generate-and-rank framework achieves state-of-the-art performance on multiple
concept normalization datasets.

Acknowledgments

We thank the anonymous reviewers for their insightful comments on an earlier draft of
this paper. This work was supported in part by National Institutes of Health grant
R01LM012918 from the National Library of Medicine (NLM) and grant R01GM114355 from the
National Institute of General Medical Sciences (NIGMS). The computations were done in
systems supported by the National Science Foundation under Grant No. 1228509. This
research was supported in part by an appointment to the Oak Ridge National Laboratory
Advanced Short-Term Research Opportunity (ASTRO) Program, sponsored by the U.S.
Department of Energy and administered by the Oak Ridge Institute for Science and
Education. The content is solely the responsibility of the authors and does not
necessarily represent the official views of the National Institutes of Health, National
Science Foundation, or Department of Energy.

References

Alan R. Aronson. 2001. Effective mapping of biomedical text to the UMLS Metathesaurus:
  the MetaMap program. In Proceedings of the AMIA Symposium, pages 17–21. American
  Medical Informatics Association.

Maksim Belousov, William Dixon, and Goran Nenadic. 2017. Using an Ensemble of Generalised
  Linear and Deep Learning Models in the SMM4H 2017 Medical Concept Normalisation Task.
  In CEUR Workshop Proceedings, volume 1996, pages 54–58.

Olivier Bodenreider. 2004. The Unified Medical Language System (UMLS): integrating
  biomedical terminology. Nucleic Acids Research, 32(suppl 1):D267–D270.

Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. 2007. Learning to Rank: From
  Pairwise Approach to Listwise Approach. In Proceedings of the 24th International
  Conference on Machine Learning, pages 129–136. Association for Computing Machinery.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT:
  Pre-training of deep bidirectional transformers for language understanding. In
  Proceedings of the 2019 Conference of the North American Chapter of the Association for
  Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short
  Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational
  Linguistics.

Jennifer D’Souza and Vincent Ng. 2015. Sieve-based entity linking for the biomedical
  domain. In Proceedings of the 53rd Annual Meeting of the Association for Computational
  Linguistics and the 7th International Joint Conference on Natural Language Processing
  (Volume 2: Short Papers), pages 297–302, Beijing, China. Association for Computational
  Linguistics.

Sven Festag and Cord Spreckelsen. 2017. Word Sense Disambiguation of Medical Terms via
  Recurrent Convolutional Neural Networks. In Health Informatics Meets EHealth: Digital
  Insight – Information-Driven Health & Care. Proceedings of the 11th EHealth2017
  Conference, volume 236, pages 8–15.

Wilco W.M. Fleuren and Wynand Alkema. 2015. Application of text mining in the biomedical
  domain. Methods, 74:97–106.

Graciela H. Gonzalez, Tasnia Tahsin, Britton C. Goodale, Anna C. Greene, and Casey S.
  Greene. 2015. Recent Advances and Emerging Applications in Text and Data Mining for
  Biomedical Discovery. Briefings in Bioinformatics, 17(1):33–42.

Sifei Han, Tung Tran, Anthony Rios, and Ramakanth Kavuluru. 2017. Team UKNLP: Detecting
  ADRs, Classifying Medication Intake Messages, and Normalizing ADR Mentions on Twitter.
  In CEUR Workshop Proceedings, volume 1996, pages 49–53.

Zongcheng Ji, Qiang Wei, and Hua Xu. 2019. Bert-based ranking for biomedical entity
  normalization. arXiv preprint arXiv:1908.03548.

Jitendra Jonnagaddala, Toni Rose Jue, Nai-Wen Chang, and Hong-Jie Dai. 2016. Improving
  the dictionary lookup approach for disease normalization using enhanced dictionary and
  query expansion. Database, 2016:baw112.

Rohit J. Kate. 2016. Normalizing clinical terms using learned edit distance patterns.
  Journal of the American Medical Informatics Association, 23(2):380–386.

André Leal, Bruno Martins, and Francisco Couto. 2015. ULisboa: Recognition and
  normalization of medical concepts. In Proceedings of the 9th International Workshop on
  Semantic Evaluation (SemEval 2015), pages 406–411, Denver, Colorado. Association for
  Computational Linguistics.

Robert Leaman, Rezarta Islamaj Doğan, and Zhiyong Lu. 2013. DNorm: disease name
  normalization with pairwise learning to rank. Bioinformatics, 29(22):2909–2917.

Hsin-Chun Lee, Yi-Yu Hsu, and Hung-Yu Kao. 2016. AuDis: an automatic CRF-enhanced disease
  normalization in biomedical text. Database, 2016. Baw091.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo
  Kang. 2019. BioBERT: a pre-trained biomedical language representation model for
  biomedical text mining. Bioinformatics. Btz682.

Kathy Lee, Sadid A. Hasan, Oladimeji Farri, Alok Choudhary, and Ankit Agrawal. 2017.
  Medical Concept Normalization for Online User-Generated Texts. In 2017 IEEE
  International Conference on Healthcare Informatics (ICHI), pages 462–469. IEEE.

Fei Li, Yonghao Jin, Weisong Liu, Bhanu Pratap Singh Rawat, Pengshan Cai, and Hong Yu.
  2019. Fine-Tuning Bidirectional Encoder Representations From Transformers (BERT)–Based
  Models on Large-Scale Electronic Health Record Notes: An Empirical Study. JMIR Med
  Inform, 7(3):e14830.

Haodi Li, Qingcai Chen, Buzhou Tang, Xiaolong Wang, Hua Xu, Baohua Wang, and Dong Huang.
  2017. CNN-based ranking for biomedical entity normalization. BMC Bioinformatics,
  18(11):385.

Nut Limsopatham and Nigel Collier. 2016. Normalising medical concepts in social media
  texts by learning semantic representation. In Proceedings of the 54th Annual Meeting of
  the Association for Computational Linguistics (Volume 1: Long Papers), pages 1014–1023,
  Berlin, Germany. Association for Computational Linguistics.

Feifan Liu, Abhyuday Jagannatha, and Hong Yu. 2019. Towards drug safety surveillance and
  pharmacovigilance: Current progress in detecting medication and adverse drug events
  from electronic health records. Drug Saf, 42:95–97.

Hongwei Liu and Yun Xu. 2017. A Deep Learning Way for Disease Name Representation and
  Normalization. In Natural Language Processing and Chinese Computing, pages 151–157.
  Springer International Publishing.

Yen-Fu Luo, Weiyi Sun, and Anna Rumshisky. 2019a. A Hybrid Normalization Method for
  Medical Concepts in Clinical Narrative using Semantic Matching. In AMIA Joint Summits
  on Translational Science proceedings, volume 2019, pages 732–740. American Medical
  Informatics Association.

Yen-Fu Luo, Weiyi Sun, and Anna Rumshisky. 2019b. MCN: A comprehensive corpus for medical
  concept normalization. Journal of Biomedical Informatics, pages 103–132.

Zulfat Miftahutdinov and Elena Tutubalina. 2019. Deep neural models for medical concept
  normalization in user-generated texts. In Proceedings of the 57th Annual Meeting of the
  Association for Computational Linguistics: Student Research Workshop, pages 393–399,
  Florence, Italy. Association for Computational Linguistics.

Shikhar Murty, Patrick Verga, Luke Vilnis, Irena Radovanovic, and Andrew McCallum. 2018.
  Hierarchical losses and new resources for fine-grained entity typing and linking. In
  Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics
  (Volume 1: Long Papers), pages 97–109, Melbourne, Australia. Association for
  Computational Linguistics.

Thanh Ngan Nguyen, Minh Trang Nguyen, and Thanh Hai Dang. 2018. Disease Named Entity
  Normalization Using Pairwise Learning To Rank and Deep Learning. Technical report, VNU
  University of Engineering and Technology.

Jinghao Niu, Yehui Yang, Siheng Zhang, Zhengya Sun, and Wensheng Zhang. 2019. Multi-task
  Character-Level Attentional Networks for Medical Concept Normalization. Neural Process
  Lett, 49(3):1239–1256.

Sameer Pradhan, Noémie Elhadad, Wendy Chapman, Suresh Manandhar, and Guergana Savova.
  2014. SemEval-2014 task 7: Analysis of clinical text. In Proceedings of the 8th
  International Workshop on Semantic Evaluation (SemEval 2014), pages 54–62, Dublin,
  Ireland. Association for Computational Linguistics.

Anna Rumshisky, Marzyeh Ghassemi, Tristan Naumann, Peter Szolovits, V.M. Castro, T.H.
  McCoy, and R.H. Perlis. 2016. Predicting early psychiatric readmission with natural
  language processing of narrative discharge summaries. Transl Psychiatry,
  6(10):e921–e921.

Abeed Sarker, Maksim Belousov, Jasper Friedrichs, Kai Hakala, Svetlana Kiritchenko,
  Farrokh Mehryary, Sifei Han, Tung Tran, Anthony Rios, Ramakanth Kavuluru, Berry de
  Bruijn, Filip Ginter, Debanjan Mahata, Saif M. Mohammad, Goran Nenadic, and Graciela
  Gonzalez-Hernandez. 2018. Data and systems for medication-related text classification
  and concept normalization from Twitter: insights from the Social Media Mining for
  Health (SMM4H)-2017 shared task. Journal of the American Medical Informatics
  Association, 25(10):1274–1283.

Guergana K. Savova, Anni R. Coden, Igor L. Sominsky, Rie Johnson, Philip V. Ogren, Piet
  C. De Groen, and Christopher G. Chute. 2008. Word sense disambiguation across two
  domains: Biomedical literature and clinical notes. Journal of Biomedical Informatics,
  41(6):1088–1100.

Guergana K. Savova, James J. Masanz, Philip V. Ogren, Jiaping Zheng, Sunghwan Sohn, Karin
  C. Kipper-Schuler, and Christopher G. Chute. 2010. Mayo clinical Text Analysis and
  Knowledge Extraction System (cTAKES): architecture, component evaluation and
  applications. Journal of the American Medical Informatics Association, 17(5):507–513.

Mark Stevenson, Yikun Guo, Abdulaziz Alamri, and Robert Gaizauskas. 2009. Disambiguation
  of biomedical abbreviations. In Proceedings of the BioNLP 2009 Workshop, pages 71–79,
  Boulder, Colorado. Association for Computational Linguistics.

Hanna Suominen, Sanna Salanterä, Sumithra Velupillai, Wendy W. Chapman, Guergana Savova,
  Noemie Elhadad, Sameer Pradhan, Brett R. South, Danielle L. Mowery, Gareth J.F. Jones,
  Johannes Leveling, Liadh Kelly, Lorraine Goeuriot, David Martinez, and Guido Zuccon.
  2013. Overview of the ShARe/CLEF eHealth Evaluation Lab 2013. In Information Access
  Evaluation. Multilinguality, Multimodality, and Visualization, pages 212–231. Springer
  Berlin Heidelberg.

Elena Tutubalina, Zulfat Miftahutdinov, Sergey Nikolenko, and Valentin Malykh. 2018.
  Medical concept normalization in social media posts with recurrent neural networks.
  Journal of Biomedical Informatics, 84:93–102.

Özlem Uzuner, Brett R. South, Shuying Shen, and Scott L. DuVall. 2011. 2010 i2b2/VA
  challenge on concepts, assertions, and relations in clinical text. Journal of the
  American Medical Informatics Association, 18(5):552–556.

Davy Weissenbacher, Abeed Sarker, Arjun Magge, Ashlynn Daughton, Karen O’Connor, Michael
  J. Paul, and Graciela Gonzalez-Hernandez. 2019. Overview of the fourth social media
  mining for health (SMM4H) shared tasks at ACL 2019. In Proceedings of the Fourth Social
  Media Mining for Health Applications (#SMM4H) Workshop & Shared Task, pages 21–30,
  Florence, Italy. Association for Computational Linguistics.

Fen Xia, Tie-Yan Liu, Jue Wang, Wensheng Zhang, and Hang Li. 2008. Listwise approach to
  learning to rank: theory and algorithm. In Proceedings of the 25th International
  Conference on Machine Learning, pages 1192–1199. Association for Computing Machinery.

Antonio Jimeno Yepes. 2017. Word embeddings and recurrent neural networks based on
  Long-Short Term Memory nodes in supervised biomedical word sense disambiguation.
  Journal of Biomedical Informatics, 73:137–147.

A     Appendices

A.1    Lucene-based dictionary look-up system

The Lucene-based dictionary look-up system consists
of the following components:

(a) Lucene index over the training data finds all
    CUI-less mentions that exactly match mention m.

(b) Lucene index over the training data finds CUIs
    of all training mentions that exactly match
    mention m.

(c) Lucene index over UMLS finds CUIs whose
    preferred name exactly matches mention m.

(d) Lucene index over UMLS finds CUIs where at
    least one synonym of the CUI exactly matches
    mention m.

(e) Lucene index over UMLS finds CUIs where
    at least one synonym of the CUI has high
    character overlap with mention m. To check
    the character overlap, we run the following
    three rules sequentially: token-level matching,
    fuzzy string matching with a maximum edit
    distance of 2, and character 3-gram matching.

See Figure A1 for the flow of execution across the
components. Whenever multiple CUIs are generated
by one of the components (a) to (e), they are fed,
along with the concept mention, to the BERT-based
reranker (f).

   During training, we used component (e) alone
instead of the combination of components (b)-(e)
to generate training instances for the BERT-based
reranker (f), as it generated many more training
examples and resulted in better performance on
the dev set. During evaluation, we used the whole
pipeline.

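[Figure A1: flowchart. A concept mention passes through the Lucene searches (a) to (e) in order; each outgoing edge is labeled with the number of matches.
 (a) Training data, CUI-less mention exact-match: 0 matches, go to (b); 1+ matches, predict CUI-less.
 (b) Training data, CUI mention exact-match: 0 matches, go to (c); 1 match, predict that CUI; 2+ matches, send the CUIs to (f).
 (c) UMLS preferred name exact-match: 0 matches, go to (d); 1 match, predict that CUI; 2+ matches, send the CUIs to (f).
 (d) UMLS synonyms exact-match: 0 matches, go to (e); 1 match, predict that CUI; 2+ matches, send the CUIs to (f).
 (e) UMLS synonyms partial-match: 0 matches, predict CUI-less; 1 match, predict that CUI; 2+ matches, send the CUIs to (f).
 (f) BERT-based reranker: outputs 1 CUI.]

Figure A1: Architecture of the Lucene-based dictionary look-up system. The edges out of a search process
indicate the number of matches necessary to follow the edge. Outlined nodes are terminal states that represent the
predictions of the system.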
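
The control flow in Figure A1 can be sketched as a short cascade. This is a minimal illustration and not the authors' code: the function `normalize`, the `Search` type, and the idea of passing each Lucene search in as a plain callable are all assumptions made for the example, and the handling of zero matches at component (e) follows our reading of the figure.

```python
from typing import Callable, List, Sequence

CUI_LESS = "CUI-less"

# Hypothetical interface: a search maps a mention to a list of matches.
Search = Callable[[str], List[str]]
Reranker = Callable[[str, List[str]], str]


def normalize(
    mention: str,
    cui_less_search: Search,         # stands in for component (a)
    cui_searches: Sequence[Search],  # stand in for components (b)-(e), in order
    rerank: Reranker,                # stands in for component (f)
) -> str:
    """Run the look-up cascade and return a single prediction."""
    # (a): any exact match with a CUI-less training mention ends the search.
    if cui_less_search(mention):
        return CUI_LESS

    # (b)-(e): stop at the first component that returns at least one CUI.
    for search in cui_searches:
        candidates = search(mention)
        if len(candidates) == 1:
            return candidates[0]                # single match: predict it
        if len(candidates) > 1:
            return rerank(mention, candidates)  # several matches: ask (f)

    # No component matched (assumed to map to CUI-less, as in the figure).
    return CUI_LESS
```

In this sketch the reranker is only called when a component yields two or more candidates, which mirrors the "2+" edges of the figure.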

                            Multi-class                     List-wise
                            AAP    TwADR-L  SMM4H-17  AAP    TwADR-L  SMM4H-17  MCN
 learning_rate              1e-4   5e-5     5e-5      5e-5   5e-5     3e-5      3e-5
 num_train_epochs           30     30       40        10     10       20        30
 per_gpu_train_batch_size   32     16       32        16     16       16        8
 save_steps                 487    301      166       976    301      333       250
 warmup_steps               1463   903      664       976    301      666       750
 list size (k)              -      -        -         10     20       10        30
 m1                         -      -        -         0.0    0.0      0.0       0.1
 m2                         -      -        -         0.2    0.2      0.2       0.2
 λ                          -      -        -         0.6    0.4      0.4       0.4
 µ                          -      -        -         0.6    0.4      0.4       0.8

Table A1: Hyper-parameters for BERT-based multi-class and list-wise classifiers. AAP=AskAPatient. Terms with
underscores are hyper-parameters in huggingface’s pytorch implementation of BERT.

A.2   Hyper-parameters
Table A1 shows the hyper-parameters for our models.
We use huggingface’s pytorch implementation
of BERT. We tune the hyper-parameters via grid
search and select the best BERT hyper-parameters
based on performance on the dev set.
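
As a rough illustration of this selection procedure, the sketch below enumerates every combination in a grid and keeps the one with the highest dev-set score. The function `grid_search` and the `dev_score` callback are hypothetical names; the paper does not list the values that were searched, so any grid passed in is an assumption of the caller.

```python
from itertools import product
from typing import Any, Callable, Dict, Mapping, Sequence, Tuple


def grid_search(
    grid: Mapping[str, Sequence[Any]],
    dev_score: Callable[[Dict[str, Any]], float],
) -> Tuple[Dict[str, Any], float]:
    """Return the configuration with the highest dev-set score."""
    names = list(grid)
    best_config: Dict[str, Any] = {}
    best_score = float("-inf")
    # Try every combination of the listed values, one per hyper-parameter.
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        # dev_score is assumed to train a model and evaluate it on the dev set.
        score = dev_score(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score
```

In practice `dev_score` would wrap a full training run, so the cost grows with the product of the grid sizes.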
   To keep the size of the candidate list equal to k
for every mention, we apply the following rules:
if the list does not contain the gold concept and
is already of length k, we inject the gold concept
and remove an incorrect candidate; if the list is
shorter than k, we inject the gold concept and the most
frequent concepts in the training set to reach k.
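
The two rules can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the name `fix_candidate_list` is hypothetical, the text does not say which incorrect candidate is removed (the sketch drops the lowest-ranked one), and de-duplication and truncation to k are added only to keep the example well defined.

```python
from typing import List, Sequence


def fix_candidate_list(
    candidates: Sequence[str],  # ranked output of the candidate generator
    gold: str,                  # gold concept for this mention
    frequent: Sequence[str],    # training concepts, most frequent first
    k: int,
) -> List[str]:
    """Return a candidate list that contains the gold concept, padded up to k."""
    if k < 1:
        raise ValueError("k must be at least 1")

    # Assumption: remove duplicates (keeping order) and cap the list at k.
    fixed = list(dict.fromkeys(candidates))[:k]

    if gold not in fixed:
        if len(fixed) == k:
            fixed.pop()         # assumption: drop the lowest-ranked candidate
        fixed.append(gold)      # inject the gold concept

    # Pad with the most frequent training concepts until the list has k items.
    for concept in frequent:
        if len(fixed) >= k:
            break
        if concept not in fixed:
            fixed.append(concept)
    return fixed
```

If `frequent` holds too few unseen concepts, the result can still be shorter than k; a caller would need to supply a long enough frequency list.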
