Journal of the American Medical Informatics Association, 28(3), 2021, 516–532
doi: 10.1093/jamia/ocaa269
Advance Access Publication Date: 15 December 2020
Research and Applications
Ambiguity in medical concept normalization: An analysis
of types and coverage in electronic health record datasets
Downloaded from https://academic.oup.com/jamia/article/28/3/516/6034899 by guest on 29 May 2021
Denis Newman-Griffis,1,2 Guy Divita,1 Bart Desmet,1 Ayah Zirikly,1 Carolyn P. Rosé,1,3 and Eric Fosler-Lussier2
1Rehabilitation Medicine Department, National Institutes of Health Clinical Center, Bethesda, Maryland, USA; 2Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio, USA; and 3Language Technologies Institute, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
Corresponding Author: Denis Newman-Griffis, 6707 Democracy Blvd, Suite 856, Bethesda, MD 20892, USA; denis.griffis@nih.gov
Received 11 February 2020; Revised 13 September 2020; Editorial Decision 11 October 2020; Accepted 17 November 2020
ABSTRACT
Objectives: Normalizing mentions of medical concepts to standardized vocabularies is a fundamental component of clinical text analysis. Ambiguity—words or phrases that may refer to different concepts—has been extensively researched as part of information extraction from biomedical literature, but less is known about the types and frequency of ambiguity in clinical text. This study characterizes the distribution and distinct types of ambiguity exhibited by benchmark clinical concept normalization datasets, in order to identify directions for advancing medical concept normalization research.

Materials and Methods: We identified ambiguous strings in datasets derived from the 2 available clinical corpora for concept normalization and categorized the distinct types of ambiguity they exhibited. We then compared observed string ambiguity in the datasets with potential ambiguity in the Unified Medical Language System (UMLS) to assess how representative available datasets are of ambiguity in clinical language.

Results: We found that
INTRODUCTION

Identifying the medical concepts within a document is a key step in the analysis of medical records and literature. Mapping natural language to standardized concepts improves interoperability in document analysis1,2 and provides the ability to leverage rich, concept-based knowledge resources such as the Unified Medical Language System (UMLS).3 This process is a fundamental component of diverse biomedical applications, including clinical trial recruitment,4,5 disease research and precision medicine,6–8 pharmacovigilance and drug repurposing,9,10 and clinical decision support.11 In this work, we identify distinct phenomena leading to ambiguity in medical concept normalization (MCN) and describe key gaps in current approaches and data for normalizing ambiguous clinical language.

Medical concept extraction has 2 components: (1) named entity recognition (NER), the task of recognizing where concepts are mentioned in the text, and (2) MCN, the task of assigning canonical identifiers to concept mentions, in order to unify different ways of referring to the same concept. While MCN has frequently been studied jointly with NER,12–14 recent research has begun to investigate challenges specific to the normalization phase of concept extraction.

Three broad challenges emerge in concept normalization. First, language is productive: practitioners and patients can refer to standardized concepts in diverse ways, requiring recognition of novel phrases beyond those in controlled vocabularies.15–18 Second, a single phrase can describe multiple concepts in a way that is more (or different) than the sum of its parts.19,20 Third, a single natural language form can be used to refer to multiple distinct concepts, thus yielding ambiguity.

Word sense disambiguation (WSD) (which often includes phrase disambiguation in the biomedical setting) is thus an integral part of MCN. WSD has been extensively studied in natural language processing methodology,21–23 and ambiguous words and phrases in biomedical literature have been the focus of significant research.24–30 WSD research in electronic health record (EHR) text, however, has focused almost exclusively on abbreviations and acronyms.31–35 A single dataset of 50 ambiguous strings in EHR data has been developed and studied25,36 but is not freely available for current research. Two large-scale EHR datasets, the ShARe corpus14 and a dataset by Luo et al,37 have been developed for medical concept extraction research and have been significant drivers in MCN research through multiple shared tasks.14,38–41 However, their role in addressing ambiguity in clinical language has not yet been explored.

Objective

To understand the role of benchmark MCN datasets in designing and evaluating methods to resolve ambiguity in clinical language, we identified ambiguous strings in 3 benchmark EHR datasets for MCN and analyzed the causes of ambiguity they capture. Using lexical semantic theory and the taxonomic and semantic relationships between concepts captured in the UMLS as a guide, we developed a typology of ambiguity in clinical language and categorized each string in terms of what type of ambiguity it captures. We found that multiple distinct phenomena cause ambiguity in clinical language and that the existing datasets are not sufficient to systematically capture these phenomena. Based on our findings, we identified 3 key gaps in current research on MCN in clinical text: (1) a lack of representative data for ambiguity in clinical language, (2) a need for new evaluation strategies for MCN that account for different kinds of relationships between concepts, and (3) underutilization of the rich semantic resources of the UMLS in MCN methodologies. We hope that our findings will spur additional development of tools and resources for resolving medical concept ambiguity.

Contributions of this work

• We demonstrate that existing MCN datasets in EHR data are not sufficient to capture ambiguity in MCN, either for evaluating MCN systems or developing new MCN models. We analyze the 3 available MCN EHR datasets and show that only a small portion of mention strings have any ambiguity within each dataset, and that these observed ambiguities only capture a small subset of potential ambiguity, in terms of the concept unique identifiers (CUIs) that match to the strings in the UMLS. Thus, new datasets focused on ambiguity in clinical language are needed to ensure the effectiveness of MCN methodologies.
• We show that current MCN EHR datasets do not provide sufficiently representative normalization data for effective generalization, in that they have very few mention strings in common with one another and little overlap in annotated CUIs. Thus, MCN research should include evaluation on multiple datasets, to measure generalization power.
• We present a linguistically motivated and empirically validated typology of distinct phenomena leading to ambiguity in medical concept normalization, and analyze all ambiguous strings within the 3 current MCN EHR datasets in terms of these ambiguity phenomena. We demonstrate that multiple distinct phenomena affect MCN ambiguity, reflecting a variety of semantic and linguistic relationships between terms and concepts that inform both prediction and evaluation methodologies for medical concept normalization. Thus, MCN evaluation strategies should be tailored to account for different relationships between predicted labels and annotated labels. Further, MCN methodologies could be significantly enhanced by greater integration of the rich semantic resources of the UMLS.

BACKGROUND AND SIGNIFICANCE

Linguistic phenomena underpinning clinical ambiguity

Lexical semantics distinguishes between 2 types of lexical ambiguity: homonymy and polysemy.42,43 Homonymy occurs when 2 lexical items with separate meanings have the same form (eg, "cold" as reference to a cold temperature or the common cold). Polysemy occurs when one lexical item diverges into distinct but related meanings (eg, "coat" for garment or coat of paint). Polysemy can in turn be the result of different phenomena, including default interpretations ("drink" liquid or alcohol), metaphors, and metonymy (usage of a literal association between 2 concepts in a specified domain [eg, "Foley catheter on 4/12"] to indicate a past catheterization procedure).42,43 While metaphors are dispreferred in the formal setting of clinical documentation, the telegraphic nature of medical text44 lends itself to metonymy by using shorter phrases to refer to more specific concepts, such as procedures.45

Mapping between biomedical concepts and terms: The UMLS

The UMLS is a large-scale biomedical knowledge resource that combines information from over 140 expert-curated biomedical vocabularies and standards into a single machine-readable resource. One central component of the UMLS that directly informs our analysis of ambiguity is the Metathesaurus, which groups together synonyms
(distinct phrases with the same meaning [eg, "common cold" and "acute rhinitis"]) and lexical variants (modifications of the same phrase [eg, "acute rhinitis" and "rhinitis, acute"]) of biomedical terms and assigns them a single CUI. The diversity of vocabularies included in the UMLS (each designed for a unique purpose), combined with the expressiveness of human language, means that many different terms can be associated with any one concept (eg, the concept C0009443 is associated with the terms cold, common cold, and acute rhinitis, among others), and any term may be used to refer to different concepts in different situations (eg, cold may also refer to C0009264 Cold Temperature in addition to C0009443, as well as to a variety of other Metathesaurus concepts), leading to ambiguity. These mappings between terms and concepts are stored in the MRCONSO UMLS table. In addition to the canonical terms stored in MRCONSO, the UMLS also provides lexical variants of terms, including morphological stemming, inflectional variants, and agnostic word order, provided through the SPECIALIST Lexicon and suite of tools.46,47 Lexical variants of English-language terms from MRCONSO are provided in the MRXNS_ENG UMLS table. The MCN datasets used in this study were annotated for mentions of concepts in 2 widely used vocabularies integrated into the UMLS: (1) the U.S. edition of the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) vocabulary, a comprehensive clinical healthcare terminology, and (2) RxNorm, a standardized nomenclature for clinical drugs; we thus restricted our analysis to data from these 2 vocabularies.

Sense relations and ontological distinctions in the UMLS

In addition to mappings from terms to concepts, the UMLS Metathesaurus includes information on semantic relationships between concepts, such as hierarchical relationships that often correspond to lexical phenomena such as hypernymy and hyponymy, as well as meronymy and holonymy in biological and chemical structures.42 The UMLS has previously been observed to include not only fine-grained ontological distinctions, but also purely epistemological distinctions such as associated findings (eg, C0748833 Open fracture of skull vs C0272487 Open skull fracture without intracranial injury).48 This yields high productivity for assignment of different CUIs in cases of ontological distinction, such as reference to "cancer" to mean either general cancer disorders or a specific type of cancer in a context such as a prostate exam, as well as what Cruse42 termed propositional synonymy (ie, different senses that yield the same propositional logic interpretation). Additionally, the difficulty of interterminology mapping at scale means that synonymous terms are occasionally mapped to different CUIs.49

The role of representative data for clinical ambiguity

Development and evaluation of models for any problem are predicated on the availability of representative data.50 Prior research has highlighted the frequency of ambiguity in biomedical literature24,51 and broken biomedical ambiguity into 3 broad categories of ambiguous terms, abbreviations, and gene names,52 but an in-depth characterization of the types of ambiguity relevant to clinical data has not yet been performed. In order to understand what can be learned from the available data for ambiguity and identify areas for future research, it is critical to analyze both the frequency and the types of ambiguity that are captured in clinical datasets.

MATERIALS AND METHODS

We performed both quantitative and qualitative evaluations of ambiguity in 3 benchmark MCN datasets of EHR data. In this section, we first introduce the datasets analyzed in this work and define our methods for measuring ambiguity in the datasets and in the UMLS. We then describe 2 quantitative analyses of ambiguity measurements within individual datasets and a generalization analysis across datasets. Finally, we present our qualitative analysis of ambiguity types in MCN datasets.

MCN datasets

The effect of ambiguity in normalizing medical concepts has been researched significantly more in biomedical literature than in clinical data. In order to identify knowledge gaps and key directions for MCN in the clinical setting, where ambiguity may have direct impact on automated tools for clinical decision support, we studied the 3 available English-language EHR corpora with concept normalization annotations: SemEval-2015 Task 14,14 CUILESS2016,19 and n2c2 2019 Track 3.37,41 MCN annotations in these datasets are represented as UMLS CUIs for the concepts being referred to in the text; as MCN evaluation is performed based on selection of the specific CUI a given mention is annotated with, we describe dataset annotation and our analyses in terms of the CUIs used rather than the concepts they refer to. Details of these datasets are presented in Table 1.

SemEval-2015

Task 14 of the SemEval-2015 competition investigated clinical text analysis using the ShARe corpus, which consists of 531 clinical documents from the MIMIC (Medical Information Mart for Intensive Care) dataset54 including discharge summaries, echocardiogram, electrocardiogram, and radiology reports. Each document was annotated for mentions of disorders and normalized using CUIs from SNOMED CT.53 The documents were annotated by 2 professional medical coders, with high interannotator agreement of 84.6% CUI matches for mentions with identical spans, and all disagreements were adjudicated to produce the final dataset.38,39 Datasets derived from subsets of the ShARe corpus have been used as the source for several shared tasks.14,39,40,55 The full corpus was used for a SemEval-2015 shared task on clinical text analysis,14 split into 298 documents for training, 133 for development, and 100 for test. In order to preserve the utility of the test set as an unseen data sample for continuing research, we exclude its 100 documents from our analysis, and only analyze the training and development documents.

CUILESS2016

A significant number of mentions in the ShARe corpus were not mapped to a CUI in the original annotations, either because these mentions did not correspond to Disorder concepts in the UMLS or because they would have required multiple disorder concepts to annotate.14 These mentions were later reannotated in the CUILESS2016 dataset, with updated guidelines allowing annotation using any CUI in SNOMED CT (regardless of semantic type) and specified rules for composition.19,56 These data were split into training and development sets, corresponding to the training and development splits in the SemEval-2015 shared task; the SemEval-2015 test set was not annotated as part of CUILESS2016.
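For our purposes, the annotations in these corpora reduce to (mention string, CUI) pairs. As a concrete illustration of the dataset ambiguity measure defined in the Methods (the number of unique CUIs associated with a normalized mention string), the following sketch applies the paper's two preprocessing steps and aggregates CUIs per string. The function names and the toy sample assignments are our own illustrations, not drawn from the datasets:

```python
from collections import defaultdict

DETERMINERS = {"a", "an", "the"}

def normalize_mention(text: str) -> str:
    """Minimal preprocessing: lowercase and drop determiners."""
    return " ".join(t for t in text.lower().split() if t not in DETERMINERS)

def dataset_ambiguity(samples):
    """Map each normalized mention string to the number of distinct CUIs
    it is annotated with across all samples in a dataset."""
    cuis_by_string = defaultdict(set)
    for mention, cui in samples:
        cuis_by_string[normalize_mention(mention)].add(cui)
    return {s: len(cuis) for s, cuis in cuis_by_string.items()}

# Toy samples (CUI assignments are illustrative):
samples = [
    ("The cold", "C0009443"),  # common cold
    ("cold", "C0009264"),      # cold temperature
    ("the EKG", "C9999999"),   # hypothetical CUI for illustration
    ("EKG", "C9999999"),
]
amb = dataset_ambiguity(samples)
# "cold" aggregates 2 distinct CUIs (ambiguous); "ekg" aggregates 1
```

A string is counted as ambiguous at the dataset level exactly when this count exceeds 1.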
Table 1. Details of MCN datasets analyzed for ambiguity, broken down by data subset

                       ShARe corpus
                       SemEval-2015                         CUILESS2016                          n2c2 2019
                       Training  Development  Combined      Training  Development  Combined      Training
UMLS version           2011AA53                             2016AA19                             2017AB37
Source vocabularies    SNOMED CT (United States)            SNOMED CT (United States)            SNOMED CT (United States), RxNorm
Documents              298       133          431           298       133          431           100
Samples                11 554    8003         19 557        3468      1929         5397          6684
CUI-less samples       3480      1933         5413          7         1            8             368
Unique strings         3654      2477         5064          1519      750          2011          3230
Unique CUIs            1356      1144         1871          1384      639          1738          2331

The number of CUI-less samples, which were excluded from our analysis, is provided for each dataset.
CUI: concept unique identifier; MCN: medical concept normalization; SNOMED CT: Systematized Nomenclature of Medicine Clinical Terms; UMLS: Unified Medical Language System.
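The simplest of the candidate matching strategies described in the Methods (minimal preprocessing against the MRCONSO table) can be sketched as a dictionary lookup over the pipe-delimited MRCONSO.RRF file, restricted to English rows from the annotation source vocabularies in Table 1. Field positions follow the standard MRCONSO.RRF layout (CUI in field 0, LAT in field 1, SAB in field 11, STR in field 14); the helper names and sample rows below are illustrative assumptions, not real UMLS content:

```python
from collections import defaultdict

DETERMINERS = {"a", "an", "the"}

def normalize(text: str) -> str:
    # Minimal preprocessing: lowercase and drop determiners.
    return " ".join(t for t in text.lower().split() if t not in DETERMINERS)

def build_cui_index(mrconso_lines, source_vocabs=("SNOMEDCT_US", "RXNORM")):
    """Index normalized MRCONSO strings -> set of CUIs, keeping only English
    rows (LAT == 'ENG') from the annotation source vocabularies."""
    index = defaultdict(set)
    for line in mrconso_lines:
        fields = line.rstrip("\n").split("|")
        cui, lat, sab, string = fields[0], fields[1], fields[11], fields[14]
        if lat == "ENG" and sab in source_vocabs:
            index[normalize(string)].add(cui)
    return index

def umls_ambiguity(index, mention: str) -> int:
    """Number of candidate CUIs a mention string matches under this strategy."""
    return len(index.get(normalize(mention), set()))

# Toy MRCONSO rows (18 pipe-delimited fields; values are illustrative):
rows = [
    "C0009443|ENG|P|L000|PF|S000|Y|A000||||SNOMEDCT_US|PT|82272006|Common cold||N|",
    "C0009443|ENG|S|L001|PF|S001|N|A001||||SNOMEDCT_US|SY|82272006|Cold||N|",
    "C0009264|ENG|P|L002|PF|S002|Y|A002||||SNOMEDCT_US|PT|84162001|Cold||N|",
]
idx = build_cui_index(rows)
# "the cold" normalizes to "cold", which matches 2 distinct CUIs
```

Restricting on the SAB field mirrors the paper's restriction of ambiguity counts to CUIs linked to the vocabularies actually used for annotation.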
n2c2 2019

As the SemEval-2015 and CUILESS2016 datasets only included annotations for mentions of disorder-related concepts, Luo et al37 annotated a new corpus to provide mention and normalization data for a wider variety of concepts; these data were then used for a 2019 n2c2 shared task on concept normalization.41 The corpus includes 100 discharge summaries drawn from the 2010 i2b2/VA shared task on clinical concept extraction, for which documents from multiple healthcare institutions were annotated for all mentions of problems, treatments, and tests.57 All annotated mentions in the 100 documents chosen were normalized using CUIs from SNOMED CT and RxNorm; 2.7% were annotated as "CUI-less." All mentions were dually annotated with an adjudication phase; preadjudication interannotator agreement was a 67.69% CUI match (note this figure included comparison of mention bounds in addition to CUI matches, lowering measured agreement; CUI-level agreement alone was not evaluated). Luo et al37 split the corpus into training and test sets. As with the SemEval-2015 data, we only analyzed the training set in order to preserve the utility of the n2c2 2019 test set as an unseen data sample for evaluating generalization in continuing MCN research.

Measuring ambiguity

We utilize 2 different ways of measuring the ambiguity of a string: dataset ambiguity, which measures the amount of observed ambiguity for a given medical term as labeled in an MCN dataset, and UMLS ambiguity, which measures the amount of potential ambiguity for the same term by using the UMLS as a reference for normalization. A key desideratum for developing and evaluating statistical models of MCN, which we demonstrate is not achieved by benchmark datasets in practice, is that the ambiguity observed in research datasets is as representative as possible of the potential ambiguity that may be encountered in medical language "in the wild." For example, the term cold can be used as an acronym for Chronic Obstructive Lung Disease (C0024117), but if no datasets include examples of cold being used in this way, we are unable to train or evaluate the effectiveness of an MCN model for normalizing "cold" to this meaning. The problem becomes more severe if other senses of cold, such as C0009264 Cold Temperature, C0234192 Cold Sensation, or C0010412 Cold Therapy are also not included in annotated datasets. While exhaustively capturing instances of every sense of a given term in natural utterances is impractical at best, significant gaps between observed and potential ambiguity impose a fundamental limiting factor on progress in MCN research.

We defined dataset ambiguity, our measure of observed ambiguity, as the number of unique CUIs associated with a given string when aggregated over all samples in a dataset. In order to account for minor variations in EHR orthography and annotations, we used 2 steps of preprocessing on the text of all medical concept mentions in each dataset: lowercasing and dropping determiners (a, an, and the).

To measure potential ambiguity, we defined UMLS ambiguity as the number of CUIs a string is associated with in the UMLS Metathesaurus. While the Metathesaurus is necessarily incomplete,15,58,59 and the breadth and specificity of concepts covered means that useful term-CUI links are often missing,60 it nonetheless functions as a high-coverage heuristic to measure the number of senses a term may be used to refer to. However, the expressiveness of natural language means that direct dictionary lookup of any given string in the Metathesaurus is likely to miss valid associated CUIs: linguistic phenomena such as coreference allow seemingly general strings to take very specific meanings (eg, "the failure" referring to a specific instance of heart failure); other syntactic phenomena such as predication, splitting known strings with a copula (see Figure 1 for examples), and inflection (eg, "defibrillate" vs "defibrillation" vs "defibrillated") lead to further variants. We therefore use 3 strategies to match observed strings with terms in the UMLS and the concepts that they are linked to (referred to as candidate matching strategies), with increasing degrees of inclusivity across term variations, to measure the number of CUIs a medical concept string may be matched to in the UMLS:

• Minimal preprocessing—each string was preprocessed using the 2 steps described previously (lowercasing and dropping determiners; eg, "the EKG" becomes "ekg"), and compared with rows of the MRCONSO table of the UMLS to identify the number of unique CUIs canonically associated with the string. The same minimal preprocessing steps were applied to the String field of MRCONSO rows for matching.
• Lexical variant normalization—each string was first processed with minimal preprocessing, and then further processed with the luiNorm tool,61 a software package developed to map lexical variants (eg, defibrillate, defibrillated, defibrillation) to the same string. (Mapping lexical variants to the same underlying string is typically referred to as "normalization" in the natural language processing literature; for clarity between concept normalization and string normalization in this article, we refer to "lexical variant normalization" for this aspect of string processing throughout.) luiNorm-processed strings were then compared with prepopulated lexical variants in the MRXNS_ENG table of the UMLS to identify the set of associated CUIs. We used the release of luiNorm that corresponded to the UMLS version each dataset was annotated with (2011 for SemEval-2015, 2016 for CUILESS2016, and 2017 for n2c2 2019), and compared with the MRXNS_ENG table of the corresponding UMLS release.
• Word match—each string was first processed with minimal preprocessing; we then queried the UMLS search application programming interface for the preprocessed string, using the word-level search option,62 which searches for matches in the Metathesaurus with each of the words in the query string (ie, "Heart disease, acute" will match with strings including any of the words heart, disease, or acute). We counted the number of unique CUIs returned as our measure of ambiguity.

Figure 1. Examples of mismatch between medical concept mention string (bold underlined text) and assigned concept unique identifier (shown under the mention), due to (A) coreference and (B) predication. The right side of each subfigure shows the results of querying the Unified Medical Language System (UMLS) for the mention string with exact match (top) and the preferred string for the annotated concept unique identifier (bottom).

In all cases, since each dataset was only annotated using CUIs linked to specific vocabularies in the UMLS (SNOMED CT for all 3 datasets, plus RxNorm for n2c2 2019), we restricted our ambiguity analysis to the set of unique UMLS CUIs linked to the source vocabularies used for annotation. Thus, if a string in SemEval-2015 was associated with 2 CUIs linked to SNOMED CT and an additional CUI linked only to International Classification of Diseases–Ninth Revision (and therefore not eligible for use in SemEval-2015 annotation), we only counted the 2 CUIs linked to SNOMED CT in measuring its ambiguity.

Quantitative analyses: Ambiguity measurements and generalization

Ambiguity measurements within datasets

Given the set of unique mention strings in each MCN dataset, we measured each string's ambiguity in terms of dataset ambiguity, UMLS ambiguity with minimal preprocessing, UMLS ambiguity with lexical variant normalization, and UMLS ambiguity with word match, using the version of the UMLS each dataset was originally annotated with. We also evaluated the coverage of the UMLS matching results, in terms of whether they included the CUIs associated with each string in the dataset. For compositional annotations in CUILESS2016, we treated a label as covered if any of its component CUIs were included in the UMLS results. Finally, to establish concordance with prior findings of greater ambiguity from shorter terms,63 we evaluated the correlation between string length and ambiguity measurements, using linear regression with fit measured by the r2 statistic. We used 2 different measures of string length: (1) number of tokens in the string (calculated using SpaCy64 tokenization) and (2) number of characters in the string.

Cross-dataset generalization analysis

In order to assess how representative the annotated MCN datasets are for generalizing to unseen data, we evaluated ambiguity in 3 kinds of cross-dataset generalization: (1) from training to development splits in a single dataset (using SemEval-2015 and CUILESS2016), (2) between different datasets drawn from the same corpus (comparing SemEval-2015 to CUILESS2016), and (3) between datasets from different corpora (comparing SemEval-2015 and CUILESS2016 to n2c2 2019). In each of these settings, we first identified the portion of strings shared between the datasets being compared, a key component of generalization, and then analyzed the CUIs associated with these shared strings in each dataset. Shared strings were analyzed along 3 axes to measure the generalization of MCN annotations between datasets: (1) differences in ambiguity type (for strings which were ambiguous in both datasets), (2) overlap in the annotated CUI sets, and (3) the coverage of word-level UMLS match for retrieving the combination of CUIs present between the 2 datasets. Finally, we broke down our analysis of CUI set overlap to identify strings whose dataset ambiguity increases when combining datasets and strings with fully disjoint annotated CUI sets.

Qualitative analysis of ambiguous strings

Inspired by methodological research demonstrating that different modeling strategies are appropriate for phenomena such as metonymy65,66 and hyponymy,67–71 we analyzed the ambiguous strings in each dataset in terms of the following lexical phenomena: homonymy, polysemy, hyponymy, meronymy, co-taxonomy (sibling relationships), and metonymy (definitions provided in discussion of our ambiguity typology in the Results).42,43 To measure the ambiguity captured by the available annotations, we performed our analysis only at the level of dataset ambiguity (ie, only using the CUIs associated with the string in a single dataset). For each ambiguous string in a dataset, we manually reviewed the string, its associated CUIs in the dataset in question, and the medical concept mention samples where the string occurs in the dataset, and answered the following 2 questions:

Question 1: How are the different CUIs associated with this string related to one another?

This question regarded only the set of annotated CUIs and was agnostic to specific samples in the dataset. We evaluated 2 aspects of the relationship or relationships between these CUIs: (1) which (if any) of the previous lexical phenomena was most representative of the relationship between the CUIs and (2) if any phenomenon particular to medical language was a contributing factor. We conducted this analysis only in terms of the high-level phenomena outlined previously, rather than leveraging the formal semantic relationships between CUIs in the UMLS; while these relationships are powerful for downstream applications, they include a variety of nonlinguistic relationships and were too fine-grained to group a small set of ambiguous strings informatively.

Question 2: Are the CUI-level differences reflected in the annotations?

Given the breadth of concepts in the UMLS, and the subjective nature of annotation, we analyzed whether the CUI assignments in the dataset samples were meaningfully different, and if they reflected the sample-agnostic relationship between the CUIs.

Ambiguity annotations

Based on our answers to these questions, we determined 3 variables for each string:

• Category—the primary linguistic or conceptual phenomenon underlying the observed ambiguity;
• Subcategory—the biomedicine-specific phenomenon contributing to a pattern of ambiguity; and
• Arbitrary—the determination of whether the CUIs' use reflected their conceptual difference.

Annotation was conducted by 4 authors (D.N.-G., G.D., B.D., A.Z.) in 3 phases: (1) initial categorization of the ambiguous strings in n2c2 2019 and SemEval-2015, (2) validation of the resulting typology through joint annotation and adjudication of 30 random ambiguous strings from n2c2 2019, and (3) reannotation of all datasets with the finalized typology. For further details, please see the Supplementary Appendix.

Handling compositional CUIs in CUILESS2016

Compositional annotations in CUILESS2016 presented 2 variables for ambiguity analysis: single- or multiple-CUI annotations, and ambiguity of annotations across samples. We categorized each string in CUILESS as having (1) unambiguous single-CUI annotation, (2) unambiguous multi-CUI annotation, (3) ambiguous single-CUI annota-

omitted dataset annotations that were not found in the corresponding version of the UMLS (including "CUI-less," annotation errors, and CUIs remapped within the UMLS); Table 2 provides the number of these annotations and the number of strings analyzed. We observed 5 main findings from our results:

Observed dataset ambiguity is not representative of potential UMLS ambiguity. Only 2%-14% of strings were ambiguous at the dataset level (across SemEval-2015, CUILESS2016, and n2c2 2019) (ie, these strings were associated with more than 1 CUI within a single dataset). However, many more strings exhibited potential ambiguity, as measured in the UMLS with our 3 candidate matching strategies. Using minimal preprocessing, in the cases in which at least 1 CUI was identified for a query string, 13%-23% of strings were ambiguous; lexical variant normalization increased this to 17%-28%, and word matching yielded 68%-88% ambiguous strings. The difference was most striking in n2c2 2019: only 58 strings were ambiguous in the dataset (after removing "CUI-less" samples), but 2,119 strings had potential ambiguity as measured with word matching, a 37-fold increase.

Many dataset strings do not match any CUIs. A total of 40%-43% of strings in SemEval-2015 and n2c2 did not yield any CUIs when using minimal preprocessing to match to the UMLS (74% in CUILESS2016). Lexical variant normalization increased coverage somewhat, with 38%-41% of strings failing to match to the UMLS in SemEval-2015 and n2c2 (70% in CUILESS2016); word-level search had much better coverage, only yielding empty results for 23%-27% of CUIs in SemEval-2015 and n2c2 and 57% in CUILESS2016. As CUILESS2016 strings often combine multiple concepts, matching statistics are necessarily pessimistic for this dataset.

UMLS matching misses a significant portion of annotated CUIs. As shown in Figure 2, for the subset of SemEval-2015 and n2c2 2019 strings in which any of the UMLS matching strategies yielded at least 1 candidate CUI, 8%-23% of the time the identified candidate sets did not include any of the CUIs with which those strings were actually annotated in the datasets. This was consistent for both strings returning only 1 CUI and strings returning multiple CUIs. The complex mentions in CUILESS2016 again yielded lower coverage: 24%-30% of strings returning only 1 CUI did not return a correct one and 25%-42% of strings returning multiple CUIs missed all of the annotated CUIs. This indicates that coverage of both synonyms and lexical variants in the UMLS remains an active challenge for clinical language.

High coverage yields high ambiguity. Table 2 provides statistics on the number of CUIs returned for strings from the 3 datasets in which any of the UMLS candidate matching strategies yielded more than 1 CUI. Both minimal preprocessing and lexical variant normalization yield a median CUI count per ambiguous string of 2, although higher maxima (maximum 11 CUIs with minimal preprocessing, maximum 20 CUIs with lexical variant normalization) skew the mean number of CUIs per string higher. By contrast, word matching, which achieves the best coverage of dataset strings
tion, or (4) ambiguous annotations with both single- and multi-CUI by far, ranges in median ambiguity from 8 in CUILESS2016 to 20 in
labels. The latter 2 categories were considered ambiguous for our n2c2 2019, with maxima over 100 CUIs in all 3 datasets. Thus, ef-
analysis. fectively choosing between a large number of candidates is a key
challenge for high-coverage MCN.
Character-level string length is weakly negatively correlated with
RESULTS ambiguity measures. Following prior findings that shorter terms
Quantitative measurements of string ambiguity tend to be more ambiguous in biomedical literature,63 we observed
Ambiguity within individual datasets r2 values above 0.5 between character-based string length and data-
Figure 2 presents the results of our string-level ambiguity analysis set ambiguity, UMLS ambiguity with minimal preprocessing, and
across the 3 datasets. For a fair comparison with the UMLS, we UMLS ambiguity with lexical variant normalization in all 3 EHR522 Journal of the American Medical Informatics Association, 2021, Vol. 28, No. 3
Figure 2. String-level ambiguity in medical concept normalization (MCN) datasets, by method of measuring ambiguity. (A) Measurements of observed string ambiguity in MCN datasets, in terms of strings that are annotated with exactly 1 concept unique identifier (CUI) (unambiguous) or more than 1 (ambiguous). (B) Measurements of potential string ambiguity in the Unified Medical Language System (UMLS), using minimal preprocessing, lexical variant normalization, and word match strategies to identify candidate CUIs. Shown below each UMLS matching chart is the coverage of dataset CUIs yielded by each matching strategy, broken down by ambiguous (A) and unambiguous (U) strings. Coverage is calculated as the intersection between the CUIs matched to a string in the UMLS and the set of CUIs that string is annotated with in the dataset.
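The coverage notion defined in the caption (intersection of UMLS-matched CUIs with dataset-annotated CUIs) can be sketched as a small helper. This is an illustrative approximation, not the authors' implementation; the CUIs in the example are taken from elsewhere in the paper but paired arbitrarily.

```python
def cui_coverage(matched_cuis, annotated_cuis):
    """Fraction of a string's annotated CUIs recovered by a matching strategy."""
    annotated = set(annotated_cuis)
    if not annotated:
        return None  # string had only "CUI-less"/OOV annotations
    return len(set(matched_cuis) & annotated) / len(annotated)

# Word match found both heart-failure CUIs; the dataset used 1 of them plus another.
cui_coverage({"C0018801", "C0023212"}, {"C0023212", "C0011581"})  # -> 0.5
```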
datasets. Word-level match yielded very weak correlation (r² = 0.39 for SemEval-2015, 0.23 for CUILESS2016, and 0.39 for n2c2). Token-level measures of string length followed the same trends as the character-level measure, although typically with lower r². Full results of these analyses are provided in Supplementary Table 1 and Supplementary Figures 1–3.

within-corpus setting (comparing SemEval-2015 to CUILESS2016), and cross-corpus setting (comparing SemEval-2015 and CUILESS2016 to n2c2 2019). We observed 3 main findings in our results:

The majority of strings are unique to the dataset they appear in. The overlap in sets of medical concept mention strings between datasets ranged from
Table 2. Results of string-level ambiguity analysis, as measured in MCN datasets (observed ambiguity) and in the UMLS with 3 candidate matching strategies (potential ambiguity)

                                                   SemEval-2015   CUILESS2016   n2c2 2019
UMLS version                                       2011AA         2016AA        2017AB
Dataset
  Total strings                                    3203           2006          3230
  Ambiguous strings before OOV filtering           148 (5)        273 (14)      62 (2)
  Strings with OOV annotations                     48             1             99
  OOV annotations only (omitted)                   29             1             95
  Strings with at least 1 CUI                      3174           2005          3135
  Ambiguous strings after OOV filtering            132 (4)        273 (14)      58 (2)
  Minimum/median/maximum ambiguity                 2/2/6          2/2/24        2/2/3
  Mean ambiguity                                   2.1 ± 0.5      2.9 ± 2.5     2.1 ± 0.3
Minimal preprocessing
  Strings with at least 1 CUI                      1808 (57)      530 (26)      1874 (60)
  Ambiguous strings                                230 (13)       97 (18)       423 (23)
  Minimum/median/maximum ambiguity                 2/2/11         2/2/11        2/2/11
  Mean ambiguity                                   2.5 ± 1.1      2.7 ± 1.5     2.5 ± 1.2
Lexical variant normalization
  Strings with at least 1 CUI                      1882 (59)      592 (30)      1942 (62)
  Ambiguous strings                                318 (17)       137 (23)      550 (28)
  Minimum/median/maximum ambiguity                 2/2/17         2/2/18        2/2/20
  Mean ambiguity                                   2.8 ± 1.9      3.1 ± 2.5     2.9 ± 2.1
Word match
  Strings with at least 1 CUI                      2314 (73)      877 (44)      2414 (77)
  Ambiguous strings                                1774 (77)      594 (68)      2119 (88)
  Minimum/median/maximum ambiguity                 2/9/123        2/8/107       2/20/120
  Mean ambiguity                                   20.9 ± 25.5    19.5 ± 24.5   31.1 ± 29.2

Values are n, n (%), or mean ± SD, unless otherwise indicated. All dataset annotations that were not found in the corresponding version of the UMLS (OOVs) were omitted from this analysis; any strings that had only OOV annotations in the dataset were omitted entirely. For each of the 3 UMLS matching strategies, the number of strings for which at least 1 CUI was identified is provided along with the corresponding percentage of non-OOV dataset strings. The number of ambiguous strings in each subset (ie, strings for which more than 1 CUI was matched after OOV annotations were filtered out) is given along with the corresponding percentage of strings for which at least 1 CUI was identified. Ambiguity statistics are calculated on ambiguous strings only and report minimum, median, maximum, mean, and standard deviation of number of CUIs identified for the string.
CUI: concept unique identifier; MCN: medical concept normalization; OOV: out of vocabulary; UMLS: Unified Medical Language System.
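The pipeline summarized in Table 2 can be sketched end to end: match a mention to candidate CUIs under one of the 3 strategies, then summarize candidate-set sizes over the ambiguous strings. Everything below is an illustrative approximation, not the authors' code: the toy UMLS_INDEX entries are invented, and lexical_normalize is a deliberately crude stand-in for luiNorm-style lexical variant normalization.

```python
from statistics import mean, median, pstdev

# Toy string -> CUI index standing in for a UMLS release (entries invented).
UMLS_INDEX = {
    "heart failure": {"C0018801"},
    "left-sided heart failure": {"C0023212"},
    "elevated": {"C0205250", "C0439775"},
}

def minimal_preprocess(mention):
    """Minimal preprocessing: lowercase and collapse whitespace."""
    return " ".join(mention.lower().split())

def lexical_normalize(mention):
    """Crude stand-in for lexical variant normalization: strip a plural -s.
    The real luiNorm tool handles far more variation (inflection, case, order)."""
    return " ".join(w[:-1] if w.endswith("s") and len(w) > 3 else w
                    for w in minimal_preprocess(mention).split())

def match_exact(mention, index, norm):
    """Minimal-preprocessing / lexical-variant strategies: exact lookup after norm."""
    return index.get(norm(mention), set())

def match_words(mention, index):
    """Word-level search: union of CUIs for any entry sharing a word with the mention."""
    words = set(minimal_preprocess(mention).split())
    cuis = set()
    for name, entry_cuis in index.items():
        if words & set(name.split()):
            cuis |= entry_cuis
    return cuis

def ambiguity_stats(candidate_sets):
    """Table 2-style summary over strings that matched more than 1 CUI."""
    sizes = sorted(len(c) for c in candidate_sets if len(c) > 1)
    return {"n": len(sizes), "min": sizes[0], "median": median(sizes),
            "max": sizes[-1], "mean": mean(sizes), "sd": pstdev(sizes)}
```

On this toy index, match_words("failure", UMLS_INDEX) already returns 2 CUIs where exact lookup returns none, mirroring the coverage-versus-ambiguity trade-off the table quantifies.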
Most shared strings have differences in their annotated CUIs. In all comparisons other than the SemEval-2015 training and development datasets, over 45% of the strings shared between a pair of datasets were annotated with at least 1 CUI that was only present in 1 of the 2 datasets (18% of strings even in the case of the SemEval-2015 training and development datasets). Of these, between 33% and 74% had completely disjoint sets of annotated CUIs between the 2 datasets compared. While many of these cases reflected hierarchical differences, a significant number involved truly distinct senses between datasets.

UMLS match consistently fails to yield all annotated CUIs across combined datasets. Reflecting our earlier observations within individual datasets, word-level UMLS matching was able to fully retrieve all CUIs in the combined annotation set for a fair portion of shared strings (42%-55% in within-dataset comparisons; 54%-85% in cross-corpus comparisons). However, it failed to retrieve any of the combined CUIs for 26%-54% of the shared strings.

Figure 4 illustrates changes in ambiguity for shared strings between the dataset pairs, in terms of how many strings had nonidentical annotated CUI sets, how many strings in each dataset would increase in ambiguity if the CUI sets were combined, and how many of these would switch from being unambiguous to ambiguous when combining cross-dataset CUI sets. We found that of the sets of strings shared between any pair of datasets with nonidentical CUI annotations, between 50% and 100% of the strings in each of these sets were annotated with at least 1 CUI in one of the datasets that was not present in the other. Further, up to 66% of the strings with any annotation differences went from being unambiguous to ambiguous when CUI sets were combined across the dataset pairs. Finally, we found that up to 89% of the strings that had fully disjoint CUI sets between the 2 datasets were originally unambiguous in each dataset, indicating that memorizing term-CUI normalization would work perfectly in each dataset but fail entirely on the other.

Ambiguity typology
We identified 12 distinct causes of the ambiguity observed in the datasets, organized into 5 broad categories. Table 3 presents our typology, with examples of each ambiguity type; brief descriptions of each overall category are provided subsequently. We refer the interested reader to the Supplementary Appendix for a more in-depth discussion.

Polysemy
We combined homonymy (completely disjoint senses) and polysemy (distinct but related senses)42,43 under the category of Polysemy for our analysis. While we observed instances of both homonymy and polysemy, we found no actionable reason to differentiate between them, particularly as other phenomena causing polysemy (eg, metonymy, hyponymy) were covered by other categories. Thus, the Polysemy category captured cases in which more specific phenomena were not observed and the annotated CUIs were clearly distinct from one another. As there is extensive literature on resolving abbreviations and acronyms,31–35 we treated cases involving abbreviations as a dedicated subcategory (Abbreviation; our other subcategory was Nonabbreviation).

Metonymy
Clinical language is telegraphic, meaning that complex concepts are often referred to by simpler associated forms. Normalizing these
Figure 3. Generalization analysis for medical concept normalization annotations, in 3 settings: (A, B) between training and development sets in the same datasets, (C, D) between 2 datasets drawn from the same electronic health record corpus (both from the ShARe corpus), and (E, F) across annotated corpora. The first column illustrates the number of unique strings in each sample set in the pair being analyzed, along with the number of strings present in both. The second column shows the subsets of these shared strings in which the sample sets use at least 1 different concept unique identifier (CUI) for the same string, and the number of strings in which all CUIs are different between the 2 sample sets. The third column shows for how many of the shared strings the Unified Medical Language System (UMLS) matching with word search identifies some or all of the CUIs annotated for a given string between both sample sets.
Figure 4. Analysis of concept unique identifier (CUI) sets for shared strings in medical concept normalization generalization between datasets, in 3 settings: (A, B) between training and development sets in the same datasets, (C, D) between 2 datasets drawn from the same electronic health record corpus (both from the ShARe corpus), and (E, F) across annotated corpora. The left-hand column illustrates (1) the number of shared strings with differences in their CUI annotations; (2) the proper subset of these strings, within each dataset, in which adding the CUIs from the other dataset would expand the set of CUIs for this string; and (3) the proper subset of these strings where a string is unambiguous within one or the other dataset but becomes ambiguous when CUI annotations are combined. The right-hand column displays the portion of shared strings with disjoint CUI set annotations between the 2 datasets in which the string is unambiguous in each of the datasets independently.
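The quantities plotted in Figures 3 and 4 reduce to set operations over per-dataset string-to-CUI-set maps. A minimal sketch, in which the two annotation maps are invented stand-ins for real corpora (the CUIs are taken from Table 3):

```python
def compare_annotations(ann_a, ann_b):
    """Figure 3/4-style comparison of string -> CUI-set maps from 2 datasets."""
    shared = ann_a.keys() & ann_b.keys()
    differing = {s for s in shared if ann_a[s] != ann_b[s]}
    disjoint = {s for s in differing if not (ann_a[s] & ann_b[s])}
    # Unambiguous in each dataset alone, but ambiguous once CUI sets are pooled:
    # two differing singleton sets necessarily union to 2 CUIs.
    newly_ambiguous = {s for s in differing
                       if len(ann_a[s]) == 1 == len(ann_b[s])}
    return shared, differing, disjoint, newly_ambiguous

# Invented toy annotations.
a = {"depression": {"C0011570"}, "ca": {"C0006826"}, "edema": {"C0013604"}}
b = {"depression": {"C0011581"}, "ca": {"C0006826", "C0201925"}, "injury": {"C0175677"}}
shared, differing, disjoint, newly = compare_annotations(a, b)
```

Here "depression" illustrates the memorization failure described above: it is unambiguous in each toy dataset separately, but the two datasets use fully disjoint CUIs for it.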
Table 3. Ambiguity typology derived from SemEval-2015, CUILESS2016, and n2c2 2019 MCN corpora

Polysemy
  Abbreviation: Abbreviations or acronyms with distinct senses.
    "Family hx of breast [ca], emphysema" → C0006826 Malignant Neoplasms
    "BP 137/80 na 124 [ca] 8.7" → C0201925 Calcium Measurement
  Nonabbreviation: Term ambiguity other than abbreviations or acronyms.
    "BP was [elevated] at last 2 visits" → C0205250 High (qualitative)
    "Her leg was [elevated] after surgery" → C0439775 Elevation procedure
Metonymy
  Procedure vs Concept: Distinguishes between a medical concept and the procedure or action used to analyze/effect that concept.
    "[Rhythm] revealed sinus tachycardia" → C0199556 Rhythm ECG (Procedure)
    "The [rhythm] became less stable" → C0577801 Heart rhythm (Finding)
  Measurement vs Substance: Distinguishes between a physical substance and a measurement of that substance.
    "Pt blood work to check [potassium]" → C0032821 Potassium (Substance)
    "Sodium 139, [potassium] 4.7" → C0202194 Potassium Measurement
  Symptom vs Diagnosis: Distinguishes between a finding being marked as a symptom or a (possibly diagnosed) disorder.
    "Current symptoms include [depression]" → C0011570 Mental Depression
    "Hx of chronic [depression]" → C0011581 Depressive disorder
  Other: All other types of metonymy.
    "Transfusion of [blood]" → C0005767 Blood (Body Substance)
    "Discovered [blood] at catheter site" → C0019080 Hemorrhage
Specificity
  Hierarchical: Combines hyponymy and meronymy; corresponds to taxonomic UMLS relations.
    "Cardiac: family hx of [failure]" → C0018801 Heart Failure
    ". . . in left ventricle. This [failure]. . ." → C0023212 Left-sided heart failure
  Recurrence/Number: Distinguishes between singular and plural forms of a finding, or one episode and recurrent episodes.
    "No [injuries] at admission" → C0175677 Injury
    "Brought to emergency for his [injuries]" → C0026771 Multiple trauma
Synonymy
  Propositional Synonyms: For a general-purpose application, the set of CUIs are not meaningfully distinct from one another.
    "Negative skin [jaundice]" → C0022346 Icterus
    "Increased girth and [jaundice]" → C0476232 Jaundice
  Co-taxonyms: The CUIs are (conceptually or in the UMLS) taxonomic siblings; often overspecification.
    "2mg [percodan]" → C0717448 Percodan
    "2mg [percodan]" → C2684258 Percodan (reformulated 2009)
Error
  Semantic: Erroneous CUI assignment, due to misinterpretation, confusion with a nearby concept, or other cause.
    "Open to air with no [erythema]" → C0041834 Erythema
    "Edema but no [erythema]" → C0013604 Edema
  Typos: One CUI is a typographical error when attempting to enter the other (ie, no real ambiguity).
    "[Neoplasm] is adjacent" → C0024651 Malt Grain (Food)
    "Infection most likely [neoplasm]" → C0027651 Neoplasms

Short definitions are provided for each subcategory, along with 2 samples of an example ambiguous string and their normalizations using UMLS CUIs. For a more detailed discussion, see the Supplementary Appendix.
CUI: concept unique identifier; UMLS: Unified Medical Language System.
references requires inference from their context: for example, a reference to "sodium" within lab readings implies a measurement of sodium levels, a distinct concept in the UMLS. It is noteworthy that in some cases, examples of the Metonymy category may be considered as annotation errors, illustrating the complexity of metonymy in practice; for example, the case of "Sodium 139, [potassium] 4.7" included in Table 3, annotated as C0032821 Potassium (substance), would be better annotated as C0428289 Finding of potassium level. As these concepts are semantically related (while ontologically distinct), we included such cases in the category of Metonymy. We observed 3 primary trends in metonymic annotations: reference to a procedure by an associated biological property (Procedure vs Concept), mention of a biological substance to refer to its measurement (Measurement vs Substance), and the fact that many symptomatic findings can also be formal diagnoses (Symptom vs Diagnosis; eg, "emphysema," "depression"). Other examples of Metonymy falling outside these trends were placed in the Other subcategory.

Specificity
The rich semantic distinctions in the UMLS (eg, phenotypic variants of a disease) lead to frequent ambiguity of Specificity. The ambiguity was often taxonomic, captured as Hierarchical; the other pattern observed was ambiguity in grammatical number of a finding, typically due to inflection (eg, "no injuries" meaning not a single injury) or recurrence (denoted Recurrence/Number).

Synonymy
Many strings were annotated with CUIs that were effectively synonymous; we therefore followed Cruse's42 definition of Propositional Synonymy, in which ontologically distinct senses nonetheless yield the same propositional interpretation of a statement. We also included Co-taxonymy in this category, typically involving annotation with either overspecified CUIs or CUIs separated only by negation.

Error
A small number of ambiguity cases were due to erroneous annotations stemming from 2 causes: (1) typographical errors in data entry (Typos) and (2) selection of an inappropriate CUI (Semantic).

Ambiguity types in each dataset
As with our measurements of string ambiguity, we excluded all dataset samples annotated as "CUI-less" for analysis of ambiguity type, as these reflect annotation challenges beyond the ambiguity level.
Table 4. Results of ambiguity type analysis, showing the number of unique ambiguous strings assigned to each ambiguity type by dataset, along with the total number of dataset samples in which those strings appear

                                          SemEval-2015       CUILESS2016        n2c2 2019
Category     Subcategory                  Strings  Samples   Strings  Samples   Strings  Samples
Polysemy     Abbreviation                 4        59        6        178       7        33
             Nonabbreviation              2        2         12       302       6        28
Metonymy     Procedure vs Concept         0        0         7        25        9        23
             Measurement vs Substance     0        0         0        0         9        93
             Symptom vs Diagnosis         20       62        20       166       2        5
             Other                        2        3         6        22        5        29
Specificity  Hierarchical                 50       103       87       776       7        26
             Recurrence/Number            8        24        3        6         0        0
Synonymy     Propositional Synonyms       23       26        64       354       8        26
             Co-taxonyms                  9        11        64       837       4        13
Error        Typos                        25       25        0        0         0        0
             Semantic                     8        11        22       109       1        1
Total (unique)                            148      326       273      2775      58       295

Some strings were assigned multiple ambiguity types, and are counted for each; the number of affected samples was estimated for each type in these cases. The sample counts given for error subcategories represent the actual count of misannotated samples. The total number of unique ambiguous strings and associated samples analyzed in each dataset is presented in the last row.
Figure 5. Distribution of ambiguity types within each dataset, in terms of (A) the unique strings assigned each ambiguity type and (B) the number of samples in
which those strings occur. The number of strings and samples belonging to each typology category is shown within each bar portion.
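As an arithmetic cross-check on Figure 5A, the category-level shares can be recomputed from the unique-string counts in Table 4; the SemEval-2015 column is shown. Note that the per-type counts sum to 151 rather than the 148 unique strings, because 3 SemEval-2015 strings carry multiple ambiguity types and are counted once per type.

```python
# Unique ambiguous string counts per subcategory for SemEval-2015 (from Table 4).
semeval_strings = {
    ("Polysemy", "Abbreviation"): 4, ("Polysemy", "Nonabbreviation"): 2,
    ("Metonymy", "Procedure vs Concept"): 0, ("Metonymy", "Measurement vs Substance"): 0,
    ("Metonymy", "Symptom vs Diagnosis"): 20, ("Metonymy", "Other"): 2,
    ("Specificity", "Hierarchical"): 50, ("Specificity", "Recurrence/Number"): 8,
    ("Synonymy", "Propositional Synonyms"): 23, ("Synonymy", "Co-taxonyms"): 9,
    ("Error", "Typos"): 25, ("Error", "Semantic"): 8,
}

# Aggregate subcategory counts up to the 5 broad categories.
by_category = {}
for (category, _), n in semeval_strings.items():
    by_category[category] = by_category.get(category, 0) + n

total = sum(by_category.values())  # 151 type assignments over 148 unique strings
shares = {c: n / total for c, n in by_category.items()}
```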
However, we retained samples with annotation errors and CUIs remapped within the UMLS, as these samples inform MCN evaluation in these datasets, and ambiguity type analysis did not require direct comparison to string-CUI associations in the UMLS. This increased the number of ambiguous strings in SemEval-2015 from 132 to 148; ambiguous string counts in CUILESS2016 and n2c2 2019 were not affected. Table 4 presents the frequency of each ambiguity type across our 3 datasets. All but 21 strings (3 in SemEval-2015, 18 in CUILESS2016) exhibited a single ambiguity type (ie, all CUIs were related in the same way). To compare the distribution of ambiguity categories across datasets, we visualized their relative frequency in Figure 5. Polysemy and Metonymy strings were most common in n2c2 2019, while Specificity was the plurality category in SemEval-2015 and Synonymy was most frequent in CUILESS2016. The sample-wise distribution, included in Table 4, followed the string-wise distribution, except for Polysemy, which included multiple high-frequency strings in SemEval-2015 and CUILESS2016.

Finally, we visualized the proportion of strings within each ambiguity type considered arbitrary (at the sample level) during annotation, shown in Figure 6. Arbitrary rates varied across datasets, with the fewest cases in SemEval-2015 and the most in n2c2 2019. Metonymy (Symptom vs Diagnosis), Specificity (Hierarchical), and Synonymy (Co-taxonyms) were all arbitrary in more than 50% of cases.

Figure 6. Percentage of ambiguous strings in each ambiguity type annotated as arbitrary, by dataset. Synonymy (Propositional Synonyms) and both Error subcategories are omitted, as they are arbitrary by definition.

DISCUSSION
Ambiguity is a key challenge in medical concept normalization. However, relatively little research on ambiguity has focused on clinical language. Our findings demonstrate that clinical language exhibits distinct types of ambiguity, such as clinical patterns in metonymy and specificity, in addition to well-studied problems such as abbreviation expansion. These results highlight 3 key gaps in the literature for MCN ambiguity: (1) a significant gap between the potential ambiguity of medical terms and their observed ambiguity in EHR datasets, creating a need for new ambiguity-focused datasets; (2) a need for MCN evaluation strategies that are sensitive to the different kinds of relationships between concepts observed in our ambiguity
typology; and (3) underutilization of the extensive semantic resources of the UMLS in recent MCN methodologies. We discuss each of these points in the following sections, and propose specific next steps toward closing these gaps to advance the state of MCN research. We conclude by noting the particular role of representative data in the deep learning era and providing a brief discussion of the limitations of this study that will inform future research on ambiguity in MCN.

The next phase of research on clinical ambiguity needs dedicated datasets
The order-of-magnitude difference between the number of CUIs annotated for each string in our 3 datasets and the number of CUIs found through word match to the UMLS suggests that our current data resources cover only a small subset of medically relevant ambiguity. Differences in ambiguity across multiple datasets provide some improvement in addressing this coverage gap and clearly indicate the value of evaluating new MCN methods on multiple datasets to improve ambiguity coverage. However, the ShARe and MCN corpora were designed to capture an in-depth sample of clinical language, rather than a sample with high coverage of specific challenges like ambiguity. As MCN research continues to advance, more focused datasets capturing specific phenomena are needed to support development and evaluation of methodologies to resolve ambiguity. Savova et al25 followed the protocol used in designing the biomedical NLM WSD corpus24 to develop a private dataset containing a set of highly ambiguous clinical strings; adapting and expanding this protocol with resources such as MIMIC-III54 offers a proven approach to collect powerful new datasets.

Distinct ambiguity phenomena in MCN call for different evaluation strategies
MCN systems are typically evaluated in terms of accuracy,39,55 calculated as the proportion of samples in which the predicted CUI exactly matched the gold CUI. On this view, a predicted CUI is either exactly right or completely wrong. However, as illustrated by the distinct ambiguity types we observed, in many cases a CUI other than the gold label may be highly related (eg, "Heart failure" and "Left-sided heart failure"), or even propositionally synonymous. As methodologies for MCN improve and expand, alternative evaluation methods leveraging the rich semantics of the UMLS can help to distinguish a system with a related misprediction from a system with an irrelevant one. A wide variety of similarity and relatedness measures that utilize the UMLS to compare medical concepts have been proposed,72–75 presenting a fruitful avenue for development of new MCN evaluation strategies.

It is important to note, however, that equivalence classes and similarity measures will often be task or domain specific. For example, 2 heart failure phenotypes may be equivalent for presenting summary information in an EHR dashboard but may be highly distinct for cardiology-specific text mining or applications with detailed requirements such as clinical trial recruitment. While dedicated evaluation metrics for each task would be impractical, a trade-off between generalizability and sensitivity to the needs of different applications represents an area for further research.

The UMLS offers powerful semantic tools for high-coverage candidate identification
Our cross-dataset comparison clearly demonstrates the value of utilizing inclusive UMLS-based matching to identify a high-coverage set of candidate CUIs for a medical concept, though the lack of 100% coverage reinforces the value of ongoing research on synonym identification.60 Inclusive matching, of course, introduces additional noise: luiNorm can overgenerate semantically invalid variants due to homonymy,76 such as mapping "wound" in "injury or wound" to "wind," and mapping both "left" and "leaves" to "leaf"; word-level search, meanwhile, requires very little to yield a match and generates very large candidate sets, such as 120 different