MeSHup: A Corpus for Full Text Biomedical Document Indexing

Xindi Wang 1,3, Robert E. Mercer 1,3, Frank Rudzicz 2,3,4
1 Department of Computer Science, University of Western Ontario, London, Ontario, Canada
2 Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
3 Vector Institute, Toronto, Ontario, Canada
4 Unity Health Toronto, Toronto, Ontario, Canada
xwang842@uwo.ca, mercer@csd.uwo.ca, frank@cs.toronto.edu
Abstract

Medical Subject Heading (MeSH) indexing refers to the problem of assigning a given biomedical document the most relevant labels from an extremely large set of MeSH terms. Currently, the vast number of biomedical articles in the PubMed database is manually annotated by human curators, which is time consuming and costly; therefore, a computational system that can assist the indexing is highly valuable. When developing supervised MeSH indexing systems, the availability of a large-scale annotated text corpus is desirable, and a publicly available, large corpus that permits robust evaluation and comparison of various systems is important to the research community. We release a large-scale annotated MeSH indexing corpus, MeSHup, which contains 1,342,667 full text articles in English, together with the associated MeSH labels and metadata, authors, and publication venues collected from the MEDLINE database. We train an end-to-end model that combines features from documents and their associated labels on our corpus and report the new baseline.

Keywords: MeSH Indexing, Multi-label text classification

(arXiv:2204.13604v1 [cs.CL] 28 Apr 2022)

1. Introduction

MEDLINE (https://www.nlm.nih.gov/medline/medline_overview.html), the National Library of Medicine's (NLM, https://www.nlm.nih.gov) premier bibliographic database, comprises 33 million (as of Nov. 2021) references to journal articles in the life sciences, with a concentration on biomedicine. It includes textual information (title and abstract) and bibliographic information for articles from academic journals covering various disciplines of the life sciences and biomedicine. PubMed (https://pubmed.ncbi.nlm.nih.gov/about/) is a search engine that provides free access to the MEDLINE database. In addition to MEDLINE, PubMed also provides access to the PubMed Central (PMC, https://en.wikipedia.org/wiki/PubMed_Central) repository, which archives open-access, full-text scholarly articles in biomedical and life sciences journals. All records in the MEDLINE database are indexed with Medical Subject Headings (MeSH, https://www.nlm.nih.gov/mesh/meshhome.html), a controlled and hierarchically organized vocabulary produced and maintained by the NLM. As of 2021, there are 29,369 main MeSH headings, and each citation is indexed with 13 MeSH terms on average. MeSH headings can be further qualified by 83 subheadings (also known as qualifiers). In addition, Supplementary Concept Records (SCRs) refer to specific chemical substances in the MEDLINE records.

MeSH indexing, a process that annotates documents with concepts from established semantic taxonomies and ontologies, is important for biomedical text classification and information retrieval. MEDLINE citations are indexed by human annotators who read the full text of an article and assign the most relevant MeSH labels to it. This manual annotation process ensures high-quality indexing but, inevitably, the cost can be prohibitive. There has been a steady and sizeable increase in the number of citations added to the MEDLINE database every year; for instance, in 2020, 952,919 articles were added (approximately 2,600 per day; see https://www.nlm.nih.gov/bsd/medline_pubmed_production_stats.html), and the average cost of annotation per document is approximately $9.40 (Mork et al., 2013). Faced with this growing workload, NLM annotators remain tasked with indexing newly published articles efficiently and promptly. There is therefore a pressing need for automated support for indexing the biomedical literature.

Many state-of-the-art models have been proposed for MeSH indexing; however, these automatic indexing systems share a clear drawback stemming from the data used to train them. Existing corpora for MeSH indexing only
provide the title and abstract, while human annotators review full text articles. This suggests that important information may be contained in the full text, available to the human indexer, that does not appear in the title and abstract that automatic indexing systems use to make recommendations. Mork et al. (2017) further indicated that some sections of the full text that contain very specific information, such as the "Methods" section, help improve the performance of automatic models. Thus, a corpus that contains full text articles and their associated MeSH labels is highly desirable.

In this work, we construct a new labeled full text MeSH indexing dataset (https://github.com/xdwang0726/MeSHup), MeSHup, in which each entry is a mashup of the PMID, title, abstract, journal, year, author list, MeSH terms, chemical list, and Supplementary Concept Records from MEDLINE, together with the full text introduction, methods, results, discussion, figure captions, and table captions available from BioC-PMC (Comeau et al., 2019). To the best of our knowledge, MeSHup is the first publicly available (and the largest) full text dataset annotated for MeSH indexing. We also propose a multi-channel model that incorporates features extracted from different sections of the full text and report its baseline results.

2. Related Work

2.1. Biomedical Corpora Related to MeSH Terms

There are several corpora in the biomedical domain that contain or make use of MeSH terms. The OHSUMED test collection (Hersh et al., 1994) is a set of 348,566 clinically oriented references in the MEDLINE database, obtained from 270 medical journals covering the years 1987 to 1991. For each citation, the collection contains the title, abstract, MeSH indexing terms, author, source, and publication type. The OHSUMED collection is one of the earliest corpora related to the MeSH indexing task. The GENIA corpus (Kim et al., 2003) is a valuable resource in the biomedical literature that was created to support the development and evaluation of text mining and information retrieval tools. It contains 2,000 abstracts taken from the MEDLINE database with a variety of entity types in the GENIA Chemical ontology that are derived from MeSH terms. The CHEMDNER corpus (Krallinger et al., 2014) contains 10,000 PubMed abstracts and 84,355 manually annotated chemical entities. CHEMDNER labels entities based on the Chemicals and Drugs branch of the MeSH hierarchy and the MeSH substances. The BioCreative V Chemical-Disease Relation Task Corpus (BC5CDR) (Li et al., 2015) was developed for the BioCreative V challenge. A team of MeSH indexers for disease/chemical entity annotation and Comparative Toxicogenomics Database (CTD) curators for CID relation annotation were invited to ensure high annotation quality and productivity; detailed annotation guidelines and automatic annotation tools were provided. The resulting corpus consists of 1,500 PubMed articles with 4,409 annotated chemicals, 5,818 diseases, and 3,116 chemical-disease interactions. Each entity annotation includes both the mention text spans and normalized concept identifiers, using MeSH as the controlled vocabulary. The NLM-CHEM corpus (Islamaj et al., 2021), created for Track 2 of BioCreative VII, consists of 150 full text articles with chemical entity annotations provided by human experts for 5,000 unique chemical names, mapped to 2,000 MeSH identifiers. This dataset is compatible with the CHEMDNER and BC5CDR corpora described above.

2.2. Related Full-Text Biomedical Corpora

A number of full-text corpora have been created in the biomedical domain. Each has been generated for a specific purpose and has consequently been annotated with that purpose in mind. The earliest is a small corpus of five biomedical papers (Gasperin et al., 2007) designed to capture anaphora; its articles are annotated with anaphoric links between coreferent and associative (biotype, homolog, and set-member) noun phrases referring to the biomedical entities of interest to the authors. For BioCreative I task 2 (Hirschman et al., 2005), the training set comprised 803 full text articles from four different journals that had been previously annotated for human protein function, and the test set comprised 212 full text articles. The CRAFT (Colorado Richly Annotated Full-Text) corpus (Verspoor et al., 2012) is a collection of 97 articles from the PubMed Central Open Access subset; each article has been manually annotated along structural, coreference, and concept dimensions. More recently, the BioC-BioGRID corpus (Dogan et al., 2017) comprises full text articles annotated for protein-protein and genetic interactions.
2.3. Automatic MeSH Indexing Based on Title and Abstract

As discussed, there is rapid growth in the number of articles in MEDLINE, and the NLM has developed an indexing tool, the Medical Text Indexer (MTI), to recommend MeSH terms in automated and semi-automated modes (Aronson et al., 2004). MTI takes the title and abstract of an input article and generates recommended MeSH terms, using a ranking algorithm to determine the final suggestions. There are two important components in MTI: MetaMap Indexing (MMI) and PubMed-Related Citations (PRC) (Lin and Wilbur, 2007; Aronson and Lang, 2010). MetaMap recommends MeSH terms by mapping biomedical concepts in the title and abstract of the input article to the Unified Medical Language System (UMLS, https://www.nlm.nih.gov/research/umls/). PRC suggests MeSH terms by looking at similar annotations in MEDLINE using k-nearest neighbours. The two sets of recommended MeSH terms are combined to generate the final MeSH list. Since 2013, BioASQ (http://bioasq.org) (Tsatsaronis et al., 2015) has organized the biomedical semantic indexing challenge, which offers the opportunity for more participants to get involved in the MeSH indexing task. BioASQ provides annotated PubMed articles with the title and abstract only, and participants can tune their annotation models accordingly. Many effective indexing systems have been proposed since then, such as MeSHLabeler (Liu et al., 2015), DeepMeSH (Peng et al., 2016), AttentionMeSH (Jin et al., 2018), MeSHProbeNet (Xun et al., 2019), and KenMeSH (Wang et al., 2022). MeSHLabeler and DeepMeSH are based on a Learning-to-Rank (LTR) framework. AttentionMeSH and MeSHProbeNet both utilize deep recurrent neural networks (RNNs) and attention mechanisms; the main difference is that the former uses label-wise attention while the latter employs multi-view self-attentive MeSH probes. KenMeSH combines text features and the MeSH label hierarchy by using a dynamic knowledge-enhanced mask to index MeSH terms.

2.4. MeSH Indexing Based on Full Text

Because of the limitations of full text access, MeSH indexing with full texts has been studied using relatively small sets of data or restricted to small numbers of specific MeSH terms. Gay et al. (2005) collected 500 articles from 17 journal issues in the PubMed database and used the full text as input to MTI. They found that using the full text of an article provides significantly better (7.4%) quality of automatic indexing than using only abstracts and titles. Jimeno-Yepes et al. (2012) used a collection of 1,413 biomedical articles randomly selected from the PMC Open Access Subset. They first ran automatic summarization over the full text and then used the generated summaries as input to MTI. The experimental results showed that incorporating full texts achieved higher (6%) recall with a trade-off in precision compared to using the abstracts and titles only. Demner-Fushman and Mork (2015) collected 14,829 citations and used a rule-based method to classify Check Tags, a small set of 29 MeSH terms that represent characteristics of the subjects. Wang and Mercer (2019) released a full text dataset curated from PMC with 257,590 articles and employed a multi-channel CNN-based feature extraction model. FullMeSH (Dai et al., 2019) and BERTMeSH (You et al., 2020) used 1.4M full text articles, the former with an attention-based CNN and the latter with pre-trained contextual embeddings together with an attention mechanism. Unfortunately, neither made its dataset available, which makes it difficult for others to compare and evaluate their work.

3. Dataset Construction

In this section, we describe how we construct the dataset from the PubMed Central Open Access collection in BioC format (BioC-PMC, https://www.ncbi.nlm.nih.gov/research/bionlp/APIs/BioC-PMC/) (Comeau et al., 2019) and the MEDLINE/PubMed Annual Baseline Repository (MBR, https://lhncbc.nlm.nih.gov/ii/information/MBR.html). We download the entire BioC-PMC subset (as of Nov. 2021) and obtain 3,601,092 full text articles. We also download the entire MBR collection (as of Nov. 2021) and obtain 31,850,051 citations with metadata in the MEDLINE database. In order to reduce bias, we only consider articles indexed by human annotators; that is, articles in the MEDLINE database whose indexing mode is marked as 'curated' (MeSH terms were provided algorithmically and then human reviewed) or 'auto' (MeSH terms were provided algorithmically) are not considered (https://www.nlm.nih.gov/bsd/licensee/elements_descriptions.html). We also focus only on articles written in English (i.e.,
only articles annotated as 'eng' in the MEDLINE database are considered). We then match BioC-PMC articles with MBR citations using the PubMed ID (PMID) and obtain a set of 1,342,667 biomedical documents.

Information extracted from BioC-PMC. Each article in the BioC-PMC subset is structured in a single XML file. The original published articles are formatted in various ways depending on the publisher; with the BioC format (Comeau et al., 2019), an article's textual information is preserved and each article is organized in a unified structure. We parse the tags in the BioC-formatted XML files to get the section names and their corresponding texts. We then divide and normalize all full text articles into eight BioC sections: title, abstract, introduction, methods, results, discussion, figure captions, and table captions. Table 1 summarizes the statistical information for the described sections.

Sections           Number of articles   Average length
Title                   1,342,667             16
Abstract                1,342,667            258
Introduction            1,279,276            991
Methods                 1,135,757           1446
Results                 1,090,981           1640
Discussion              1,042,379           1249
Figure Captions         1,155,208            560
Table Captions            520,780            123

Table 1: Statistics of the generated dataset for each of the eight sections

Information extracted from the MEDLINE/PubMed Annual Baseline (MBR). Starting in 2002, MBR has provided access to all MEDLINE citations, and it updates and adds new citations every year. Each year's baseline contains the textual information (title and abstract) of the citations as well as various types of metadata, such as their authors, publication venues, and references. Metadata can be regarded as strong indicators for the semantic indexing task, as they carry latent information about research topics. We therefore extract the metadata of each article from MBR; the metadata and their descriptions are given in Table 2.

Metadata        Description
PMID            The PubMed (NLM database that incorporates MEDLINE) unique identifier.
Authors         Personal and collective (corporate) author names published with the article.
Journal         The journal in which the article is published.
Year            The year in which the article is published.
DOI             Digital Object Identifier.
MeSH Terms      NLM controlled vocabulary, Medical Subject Headings (MeSH).
Supply MeSH     Supplementary Concept Record (SCR) terms.
Chemical List   A list of chemical substances and enzymes.

Table 2: Metadata extracted from MBR and their descriptions

We combine the information extracted from BioC-PMC and MBR to form MeSHup, a new large-scale, full-text biomedical semantic indexing dataset. Specifically, each article in the dataset contains the full textual information and the metadata associated with it. The keys (section descriptors) from MeSHup are shown in JSON format in Fig. 1, and an abbreviated version of the dataset is provided in the Appendix.

{"articles":[
    {"PMID": ,
     "TITLE": ,
     "ABSTRACT": ,
     "INTRO": ,
     "METHODS": ,
     "RESULTS": ,
     "DISCUSS": ,
     "FIG_CAPTIONS": ,
     "TABLE_CAPTIONS": ,
     "JOURNAL": ,
     "YEAR": ,
     "DOI": ,
     "AUTHORS": ,
     "MeSH": ,
     "CHEMICALS": ,
     "SUPPLMeSH":
    },
    {
     ...
    },
    ...
]}

Figure 1: The section descriptors given as JSON keys.

Our goal in releasing MeSHup is to promote large-scale ontological classification of biomedical documents, using full texts, across the community. With MeSHup, researchers can explore and test state-of-the-art indexing systems against a common standard.

4. Experiments

Given the full text of a biomedical article, MeSH indexing can be regarded as a multi-label text classification problem. The learning framework is defined as follows. X = {x1, x2, ..., xn} is a set
of biomedical documents and Y = {y1, y2, ..., yL} is the set of MeSH terms. Multi-label classification studies the learning function f : X → [0, 1]^Y using the training set D = {(xi, Yi), i = 1, ..., n}, where n is the number of documents in the set and Yi ⊂ Y is the set of MeSH labels for document xi. The objective of MeSH indexing is to predict the correct MeSH labels for any unseen article xk, where xk is not in X.

4.1. Baseline Model

In general, traditional MeSH indexing systems focus on document features only, and thereby suffer from a lack of information about the MeSH hierarchy. To address this, we present a hybrid document-label feature method composed of a multi-channel document representation module, a label feature representation module, and a classifier. The overall model architecture is shown in Figure 2.

Figure 2: Model Architecture - There are three main components in our method. First, a multi-channel document representation module operates on each section of an input article. Second, a 2-layer GCN creates label vectors. Lastly, a label-wise attention component calculates the label-specific attention vectors that are used for predictions.

4.1.1. Multi-channel Document Representation Module

The multi-channel document representation module has five input channels: the title and abstract channel, the introduction channel, the methods channel, the results channel, and the discussion channel. Texts in each channel are represented by embedding matrices E_C, in which each word is mapped to a d-dimensional word embedding.

Instead of a recurrent model, which poses computational and technical issues, we apply a multi-level dilated convolutional neural network (DCNN) to each channel in order to capture long-distance effects with a CNN model. To be specific, our DCNN is a three-layer, one-dimensional convolutional neural network (CNN) with dilated convolution kernels, which obtains high-level semantic representations of texts without increasing the computation. The concept of dilated convolution has been popular in semantic segmentation in computer vision in recent years (Yu and Koltun, 2016; Li et al., 2018), and it has been introduced to sequential data (Bai et al., 2018), especially in NLP for neural machine translation (Kalchbrenner et al., 2017) and text classification (Lin et al., 2018). Dilated convolution enables exponentially large receptive fields over the embedding matrix, which capture long-term dependencies over the input texts.

We apply a multi-level DCNN with different dilation rates on top of the embedding matrix of each channel. Larger dilations cover a wider range of inputs and capture sentence-level information, whereas smaller dilations capture word-level information. The semantic features returned by the DCNN for each channel are denoted D_C ∈ R^((l−s+1)×2d), where l is the sequence length in channel C and s is the width of the convolution kernels.

4.1.2. Label Feature Representation Module

Graph convolutional neural networks (GCNs) (Kipf and Welling, 2017) have attracted wide attention recently. They have been effective in tasks that have rich relational structures, as GCNs preserve global information within the graph. Traditional multi-label text classification mainly focuses on local consecutive word sequences; for instance, two deep networks commonly used for building text representations after learning word embeddings are convolutional (CNNs) and recurrent neural networks (RNNs) (Kim, 2014; Tai et al., 2015). Some recent studies have explored GCNs for text classification, viewing either a document or a sentence as a graph of word nodes (Peng et al., 2018; Yao et al., 2019). MeSH labels are arranged hierarchically, an example of which is shown in Figure 3. We take advantage of the structured knowledge we have about the parent and child relationships in the MeSH label space by using a GCN. In the label graph setting, we formulate each MeSH label in Y as a node in the graph. The edges represent parent and child relationships among the MeSH terms. The edge types of a node comprise edges from itself, from its parent, and from its
children. We first take the average of the word embeddings of the words in the MeSH label descriptor as the initial label embedding v_i ∈ R^d:

    v_i = (1/m_i) Σ_{j=1}^{m_i} w_j,  i = 1, 2, ..., L,    (1)

where m_i is the number of words in the descriptor of label i, w_j is the word embedding of word j, and L is the number of labels. The GCN layers capture information about immediate neighbours with one layer of convolution, and information from a larger neighbourhood can be integrated by using a multi-layer stack. We use a two-layer GCN to incorporate the hierarchical information among MeSH terms. At each GCN layer, we aggregate the parent and child nodes

Figure 3: An example of the MeSH hierarchy (retrieved Nov. 2021). For example, under Body Regions, there are specific body regions under the MeSH tree.

In order to match documents to their corresponding label vectors, we employ label-wise attention following Mullenbach et al. (2018). We generate label-specific representations for each channel:

    α_C = Softmax(D_C · H_label^T),  content_C = α_C^T · D_C,    (4)

where content_C ∈ R^(L×d). We then sum the label-specific content representations over the channels to form the final document representation for each article:

    D = Σ_C content_C    (5)

We then generate the predicted score for each MeSH label via

    ŷ_i = σ(D ⊙ H_label),  i = 1, 2, ..., L,    (6)

where σ(·) denotes the sigmoid function. Training our proposed method uses binary cross-entropy as the loss function:

    L = Σ_{i=1}^{L} [−y_i · log(ŷ_i) − (1 − y_i) · log(1 − ŷ_i)],    (7)

where y_i ∈ {0, 1} is the ground truth of label i, and ŷ_i ∈ [0, 1] denotes the prediction for label i obtained from the proposed model.

4.2. Evaluation Metrics

We evaluate the performance of MeSH indexing systems using two groups of measurements: bipartition-based evaluation and ranking-based evaluation. Bipartition evaluation is further di-
for the ith label to form the new label embedding           vided into example-based and label-based met-
for the next layer:                                         rics. Example-based measures, computed per
                                                            data point, compute the harmonic mean of stan-
              hl+1 = σ(A · hl · W l ),                (2)   dard precision (EBP) and recall (EBR) for each
                                                            data point. The metrics are defined as:
where hl and hl+1 ∈ RL×d indicate the node pre-
sentations of the lth and (l + 1)th layers, σ(·) de-                                2 × EBR × EBP
                                                                           EBF =                  ,                 (8)
notes an activation function, A is the adjacency                                      EBR + EBP
matrix of the MeSH hierarchical graph, and W l
                                                            where
is a layer-specific trainable weight matrix. Next,                                      N
                                                                                     1 X |yi ∩ yˆi |
we sum both the averaged description vector                                 EBP =                    ,              (9)
from Equation 1 with the GCN output to form:                                         N i=1 |yˆi |

                                                                                        N
                 Hlabel = v + hl+1 ,                  (3)                            1 X |yi ∩ yˆi |
                                                                            EBR =                    ,            (10)
                                                                                     N i=1 |yi |
                   L×d
where Hlabel ∈ R         is the final label matrix.
                                                            where yi is the true label set and yˆi is the pre-
4.1.3. Classifier                                           dicted label set for instance i, and N represents
For large multi-label text classification tasks,            the total number of instances.
relevant information for each label might be                We perform label-based evaluation for each la-
scattered in various locations of the article. In           bel in the label set. The measurements include
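The label-feature and classifier components (Equations 2–6) can be sketched in PyTorch as follows. This is a minimal illustration, not the authors' implementation: the tensor names (`doc_feats`, `label_init`, `adj`) and their sizes are assumptions, and the single-channel case is shown (so the sum over channels in Equation 5 collapses to one term).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LabelGCNAttentionHead(nn.Module):
    """Sketch: two-layer GCN label encoder + label-wise attention (Eqs. 2-6)."""

    def __init__(self, dim):
        super().__init__()
        self.w1 = nn.Linear(dim, dim, bias=False)  # W^1 in Eq. 2
        self.w2 = nn.Linear(dim, dim, bias=False)  # W^2 in Eq. 2

    def forward(self, doc_feats, label_init, adj):
        # doc_feats:  (n, d) document features for one channel, D_C
        # label_init: (L, d) averaged descriptor embeddings v (Eq. 1)
        # adj:        (L, L) adjacency of the MeSH hierarchy (self/parent/child edges)
        h = F.relu(self.w1(adj @ label_init))        # first GCN layer (Eq. 2)
        h = F.relu(self.w2(adj @ h))                 # second GCN layer
        h_label = label_init + h                     # Eq. 3: v + GCN output
        alpha = torch.softmax(doc_feats @ h_label.T, dim=0)  # Eq. 4: (n, L) attention
        content = alpha.T @ doc_feats                # (L, d) label-specific content
        # Eq. 6: per-label score from the element-wise interaction with H_label
        return torch.sigmoid((content * h_label).sum(dim=1))  # (L,)
```

Training such a head with `torch.nn.BCELoss` corresponds to the binary cross-entropy objective in Equation 7.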
4.2. Evaluation Metrics
We evaluate the performance of MeSH indexing systems using two groups of measurements: bipartition-based evaluation and ranking-based evaluation. Bipartition evaluation is further divided into example-based and label-based metrics. Example-based measures are computed per data point: the example-based F-measure (EBF) is the harmonic mean of example-based precision (EBP) and example-based recall (EBR). The metrics are defined as:

    EBF = (2 × EBR × EBP) / (EBR + EBP),    (8)

where

    EBP = (1/N) Σ_{i=1}^{N} |y_i ∩ ŷ_i| / |ŷ_i|,    (9)

    EBR = (1/N) Σ_{i=1}^{N} |y_i ∩ ŷ_i| / |y_i|,    (10)

where y_i is the true label set and ŷ_i is the predicted label set for instance i, and N represents the total number of instances.
We perform label-based evaluation for each label in the label set. The measurements include the Micro-average F-measure (MiF) and the Macro-average F-measure (MaF). MiF aggregates the global contributions of all MeSH labels and then calculates the harmonic mean of micro-average precision (MiP) and micro-average recall (MiR); it is heavily influenced by frequent MeSH terms. MaF computes the macro-average precision (MaP) and macro-average recall (MaR) for each label and then averages them, which gives equal weight to each MeSH term; therefore, frequent and infrequent MeSH terms are equally important. The aforementioned metrics are defined as follows:

    MiF = (2 × MiR × MiP) / (MiR + MiP),    (11)

where

    MiP = Σ_{j=1}^{L} TP_j / (Σ_{j=1}^{L} TP_j + Σ_{j=1}^{L} FP_j),    (12)

    MiR = Σ_{j=1}^{L} TP_j / (Σ_{j=1}^{L} TP_j + Σ_{j=1}^{L} FN_j),    (13)

    MaF = (2 × MaR × MaP) / (MaR + MaP),    (14)

where

    MaP = (1/L) Σ_{j=1}^{L} TP_j / (TP_j + FP_j),    (15)

    MaR = (1/L) Σ_{j=1}^{L} TP_j / (TP_j + FN_j),    (16)

where TP_j, FP_j and FN_j are the true positives, false positives, and false negatives, respectively, for each label l_j in the set of all labels L.
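The bipartition metrics in Equations 8–16 can be computed from predicted and gold label sets as in the following sketch; the function and variable names are illustrative, not from the paper's code.

```python
def bipartition_metrics(true_sets, pred_sets, labels):
    """Example-based (EBP/EBR/EBF) and label-based (MiF/MaF) scores.

    true_sets, pred_sets: lists of label sets, one pair per document.
    labels: the full label vocabulary.
    """
    n = len(true_sets)
    # Example-based precision/recall, averaged per document (Eqs. 9-10).
    ebp = sum(len(t & p) / len(p) for t, p in zip(true_sets, pred_sets) if p) / n
    ebr = sum(len(t & p) / len(t) for t, p in zip(true_sets, pred_sets) if t) / n
    ebf = 2 * ebp * ebr / (ebp + ebr) if ebp + ebr else 0.0

    # Per-label TP/FP/FN counts for micro/macro averaging (Eqs. 11-16).
    tp = {l: 0 for l in labels}; fp = dict(tp); fn = dict(tp)
    for t, p in zip(true_sets, pred_sets):
        for l in labels:
            if l in p and l in t: tp[l] += 1
            elif l in p: fp[l] += 1
            elif l in t: fn[l] += 1
    TP, FP, FN = sum(tp.values()), sum(fp.values()), sum(fn.values())
    mip = TP / (TP + FP) if TP + FP else 0.0
    mir = TP / (TP + FN) if TP + FN else 0.0
    mif = 2 * mip * mir / (mip + mir) if mip + mir else 0.0
    map_ = sum(tp[l] / (tp[l] + fp[l]) if tp[l] + fp[l] else 0.0 for l in labels) / len(labels)
    mar = sum(tp[l] / (tp[l] + fn[l]) if tp[l] + fn[l] else 0.0 for l in labels) / len(labels)
    maf = 2 * map_ * mar / (map_ + mar) if map_ + mar else 0.0
    return {"EBP": ebp, "EBR": ebr, "EBF": ebf, "MiF": mif, "MaF": maf}
```

Because MiF pools the counts before averaging while MaF averages per label, a model that only predicts frequent terms can score well on MiF yet poorly on MaF.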
Ranking-based evaluation includes precision at k (P@k) and recall at k (R@k). P@k measures the proportion of relevant MeSH terms among the top-k recommendations of the MeSH indexing system, and R@k indicates the proportion of relevant terms that appear in the top-k recommendations. The metrics are defined as follows:

    P@k = (1/k) Σ_{l ∈ r_k(ŷ)} y_l,    (17)

    R@k = (1/|y_i|) Σ_{l ∈ r_k(ŷ)} y_l,    (18)

where r_k returns the top-k recommended items.
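Equations 17–18 reduce to counting gold labels in the top-k ranked predictions; a minimal sketch for a single document (names illustrative):

```python
def precision_recall_at_k(scores, true_set, k):
    """P@k and R@k (Eqs. 17-18) for one document.

    scores: dict mapping each candidate label to its predicted score.
    true_set: the set of gold MeSH labels for the document.
    """
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]  # r_k(y-hat)
    hits = sum(1 for label in top_k if label in true_set)
    return hits / k, hits / len(true_set)
```

For example, with scores {"A": 0.9, "B": 0.8, "C": 0.1} and gold labels {"A", "C"}, the top-2 list is [A, B], giving P@2 = R@2 = 0.5.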
Thresholds greatly impact the bipartition-based evaluation metrics. We therefore tuned the threshold τ_i for predicting the i-th label, and selected the predicted MeSH term (MeSH_i) whose predicted probability is at least τ_i:

    MeSH_i = 1 if ŷ_i ≥ τ_i, and 0 if ŷ_i < τ_i.    (19)

We used the micro-F optimization algorithm proposed by Pal et al. (2020) to tune the thresholds:

    τ_i = argmax_T MiF(T),    (20)

where T ranges over all possible threshold values for label i.
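A simplified version of this per-label threshold search can be sketched as below; this is a coordinate-wise sweep over a small candidate grid, an assumption for illustration rather than the exact algorithm of Pal et al. (2020).

```python
def tune_thresholds(probs, truth, candidates=None):
    """Per-label threshold sweep that maximizes micro-F (Eq. 20, simplified).

    probs: list of dicts, predicted probability per label for each document.
    truth: list of sets, gold labels per document.
    """
    labels = sorted({l for p in probs for l in p})
    thresholds = {l: 0.5 for l in labels}          # flat default to start from
    candidates = candidates or [i / 10 for i in range(1, 10)]

    def micro_f(th):
        tp = fp = fn = 0
        for p, t in zip(probs, truth):
            pred = {l for l, s in p.items() if s >= th[l]}
            tp += len(pred & t); fp += len(pred - t); fn += len(t - pred)
        return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

    for l in labels:                               # tune one label at a time
        thresholds[l] = max(candidates,
                            key=lambda c: micro_f({**thresholds, l: c}))
    return thresholds
```

Note that `micro_f` here uses the pooled-counts form 2·TP / (2·TP + FP + FN), which is algebraically equal to the harmonic mean of MiP and MiR in Equation 11.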
4.3. Experiment Settings
We implement our model using PyTorch (Paszke et al., 2019). We use 200-dimensional word embeddings (BioWordVec) that are pretrained on PubMed article titles and abstracts (Zhang et al., 2019). For the model's DCNN component, we use a 1-dimensional convolution with kernel size 3 and a three-level dilated convolution with dilation rates [1, 2, 3]. The number of hidden units in both components of our model is set to 200. We use the Adam optimizer (Kingma and Ba, 2015) with a minibatch size of 8 and an initial learning rate of 0.0003 with a decay rate of 0.9 in every epoch. To avoid overfitting, we apply dropout directly after the embedding layer with a rate of 0.2 and use early stopping strategies (Yao et al., 2007). Our model is trained on a single NVIDIA A100 GPU. It takes approximately five to seven days to train the full model.

4.4. Experimental Results
In the baseline model, we are interested in the articles that have all six sections, i.e., title, abstract, introduction, methods, results, and discussion. We extract the 957,426 articles from MeSHup that meet these criteria. We use stratified sampling over publication year to split our dataset into training, validation and testing sets: 80% of the documents for training (765,920), 10% for validation (95,737), and 10% for testing (95,769).
We first conduct our experiments with titles and abstracts only, and then on the full texts. From this experiment, we would like to see how integrating full text information affects indexing performance compared with using the titles and abstracts only. Table 3 summarizes the results of the bipartition evaluation and Table 4 shows the results of the ranking-based measures.

    Bipartition evaluation        Titles and Abstracts    Full Texts
    Example-based     EBF                0.183              0.259
                      EBP                0.503              0.588
                      EBR                0.112              0.166
    Micro-averaged    MiF                0.177              0.259
                      MiP                0.473              0.604
                      MiR                0.110              0.164
    Macro-averaged    MaF                0.362              0.367
                      MaP                0.798              0.810
                      MaR                0.234              0.237

Table 3: Comparison using only titles and abstracts and full texts across bipartition evaluation. Bold: best scores in each row.

    Ranking-based measure         Titles and Abstracts    Full Texts
    P@k               P@1                0.699              0.801
                      P@3                0.462              0.609
                      P@5                0.372              0.496
                      P@10               0.260              0.341
                      P@15               0.205              0.267
    R@k               R@1                0.051              0.077
                      R@3                0.098              0.128
                      R@5                0.131              0.171
                      R@10               0.180              0.232
                      R@15               0.214              0.272

Table 4: Comparison using only titles and abstracts and full texts across ranking-based measures. Bold: best scores in each row.

We can see substantial improvements on all evaluation metrics when full texts are involved, which indicates that full texts are more informative than titles and abstracts alone. The baseline model performs fairly well on precision, but with a trade-off in recall. One possible reason is that the frequency of each MeSH label is quite skewed: some labels have very few training examples, so those rare labels are very hard for the baseline model to predict.

5. Conclusion and Future Work
We present MeSHup, a new, publicly available full text dataset annotated for MeSH indexing. It is a mashup of full-text information from BioC-PMC and associated metadata collected from the MEDLINE database. This is the first large dataset containing full text information, which allows the research community to incorporate textual information beyond the title and abstract in building MeSH indexing systems. We also train an end-to-end model that combines features extracted from the document itself with features obtained from the labels. We think that the MeSHup dataset could be a valuable resource not only for MeSH indexing but also for full text mining and retrieval. Since our analysis covers several but not all sections of the full text articles, it is likely that other parts of the article, together with metadata, may also affect future outcomes. In future work, we plan to involve more sections of the textual information and metadata to improve the automatic indexing system.

Acknowledgements
We thank all reviewers for their constructive comments and feedback. Resources used in preparing this research were provided, in part, by Compute Ontario (https://www.computeontario.ca), Compute Canada (https://www.computecanada.ca), the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute (https://www.vectorinstitute.ai/partners). This research is partially funded by the Natural Sciences and Engineering Research Council of Canada (NSERC) through a Discovery Grant to R. E. Mercer. F. Rudzicz is supported by a CIFAR Chair in AI.

Appendix: A Complete Version of A Data Sample in the Dataset

{"articles":[
    {"PMID":"27976717",
     "TITLE":"Temporal pairwise spike correlations fully capture single-neuron information",
     "ABSTRACT":"To crack the neural code and read out the information neural spikes convey, [...]",
     "INTRO":"Throughout the central nervous system of a mammalian brain [...]",
     "METHODS":"Deriving the correlation theory of neural information [...]",
     "RESULTS":"We are interested in the information contained in a spike train r(t) about a stimulus s(t)[...]",
     "DISCUSS":"The list of spike timing features that have been implicated in neural coding includes [...]",
     "FIG_CAPTIONS":"Dimensionality of neural information coding [...]",
     "TABLE_CAPTIONS":"Parameter sets across neuron models. [...]",
     "JOURNAL":"Nature communications",
     "YEAR":"2016",
     "DOI":"10.1038/ncomms13805",
     "AUTHORS":[
        "Amadeus,Dettner",
        "Sabrina,Munzberg",
        "Tatjana,Tchumatchenko"],
     "MeSH": {
       "D000200":"Action Potentials",
       "D008959":"Models, Neurological",
       "D009474":"Neurons",
       "D059010":"Single-Cell Analysis"
     },
     "CHEMICALS":"None",
     "SUPPLMeSH":"None"
    },
    {
     ...
    },
    ...
]}

6. Bibliographical References

Aronson, A. R. and Lang, F.-M. (2010). An overview of MetaMap: Historical perspective and recent advances. Journal of the American Medical Informatics Association, 17(3):229–236.

Aronson, A., Mork, J. G., Gay, C. W., Humphrey, S., and Rogers, W. J. (2004). The NLM Indexing Initiative's Medical Text Indexer. Studies in Health Technology and Informatics, 107(Pt 1):268–272.

Bai, S., Kolter, J. Z., and Koltun, V. (2018). Convolutional sequence modeling revisited. In Proceedings of the 6th International Conference on Learning Representations (ICLR) Workshop.

Comeau, D. C., Wei, C.-H., Islamaj, D. R., and Lu, Z. (2019). PMC text mining subset in BioC: about 3 million full text articles and growing. Bioinformatics, pages 3533–3535.

Dai, S., You, R., Lu, Z., Huang, X., Mamitsuka, H., and Zhu, S. (2019). FullMeSH: improving large-scale MeSH indexing with full text. Bioinformatics, 36(5):1533–1541.

Demner-Fushman, D. and Mork, J. G. (2015). Extracting characteristics of the study subjects from full-text articles. In Proceedings of the American Medical Informatics Association (AMIA) Annual Symposium, pages 484–491.

Dogan, R. I., Kim, S., Chatr-Aryamontri, A., Chang, C. S., Oughtred, R., Rust, J., Wilbur, W. J., Comeau, D. C., Dolinski, K., and Tyers, M. (2017). The BioC-BioGRID corpus: full text articles annotated for curation of protein-protein and genetic interactions. Database (Oxford).

Gasperin, C., Karamanis, N., and Seal, R. (2007). Annotation of anaphoric relations in biomedical full text articles using a domain-relevant scheme. In Proceedings of the 6th Discourse Anaphora and Anaphor Resolution Colloquium (DAARC 2007), pages 19–24.

Gay, C. W., Kayaalp, M., and Aronson, A. R. (2005). Semi-automatic indexing of full text biomedical articles. AMIA Annual Symposium Proceedings, pages 271–275.

Hirschman, L., Yeh, A., Blaschke, C., and Valencia, A. (2005). Evaluation of BioCreAtIvE assessment of task 2. BMC Bioinformatics, 6(Suppl 1):S16.

Islamaj, R., Leaman, R., Kim, S., Kwon, D., Wei, C.-H., Comeau, D. C., Peng, Y., Cissel, D., Coss, C., and Rob Guzman, C. F., Kochar, P. G., Koppel, S., Trinh, D., Sekiya, K., Ward, J., Whitman, D., Schmidt, S., and Lu, Z. (2021). NLM-Chem, a new resource for chemical entity recognition in PubMed full text literature. Scientific Data, 8(91).

Jimeno-Yepes, A., Plaza, L., Mork, J. G., Aronson, A. R., and Díaz, A. (2012). MeSH indexing based on automatically generated summaries. BMC Bioinformatics, 14:208.

Jin, Q., Dhingra, B., and Cohen, W. W. (2018). AttentionMeSH: Simple, effective and interpretable automatic MeSH indexer. In Proceedings of the 2018 EMNLP Workshop BioASQ: Large-scale Biomedical Semantic Indexing and Question Answering, pages 47–56.

Kalchbrenner, N., Espeholt, L., Simonyan, K., van den Oord, A., Graves, A., and Kavukcuoglu, K. (2017). Neural machine translation in linear time. arXiv preprint arXiv:1610.10099.

Kim, Y. (2014). Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751.

Kingma, D. P. and Ba, J. (2015). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kipf, T. and Welling, M. (2017). Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.

Li, J., Sun, Y., Johnson, R. J., Sciaky, D., Wei, C.-H., Leaman, R., Davis, A. P., Mattingly, C. J., Wiegers, T. C., and Lu, Z. (2015). Annotating chemicals, diseases and their interactions in biomedical literature. In Proceedings of the Fifth BioCreative Challenge Evaluation Workshop, pages 173–182.

Li, Y., Zhang, X., and Chen, D. (2018). CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1091–1100.

Lin, J. and Wilbur, W. J. (2007). PubMed related articles: A probabilistic topic-based model for content similarity. BMC Bioinformatics, 8(1):423.

Lin, J., Su, Q., Yang, P., Ma, S., and Sun, X. (2018). Semantic-unit-based dilated convolution for multi-label text classification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4554–4564.

Liu, K., Peng, S., Wu, J., Zhai, C., Mamitsuka, H., and Zhu, S. (2015). MeSHLabeler: Improving the accuracy of large-scale MeSH indexing by integrating diverse evidence. Bioinformatics, 31(12):i339–i347.

Mork, J. G., Jimeno-Yepes, A., and Aronson, A. R. (2013). The NLM Medical Text Indexer system for indexing biomedical literature. In Proceedings of the First Workshop on Bio-Medical Semantic Indexing and Question Answering (BioASQ). CEUR-WS.org, online http://CEUR-WS.org/Vol-1094/bioasq2013_submission_3.pdf.

Mork, J. G., Aronson, A. R., and Demner-Fushman, D. (2017). 12 years on – Is the NLM medical text indexer still useful and relevant? Journal of Biomedical Semantics, 8(1):8.

Mullenbach, J., Wiegreffe, S., Duke, J., Sun, J., and Eisenstein, J. (2018). Explainable prediction of medical codes from clinical text. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1101–1111, New Orleans, Louisiana. Association for Computational Linguistics.

Pal, A., Selvakumar, M., and Sankarasubbu, M. (2020). Multi-label text classification using attention-based graph neural network. In Proceedings of The 12th International Conference on Agents and Artificial Intelligence (ICAART), pages 494–505.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. (2019). PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, et al., editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc.

Peng, S., You, R., Wang, H., Zhai, C., Mamitsuka, H., and Zhu, S. (2016). DeepMeSH: deep semantic representation for improving large-scale MeSH indexing. Bioinformatics, 32(12):i70–i79.

Peng, H., Li, J., He, Y., Liu, Y., Bao, M., Wang, L., Song, Y., and Yang, Q. (2018). Large-scale hierarchical text classification with recursively regularized deep graph-CNN. In Proceedings of the 2018 World Wide Web Conference.

Tai, K. S., Socher, R., and Manning, C. D. (2015). Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1556–1566.

Tsatsaronis, G., Balikas, G., Malakasiotis, P., Partalas, I., Zschunke, M., Alvers, M. R., Weissenborn, D., Krithara, A., Petridis, S., Polychronopoulos, D., Almirantis, Y., Pavlopoulos, J., Baskiotis, N., Gallinari, P., Artieres, T., Ngonga, A., Heino, N., Gaussier, E., Barrio-Alvers, L., Schroeder, M., Androutsopoulos, I., and Paliouras, G. (2015). An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics, 16:138.

Verspoor, K., Cohen, K. B., Lanfranchi, A., Warner, C., Johnson, H. L., Roeder, C., Choi, J. D., Funk, C., Malenkiy, Y., Eckert, M., Xue, N., Baumgartner Jr., W. A., Bada, M., Palmer, M., and Hunter, L. E. (2012). A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools. BMC Bioinformatics, 13:207.

Wang, X. and Mercer, R. E. (2019). Incorporating figure captions and descriptive text in MeSH term indexing. In Proceedings of the 18th BioNLP Workshop and Shared Task, pages 165–175.

Wang, X., Mercer, R. E., and Rudzicz, F. (2022). KenMeSH: Knowledge-enhanced end-to-end biomedical text labelling. arXiv preprint arXiv:2203.06835.

Xun, G., Jha, K., Yuan, Y., Wang, Y., and Zhang, A. (2019). MeSHProbeNet: A self-attentive probe net for MeSH indexing. Bioinformatics.

Yao, Y., Rosasco, L., and Caponnetto, A. (2007). On early stopping in gradient descent learning. Constructive Approximation, 26:289–315.

Yao, L., Mao, C., and Luo, Y. (2019). Graph convolutional networks for text classification. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, pages 7370–7377.

You, R., Liu, Y., Mamitsuka, H., and Zhu, S. (2020). BERTMeSH: deep contextual representation learning for large-scale high-performance MeSH indexing with full text. Bioinformatics, 37(5):684–692.

Yu, F. and Koltun, V. (2016). Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122.

Zhang, Y., Chen, Q., Yang, Z., Lin, H., and Lu, Z. (2019). BioWordVec, improving biomedical word embeddings with subword information and MeSH. Scientific Data, 6.

7. Language Resource References

Comeau, Donald C and Wei, Chih-Hsuan and Islamaj Doğan, Rezarta and Lu, Zhiyong. (2019). BioC corpus.

William R. Hersh and Chris Buckley and T. J. Leone and David H. Hickam. (1994). OHSUMED: an interactive retrieval evaluation and new large test collection for research.

Jin-Dong Kim and Tomoko Ohta and Yuka Tateisi and Junichi Tsujii. (2003). GENIA corpus.

Martin Krallinger and Obdulia Rabal and Florian Leitner and Miguel Vazquez and David Salgado and Zhiyong Lu and Robert Leaman and Yanan Lu and Dong-Hong Ji and Daniel M. Lowe and Roger A. Sayle and Riza Theresa Batista-Navarro and Rafal Rak and Torsten Huber and Tim Rocktäschel and Sérgio Matos and David Campos and Buzhou Tang and Hua Xu and Tsendsuren Munkhdalai and Keun Ho Ryu and S. V. Ramanan and P. Senthil Nathan and Slavko Zitnik and Marko Bajec and Lutz Weber and Matthias Irmer and Saber Ahmad Akhondi and Jan A. Kors and Shuo Xu and Xin An and Utpal Kumar Sikdar and Asif Ekbal and Masaharu Yoshioka and Thaer M. Dieb and Miji Choi and Karin M. Verspoor and Madian Khabsa and C. Lee Giles and Hongfang Liu and K. E. Ravikumar and Andre Lamurias and Francisco M. Couto and Hong-Jie Dai and Richard Tzong-Han Tsai and C Ata and Tolga Can and Anabel Usie and Rui Alves and Isabel Segura-Bedmar and Paloma Martínez and Julen Oyarzábal and Alfonso Valencia. (2014). The CHEMDNER corpus.