Adapting Linguistic Tools for the Analysis of Italian Medical Records

 
DOI: 10.12871/CLICIT201414


Giuseppe Attardi, Vittoria Cozza, Daniele Sartiano
Dipartimento di Informatica, Università di Pisa
Largo B. Pontecorvo, 3
attardi@di.unipi.it, cozza@di.unipi.it, sartiano@di.unipi.it

Abstract

English. We address the problem of recognition of medical entities in clinical records written in Italian. We report on experiments performed on medical data in English provided in the shared tasks at CLEF-ER 2013 and SemEval 2014. This allowed us to refine Named Entity recognition techniques to deal with the specifics of medical and clinical language in particular. We present two approaches for transferring these techniques to Italian. One solution relies on the creation of an Italian corpus of annotated clinical records, the other on adapting existing linguistic tools to the medical domain.

Italiano. Questo lavoro affronta il problema del riconoscimento di entità mediche in referti medici in lingua italiana. Riferiamo su degli esperimenti svolti su testi medici in inglese forniti nei task di CLEF-ER 2013 e SemEval 2014. Questi ci hanno consentito di raffinare le tecniche di Named Entity recognition per trattare le specificità del linguaggio medico e in particolare quello dei referti clinici. Presentiamo due approcci al trasferimento di queste tecniche all'italiano. Una soluzione consiste nella creazione di un corpus di referti medici in italiano annotato con entità mediche e l'altra nell'adattare strumenti tradizionali per l'analisi linguistica al dominio medico.

1    Introduction

One of the objectives of the RIS project (RIS 2014) is to develop tools and techniques to help identify patients at risk of their disease evolving into a chronic condition. The study relies on a sample of patient data consisting of both medical test reports and clinical records. We are interested in verifying whether text analytics, i.e. information extracted from natural language texts, can supplement or improve the information extracted from the more structured data available in the medical test records.

Clinical records are expressed as plain text in natural language and contain mentions of diseases or symptoms affecting a patient, whose accurate identification is crucial for any further text mining process.

Our task in the project is to provide a set of NLP tools for automatically extracting information from medical reports in Italian. We face the double challenge of adapting NLP tools to the medical domain and of handling documents in a language (Italian) for which few linguistic resources are available.

Our approach to information extraction exploits both supervised machine-learning tools, which require annotated training corpora, and unsupervised deep learning techniques, in order to leverage unlabeled data.

To deal with the lack of annotated Italian resources for the bio-medical domain, we attempted to create a silver corpus with a semi-automatic approach that uses both machine translation and dictionary-based techniques. The corpus will be validated through crowdsourcing.

2    Medical Training Corpus

Italian corpora annotated with mentions of medical terms are currently not easily available. Hence we decided to create a corpus of Italian medical reports (IMR), annotated with medical mentions, and to make it available on demand.

The corpus consists of 10,000 sentences, extracted from a collection of 23,695 clinical records of various types, including discharge summaries, diagnoses, and medical test reports.

The annotation process consists of two steps: creating a silver corpus using automated tools
and then turning the corpus into a gold one by manually correcting the annotations.

For building the silver corpus we used:
- a NER tagger trained on a silver English resource translated to Italian;
- a dictionary-based entity recognition approach.

For converting the silver corpus into a gold one, validation by medical experts is required. We organized a crowdsourcing campaign, for which we are recruiting volunteers to whom we will assign micro annotation tasks. Special care will be taken to ensure the reliability of the collected answers.

2.1   Translation based approach

The CLEF-ER 2013 challenge (Rebholz-Schuhmann et al., 2010) aimed at the identification of mentions in bio-medical texts in various languages, starting from an annotated resource in English, and at assigning to them a concept unique identifier (CUI) from the UMLS thesaurus (Bodenreider, 2004). UMLS combines several multilingual medical resources, including Italian terminology from MedDRA Italian (MDRITA15_1) and MeSH Italian (MSHITA2013), bridged through their CUIs to their English counterparts.

The organizers provided a silver standard corpus (SSC) in English, consisting of 364,005 sentences extracted from the EMEA corpus, which had been automatically annotated by combining the outputs of several Named Entity taggers (Rebholz-Schuhmann et al., 2010).

In (Attardi et al., 2013) we proposed a solution for annotating Spanish bio-medical texts, starting from the SSC English resource. Our approach combined machine translation and NER techniques and consists of the following steps:

1. Phrase-based statistical machine translation is applied to the SSC in order to obtain a corresponding annotated corpus in the target language. A mapping between mentions in the original and the corresponding ones in the translation is preserved, so that the CUIs from the original can be transferred to the translation. This produces a Bronze Standard Corpus (BSC) in the target language. A dictionary of entities is also created, which associates to each pair (entity text, semantic group) the corresponding CUIs that appeared in the SSC.
2. The BSC is used to train a model for a Named Entity tagger, capable of assigning semantic groups to mentions.
3. The model built at step 2 is used for tagging entities in sentences in the target language.
4. The annotated document is enriched by adding CUIs to each entity, looking up the pair (entity, group) in the dictionary of CUIs built at step 1, as sketched below.
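The following minimal sketch illustrates the CUI enrichment of step 4. The dictionary layout (a mapping from lowercased entity text and semantic group to a list of CUIs) and all names are illustrative assumptions, not the exact format used in our pipeline.

```python
# Sketch of step 4: enriching tagged entities with CUIs via the dictionary
# built in step 1. Data layout and names are illustrative.

def enrich_with_cuis(entities, cui_dict):
    """entities: list of dicts with 'text' and 'group';
    cui_dict: maps (lowercased entity text, semantic group) -> list of CUIs."""
    for entity in entities:
        key = (entity["text"].lower(), entity["group"])
        entity["cuis"] = cui_dict.get(key, [])
    return entities

# Toy usage with made-up data.
cui_dict = {("infarto miocardico", "DISO"): ["C0027051"]}
entities = [{"text": "infarto miocardico", "group": "DISO"}]
print(enrich_with_cuis(entities, cui_dict))
```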
For machine translation we trained Moses (Koehn, 2007) using a biomedical parallel corpus consisting of EMEA and Medline, with the Spanish Wikipedia for the language model.

In task A of the challenge, on mention identification, our submission achieved the best score for the categories disease, anatomical part, living being and drugs, with scores ranging between 91.5% and 97.4% (Rebholz-Schuhmann et al., 2013). In task B, on CUI identification, the scores were however much lower.

As NE tagger we used the Tanl NER (Attardi et al., 2009), a generic sequence tagger based on a Maximum Entropy Markov Model, which uses a rich feature set, customizable by providing a feature model. Such taggers perform quite well on newswire documents, where capitalization features are quite helpful in identifying people, organizations and locations. With a proper feature model we were able to achieve satisfactory results also for the medical domain.

Adapting the CLEF-ER approach to Italian required repeating the translation step with an en-it parallel corpus, consisting of EMEA and UMLS for the medical domain and Europarl and JRC-Acquis (EUROPARL, 2014; JRC-Acquis, 2014) for more general domains.

A NE tagger for Italian was then trained on the translated silver corpus.

Due to the lack of annotated Italian medical texts, we could not validate the resulting tagger. Manual inspection supports the hypothesis that its accuracy is similar to that of the Spanish version, given that the major difference in the process is the translation corpus and that Spanish and Italian are similar languages.

2.2   Dictionary based approach

Since the terminology for entities in medical records is fairly restricted, another method for identifying mentions in the IMR corpus is to use a dictionary. We produced an Italian medical thesaurus by merging:
- over 70,000 definitions of treatments and diagnoses from the ICD-9-CM terminology;
- about 22,000 definitions from the SNOMED semantic group "Symptoms and Signs, Disease and Anatomical part" in the UMLS;
- over 2,600 active ingredients and drugs from the "Lista dei Farmaci" (AIFA, 2014).

We identified mentions in the IMR corpus using two techniques: n-gram based and parser based.
The IMR text is first preprocessed with the Tanl pipeline, which performs sentence splitting, tokenization, POS tagging and lemma extraction.

To ease matching, the text is normalized by lowercasing each word.

The n-gram based technique tries to match each n-gram (with n between 1 and 10) in the corpus against entries in the thesaurus in two ways: with a lemma match and with an approximate match. Approximate matching involves removing prepositions, punctuation and articles from both n-grams and entries and then performing an exact match.
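A minimal sketch of this n-gram lookup follows. It assumes the thesaurus has been pre-normalized into two sets of strings (one of lemmatized entries, one of approximate forms) and that tokens and lemmas come from the Tanl pipeline; the stop-word list and all names are illustrative.

```python
import string

STOP_WORDS = {"di", "a", "da", "in", "con", "su", "per", "il", "lo", "la",
              "i", "gli", "le", "un", "uno", "una"}

def approx_key(words):
    """Approximate-match key: lowercase, drop stop words and punctuation."""
    kept = [w.lower() for w in words
            if w.lower() not in STOP_WORDS and w not in string.punctuation]
    return " ".join(kept)

def match_ngrams(tokens, lemmas, lemma_entries, approx_entries, max_n=10):
    """Return (start, end) spans whose n-gram matches a thesaurus entry."""
    spans = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            lemma_ngram = " ".join(l.lower() for l in lemmas[i:i + n])
            if (lemma_ngram in lemma_entries
                    or approx_key(tokens[i:i + n]) in approx_entries):
                spans.append((i, i + n))
    return spans

tokens = ["reumatismo", "in", "giovane", "età"]
lemmas = ["reumatismo", "in", "giovane", "età"]
print(match_ngrams(tokens, lemmas, set(), {"reumatismo giovane età"}))  # [(0, 4)]
```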
Parser-based matching also enables dealing with some kinds of discontiguous mentions, i.e. entity mentions interleaved with modifiers such as adjectives, verbs or adverbs. The matching process involves parsing each sentence in the IMR with the DeSR parser (Attardi et al., 2009), selecting noun phrases corresponding to subtrees whose root is a noun and which consist of certain patterns of nouns, adjectives and prepositions, and finally searching for these noun phrases in the thesaurus.
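The sketch below illustrates the subtree selection on a toy dependency parse. The token representation and the coarse POS tags (S for nouns, A for adjectives, E for prepositions, in Tanl style) are simplifying assumptions, not the actual DeSR output format.

```python
# A token is (form, pos, head_index); indices are 1-based, head 0 = root.

def noun_phrase_candidates(sentence):
    """Yield phrases whose subtree root is a noun and whose nodes are
    nouns, adjectives or prepositions."""
    allowed = {"S", "A", "E"}              # noun, adjective, preposition (assumed tags)
    children = {i: [] for i in range(len(sentence) + 1)}
    for idx, (_, _, head) in enumerate(sentence, start=1):
        children[head].append(idx)

    def subtree(idx):
        nodes = [idx]
        for child in children[idx]:
            nodes.extend(subtree(child))
        return sorted(nodes)

    for idx, (form, pos, _) in enumerate(sentence, start=1):
        if pos == "S":                     # noun root
            nodes = subtree(idx)
            if all(sentence[i - 1][1] in allowed for i in nodes):
                yield " ".join(sentence[i - 1][0] for i in nodes)

sent = [("reumatismo", "S", 0), ("acuto", "A", 1)]
print(list(noun_phrase_candidates(sent)))  # ['reumatismo acuto']
```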
Figure 1: A parsed sentence from the IMR.

Figure 1 shows a sample sentence and its parse tree, from which the following noun phrases are identified: reumatismo acuto, reumatismo articolare, reumatismo in giovane età, giovane età. Among these, reumatismo acuto is recognized as an ICD9 disease.

Overall, we were able to identify over 100,000 entities in the IMR corpus by means of the dictionary approach. The process is expected to guarantee high precision: manual inspection of 4,000 sentences detected a precision of 96%.

Besides the entities recognized through the thesaurus, we also annotated temporal expressions with a version of HeidelTime (Strotgen and Gertz, 2013) adapted to Italian (Attardi and Baronti, 2014).

3    NE Recognition

We split the analysis of medical records into two steps: recognition of mentions and assignment of unique identifiers (CUIs) to those mentions.

For the first step we explored using or adapting traditional NER techniques. We performed experiments in the context of the SemEval 2014 Task 7, Analysis of Clinical Text, where we could try these techniques on suitable training and test data, even though in English rather than Italian (Attardi et al., 2014). The training set consisted of 199 notes with 5,816 annotations, the development set of 99 notes with 5,351 annotations.

We tested several NER tools: the Tanl NER, the Stanford NER (CRF-NER, 2014) and a Deep Learning NER (NLPNET, 2014), which we developed based on the SENNA architecture (SENNA, 2011). While the first two taggers rely on rich feature sets and supervised learning, the Deep Learning tagger uses almost no features and relies on word embeddings, learned through an unsupervised process from unannotated texts, following the approach of Collobert et al. (2011).

We created an unannotated corpus (UC) combining 100,000 terms from the English Wikipedia and 30,000 additional terms from a subset of unannotated medical texts from the MIMIC corpus (Moody and Mark, 1996). The word embeddings for the UC are computed by training a deep learning architecture initialized with the values provided by Al-Rfou' et al. (2013) for the English Wikipedia and with random values for the medical terms.
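A minimal sketch of this initialization is given below, assuming the pre-trained vectors are available as a word-to-vector mapping; the dimension, value range and names are illustrative.

```python
import numpy as np

def init_embeddings(vocab, pretrained, dim=64, seed=42):
    """Pre-trained vectors for known words, small random vectors otherwise."""
    rng = np.random.default_rng(seed)
    matrix = np.empty((len(vocab), dim), dtype=np.float32)
    for row, word in enumerate(vocab):
        if word in pretrained:
            matrix[row] = pretrained[word]        # e.g. a Polyglot Wikipedia vector
        else:
            matrix[row] = rng.uniform(-0.1, 0.1, dim)  # unseen medical term
    return matrix

pretrained = {"fever": np.zeros(64, dtype=np.float32)}
vocab = ["fever", "hepatosplenomegaly"]
print(init_embeddings(vocab, pretrained).shape)  # (2, 64)
```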
                                                          We obtained the best accuracy with word2vec
3    NE Recognition                                       using a set of 2,000 clusters.
We split the analysis of medical records into two         3.1   Conversion to IOB format
steps: recognition of mentions and assignment of
unique identifiers (CUI) to those mentions.               Before applying NE tagging, we had to convert
                                                          the medical records into the IOB format used by

                                                     19
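The sketch below shows how word-to-cluster maps of this kind could be built from embedding vectors. DBSCAN mirrors the scikit-learn setup mentioned above; KMeans stands in for the k-means clustering that the word2vec tool performs with its -classes option, which is an assumption about how the 2,000 word2vec clusters were obtained. The eps and min_samples values are illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

def cluster_vocabulary(words, vectors, n_kmeans_clusters=2000):
    """Return two word -> cluster-id maps usable as token features."""
    dbscan_ids = DBSCAN(eps=0.5, min_samples=5).fit_predict(vectors)
    kmeans_ids = KMeans(n_clusters=min(n_kmeans_clusters, len(words)),
                        n_init=10, random_state=0).fit_predict(vectors)
    return ({w: int(c) for w, c in zip(words, dbscan_ids)},
            {w: int(c) for w, c in zip(words, kmeans_ids)})

words = ["febbre", "tosse", "aspirina", "paracetamolo"]
vectors = np.random.RandomState(0).rand(4, 8)   # toy embeddings
dbscan_map, kmeans_map = cluster_vocabulary(words, vectors)
print(kmeans_map)   # cluster ids become features of the surrounding tokens
```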
3.1   Conversion to IOB format

Before applying NE tagging, we had to convert the medical records into the IOB format used by most NE taggers. The conversion is not straightforward, since clinical reports contain discontiguous and overlapping mentions. For example, in:

      Abdomen is soft, nontender, non-distended, negative bruits

there are two mentions: Abdomen nontender and Abdomen bruits.

The IOB format does not allow either discontinuity or overlaps. We tested two conversions. The first replicates a sentence, with each copy containing a single mention from a set of overlapping ones. The second approach consists in using additional tags for disjoint and shared mentions (Tang et al., 2013): DISO for contiguous mentions; DDISO for disjoint entity words that are not shared by multiple mentions; HDISO for head words that belong to more than one disjoint mention.
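The sketch below illustrates this tag scheme on the example sentence above. Mentions are given as sets of token indices; the B-/I- prefixing and the tie-breaking details of Tang et al. (2013) are simplified assumptions.

```python
def to_disjoint_tags(tokens, mentions):
    """Flatten possibly overlapping/discontiguous mentions into one tag sequence."""
    contiguous = [m for m in mentions if max(m) - min(m) + 1 == len(m)]
    disjoint = [m for m in mentions if m not in contiguous]
    counts = {}                          # how many disjoint mentions use each token
    for m in disjoint:
        for i in m:
            counts[i] = counts.get(i, 0) + 1

    tags = ["O"] * len(tokens)
    for m in contiguous:
        for i in sorted(m):
            tags[i] = "DISO"
    for i, c in counts.items():
        tags[i] = "HDISO" if c > 1 else "DDISO"
    return tags

tokens = "Abdomen is soft , nontender , non-distended , negative bruits".split()
mentions = [{0, 4}, {0, 9}]              # "Abdomen nontender", "Abdomen bruits"
print(to_disjoint_tags(tokens, mentions))
# ['HDISO', 'O', 'O', 'O', 'DDISO', 'O', 'O', 'O', 'O', 'DDISO']
```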
We tested the accuracy of various NE taggers on the SemEval development set. The results are reported in Table 1. Results marked with discont were obtained with the additional tags for discontiguous and overlapping mentions.

   NER                 Precision   Recall   F-score
   Tanl                  80.41      65.08    71.94
   Tanl+dbscan           80.43      64.48    71.58
   Tanl+word2vec         79.70      67.44    73.06
   Nlpnet                80.29      62.51    70.29
   Stanford              80.30      64.89    71.78
   CRFsuite              79.69      61.97    69.72
   Tanl discont          78.57      65.35    71.35
   Nlpnet discont        77.37      63.76    69.61
   Stanford discont      80.21      62.79    70.44

      Table 1: Accuracy on the development set.

3.2   SemEval 2014 NER for clinical text

Task 7 of SemEval 2014 allowed us to test NE tagging techniques on medical records and to adapt them to the task. Notably, only one class of entities, namely diseases, is present in the corpus.

We dealt with overlapping mentions by converting the annotations as described above. Discontiguous mentions were dealt with in two steps: the first step identifies contiguous portions of a mention with a traditional sequence labeler; then the separate portions are combined into a full mention with guidance from a Maximum Entropy classifier (Berger et al., 1996), trained to recognize which pairs belong to the same mention. The training set consists of all pairs of terms within a document annotated as disorders. A positive instance is created if the terms belong to the same mention, a negative one otherwise.

The classifier was trained using a binned distance feature and dictionary features, extracted for each pair of words in the training set.
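A minimal sketch of this pairing step is shown below, using logistic regression as the maximum-entropy classifier. The feature set (a binned token distance plus a dictionary flag on the concatenated text) and all names are simplified illustrations of the features described above.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def pair_features(span_a, span_b, disease_dict):
    """Features for deciding whether two disorder portions form one mention."""
    distance = abs(span_b["start"] - span_a["end"])
    return {
        "dist_bin": "0-3" if distance <= 3 else "4-10" if distance <= 10 else ">10",
        "joined_in_dict": (span_a["text"] + " " + span_b["text"]).lower() in disease_dict,
    }

disease_dict = {"abdomen nontender", "abdomen bruits"}
pairs = [({"text": "Abdomen", "end": 1}, {"text": "nontender", "start": 4}, 1),
         ({"text": "soft", "end": 3},    {"text": "bruits", "start": 9},    0)]

vec = DictVectorizer()
X = vec.fit_transform([pair_features(a, b, disease_dict) for a, b, _ in pairs])
y = [label for _, _, label in pairs]
clf = LogisticRegression().fit(X, y)
print(clf.predict(X))
```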
For mapping entities to CUIs we applied fuzzy matching (Fraser, 2011) between the extracted mentions and the textual descriptions of entities present in a set of UMLS disorders. The CUI of the match with the highest score is chosen.
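The sketch below conveys the idea of this lookup. Our system used the diff-match-patch library (Fraser, 2011); here difflib's SequenceMatcher is used as a readily available stand-in with the same intent, and the toy CUIs and descriptions are illustrative.

```python
import difflib

def best_cui(mention, umls_disorders):
    """umls_disorders: dict mapping CUI -> textual description.
    Returns the CUI whose description is most similar to the mention."""
    scored = ((difflib.SequenceMatcher(None, mention.lower(), desc.lower()).ratio(), cui)
              for cui, desc in umls_disorders.items())
    score, cui = max(scored)
    return cui, score

umls_disorders = {"C0235295": "abdominal tenderness", "C0000737": "abdominal pain"}
print(best_cui("abdomen nontender", umls_disorders))
```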
Our submission achieved accuracy comparable to the best submissions based on a single-system approach (Pradhan et al., 2014), with an F-score of 0.65 for Task A and 0.83 for Task A relaxed. For Task B and Task B relaxed the accuracies were 0.46 and 0.70 respectively. Better results were achieved by submissions that used an ensemble of taggers.

We also attempted to combine the outputs of the Tanl NER (with word2vec cluster features), the Nlpnet NER and the Stanford NER in several ways. The best results were obtained by a simple one-vote approach, taking the union of all annotations. The results of the evaluation, for both the multiple-copies and the discont annotation styles, are shown in Table 2:

   NER                   Precision   Recall   F-score
   Agreement multiple      73.96      73.68    73.82
   Agreement discont       81.69      65.85    72.92

   Table 2: Accuracy of NER system combination.
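The one-vote combination amounts to taking the union of the mention spans produced by the individual taggers, as in the sketch below; the span representation and the handling of overlapping spans from different taggers are illustrative assumptions.

```python
def union_combination(*tagger_outputs):
    """Merge the annotations of several taggers by simple union of spans."""
    combined = set()
    for spans in tagger_outputs:
        combined.update(spans)          # spans: iterable of (start, end, label)
    return sorted(combined)

tanl = [(0, 2, "DISORDER")]
nlpnet = [(0, 2, "DISORDER"), (5, 6, "DISORDER")]
stanford = [(5, 6, "DISORDER")]
print(union_combination(tanl, nlpnet, stanford))
# [(0, 2, 'DISORDER'), (5, 6, 'DISORDER')]
```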

4    Conclusions

We presented a series of experiments on bio-medical texts from both medical literature and clinical records, in multiple languages, which helped us refine NE recognition techniques and adapt them to Italian. We explored supervised techniques as well as unsupervised ones, in the form of word embeddings and word clusters. We also developed a Deep Learning NE tagger that exploits embeddings. The best results were achieved by a MEMM sequence labeler using clusters as features, further improved by an ensemble combination with other NE taggers.

As a further contribution of our work, we produced, by exploiting semi-automated techniques, an Italian corpus of medical records annotated with mentions of medical terms.

Acknowledgements

Partial support for this work was provided by project RIS (POR RIS of the Regione Toscana, CUP n° 6408.30122011.026000160).
References

AIFA open data. 2014. Retrieved from: http://www.agenziafarmaco.gov.it/it/content/dati-sulle-liste-dei-farmaci-open-data

Rami Al-Rfou', Bryan Perozzi, and Steven Skiena. 2013. Polyglot: Distributed Word Representations for Multilingual NLP. In Proc. of the Conference on Computational Natural Language Learning, CoNLL 2013, pp. 183-192, Sofia, Bulgaria.

Giuseppe Attardi et al. 2009. Tanl (Text Analytics and Natural Language Processing). SemaWiki project: http://medialab.di.unipi.it/wiki/SemaWiki

Giuseppe Attardi et al. 2009. The Tanl Named Entity Recognizer at Evalita 2009. In Proc. of Workshop Evalita'09 - Evaluation of NLP and Speech Tools for Italian, Reggio Emilia, ISBN 978-88-903581-1-1.

Giuseppe Attardi, Felice Dell'Orletta, Maria Simi and Joseph Turian. 2009. Accurate Dependency Parsing with a Stacked Multilayer Perceptron. In Proc. of Workshop Evalita'09 - Evaluation of NLP and Speech Tools for Italian, Reggio Emilia, ISBN 978-88-903581-1-1.

Giuseppe Attardi, Andrea Buzzelli and Daniele Sartiano. 2013. Machine Translation for Entity Recognition across Languages in Biomedical Documents. In Proc. of the CLEF-ER 2013 Workshop, September 23-26, Valencia, Spain.

Giuseppe Attardi, Vittoria Cozza and Daniele Sartiano. 2014. UniPi: Recognition of Mentions of Disorders in Clinical Text. In Proc. of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pp. 754-760.

Giuseppe Attardi and Luca Baronti. 2014. Experiments in Identification of Temporal Expressions in Evalita 2014. In Proc. of Evalita 2014.

Adam L. Berger, Vincent J. Della Pietra, and Stephen A. Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1), March 1996, 39-71.

Olivier Bodenreider. 2004. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research, vol. 32, suppl. 1, D267-D270.

Ronan Collobert et al. 2011. Natural Language Processing (Almost) from Scratch. Journal of Machine Learning Research, 12, pp. 2461-2505.

CRF-NER. 2014. Retrieved from: http://nlp.stanford.edu/software/CRF-NER.shtml

EUROPARL. European Parliament Proceedings Parallel Corpus 1996-2011. 2014. http://www.statmt.org/europarl/

Jenny Rose Finkel, Trond Grenager and Christopher D. Manning. 2005. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. In Proc. of the 43rd Annual Meeting of the ACL, 2005, pp. 363-370.

Neil Fraser. 2011. Diff, Match and Patch libraries for Plain Text.

JRC-Acquis Multilingual Parallel Corpus, Version 2.2. 2014. Retrieved from: http://optima.jrc.it/Acquis/index_2.2.html

Philipp Koehn et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proc. of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions. ACL.

MDRITA15_1. 2012. Medical Dictionary for Regulatory Activities Terminology (MedDRA), Version 15.1, Italian Edition. MedDRA MSSO, September 2012.

MSHITA2013. 2013. Italian translation of Medical Subject Headings. Istituto Superiore di Sanità, Settore Documentazione, Roma, Italy.

Tomas Mikolov, Kai Chen, Greg S. Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. In Proc. of Workshop at ICLR.

George B. Moody and Roger G. Mark. 1996. A Database to Support Development and Evaluation of Intelligent Intensive Care Monitoring. Computers in Cardiology 23:657-660.

NLPNET. 2014. Retrieved from: https://github.com/attardi/nlpnet/

Sameer Pradhan et al. 2014. SemEval-2014 Task 7: Analysis of Clinical Text. In Proc. of the 8th International Workshop on Semantic Evaluation (SemEval 2014), August 2014, Dublin, Ireland, pp. 54-62.

Dietrich Rebholz-Schuhmann et al. 2010. The CALBC Silver Standard Corpus for Biomedical Named Entities - A Study in Harmonizing the Contributions from Four Independent Named Entity Taggers. In Proc. of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta.

Dietrich Rebholz-Schuhmann et al. 2013. Entity Recognition in Parallel Multi-lingual Biomedical Corpora: The CLEF-ER Laboratory Overview. Lecture Notes in Computer Science, Vol. 8138, 353-367.

RIS: Ricerca e innovazione nella sanità. 2014. POR RIS of the Regione Toscana. Homepage: http://progetto-ris.it/

SCIKIT. 2014. Retrieved from: http://scikit-learn.org/

SENNA. Semantic/syntactic Extraction using a Neural Network Architecture. 2011. Retrieved from: http://ml.nec-labs.com/senna/

Jannik Strotgen and Michael Gertz. 2013. Multilingual and Cross-domain Temporal Tagging. Language Resources and Evaluation, June 2013, Volume 47, Issue 2, pp. 269-298.

Buzhou Tang et al. 2013. Recognizing and Encoding Disorder Concepts in Clinical Text using Machine Learning and Vector Space Model. In Workshop of ShARe/CLEF eHealth Evaluation Lab 2013.

Buzhou Tang et al. 2014. Evaluating Word Representation Features in Biomedical Named Entity Recognition Tasks. BioMed Research International, Volume 2014, Article ID 240403.

Jörg Tiedemann. 2009. News from OPUS - A Collection of Multilingual Parallel Corpora with Tools and Interfaces. In Nicolov et al., eds.: Recent Advances in Natural Language Processing, Volume V, John Benjamins, Amsterdam/Philadelphia, Borovets, Bulgaria, pp. 237-248.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. In Proc. of CoNLL 2003, Edmonton, Canada, 142-147.

WORD2VEC. 2014. Retrieved from: http://code.google.com/p/word2vec/
