Representation Learning of Documents Driven by Knowledge Resources - IRIT
Representation Learning of Documents
Driven by Knowledge Resources
Lynda Tamine
University Paul Sabatier
Institut de Recherche en Informatique de Toulouse IRIT
e-mail: lynda.lechani@irit.fr
http://www.irit.fr/~Lynda.Tamine-Lechani/
Ontologies, Données et Informatique Médicale (ODIM), Toulouse, May 27th, 2019

Objectives
- Introduce the semantic gap problem in information retrieval (IR)
- Design document representation learning models: combine distributional
semantics and human-established semantics provided by external structured
knowledge resources
- Compare online vs. offline representation learning strategies on IR and
Natural Language Processing (NLP) tasks
- Provide lessons for future representation learning frameworks
Outline
I- The semantic gap problem in IR
II- Representation learning of documents driven by external knowledge resources
-- Online learning strategy
-- Offline learning strategy
III- Empirical evaluation on IR and NLP tasks
IV- Lessons learned and implications
The semantic gap: a longstanding research issue in IR
Anatomy of an IR process
[Figure: information need; query text and document text; generate query representation and generate document representation; query vector and doc. vector; estimate relevance]
Manually designed features: term frequency, term position, length, ...
Relevance estimation:
$$\sum_{w \in q \cap d} \ln\frac{N - df(w) + 0.5}{df(w) + 0.5} \cdot \frac{(k_1 + 1) \times c(w,d)}{k_1\left((1 - b) + b\,\frac{|d|}{avdl}\right) + c(w,d)} \cdot \frac{(k_3 + 1) \times c(w,q)}{k_3 + c(w,q)}$$
Learning to rank (SVM, NN, ...)
Main issues (in query-document matching)
- Lexical gap: aspirin vs. acetylsalicylic acid
- Granularity mismatch: aspirin, anacardic acid vs. salicylates
- Polysemy: bass (fish vs. part of harmony)
Impact on relevance estimate
- Default sense matching between queries and documents
- Low levels of retrieval performance (effectiveness in terms of recall/precision)
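The ranking formula on this slide has the form of the Okapi BM25 scoring function. The sketch below is one way to compute it on tokenized text, and it also illustrates the lexical gap: a query for "aspirin" scores zero against a document that only mentions "acetylsalicylic acid". The function names, the toy corpus, and the parameter defaults (k1 = 1.2, b = 0.75, k3 = 8.0) are assumptions made for illustration; they do not come from the slides.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """Count, for every word, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return df


def bm25_score(query, doc, doc_freq, n_docs, avdl, k1=1.2, b=0.75, k3=8.0):
    """Score one tokenized document against one tokenized query.

    k1, b and k3 are assumed default values, not taken from the slides.
    """
    c_q = Counter(query)  # c(w, q): count of w in the query
    c_d = Counter(doc)    # c(w, d): count of w in the document
    score = 0.0
    # Only words shared by the query and the document contribute.
    for w in c_q.keys() & c_d.keys():
        df = doc_freq[w]
        idf = math.log((n_docs - df + 0.5) / (df + 0.5))
        tf_doc = ((k1 + 1) * c_d[w]) / (k1 * ((1 - b) + b * len(doc) / avdl) + c_d[w])
        tf_query = ((k3 + 1) * c_q[w]) / (k3 + c_q[w])
        score += idf * tf_doc * tf_query
    return score


# Hypothetical toy corpus (already tokenized).
docs = [
    ["aspirin", "reduces", "fever"],
    ["acetylsalicylic", "acid", "reduces", "pain"],
    ["salicylates", "in", "plants"],
    ["bass", "fishing", "guide"],
    ["upper", "endoscopy", "report"],
]
doc_freq = document_frequencies(docs)
avdl = sum(len(d) for d in docs) / len(docs)

query = ["aspirin"]
scores = [bm25_score(query, d, doc_freq, len(docs), avdl) for d in docs]
print(scores)
```

Because the score is a sum over the words shared by query and document, the second document receives no credit at all even though it is about the same substance, which is the lexical gap described on this slide.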
The semantic gap: a longstanding research issue in IR
The semantic gap in medical IR: a review of the TREC medical search track (Edinger et al. 2012)
Task: clinical cohort search
Query: expression of disease/condition sets and treatments or interventions
E.g., "find patients with gastroesophageal reflux disease who had an upper endoscopy"
Documents: de-identified medical visit reports
[Figure panels: false negative; false positive]
Research directions
How to tackle? Hybrid models for document representation learning
[Figure: hybrid IR process combining human-established semantics AND distributional semantics: generate query representation, generate document representation, estimate relevance]
Complementarity                Human-established semantics    Distributional semantics
Lexical gap                    ++                             ++
Granularity mismatch           ++                             +
Polysemy                       +                              -
Word pair relation inference   -                              +
Sense readability              ++                             -
Domain adaptability            -                              +
Research directions
Our approach: (human) Knowledge-enhanced representation learning of documents
Knowledge-enhanced representation learning
Columns: word embedding | concept embedding | document embedding | embeddings learning with relational constraints
Joint learning of embeddings (online learning)
[Liu et al., 2016] X X
[Yu and Dredze, 2014] X X
[Jauhar et al., 2015] X X
[Liu et al., 2018] X X X
[Mancini et al., 2016] X X
[Cheng et al., 2015] X X
[Yamada et al., 2016] X X
Our model X X X X
Retrofitting of embeddings (offline learning)
[Faruqui et al., 2014] X X
[Glavaš and Vulić, 2018] X X
[Mrkšić et al., 2016] X X
[Jauhar et al., 2015] X X
Research directions
Our approach: (human) Knowledge-enhanced representation learning of documents
To the best of our knowledge, no strongly related work (in IR)
Zhang et al. (2018) Neural Information Retrieval: A Literature Review
Extracted from the review paper, Information Retrieval Journal, June 2018, Volume 21, Issue 2–3, pp 107–110
Nguyen et al. (2016) Gia-Hung Nguyen, Lynda Tamine, Laure Soulier, Nathalie Souf: Toward a Deep Neural Approach for Knowledge-Based IR, Workshop on Neural IR, in conjunction with SIGIR'2016
Representation learning of documents
Our general approach: (human) Knowledge-enhanced representation learning of documents
Structured knowledge resources: UMLS, YAGO, WordNet, DBPedia, ...
- Retrofitting document embeddings (offline learning)
Gia-Hung Nguyen, Lynda Tamine, Laure Soulier, Nathalie Souf: Learning Concept-Driven Document Embeddings for Medical Information Search. Artificial Intelligence in Medicine (AIME) 2017: 160-170
- Joint learning of word, concept, document embeddings (online learning)
Gia-Hung Nguyen, Lynda Tamine, Laure Soulier, Nathalie Souf: A Tri-Partite Neural Document Language Model for Semantic Information Retrieval. Extended Semantic Web Conference (ESWC) 2018: 445-461
Representation learning of documents
Offline representation learning of documents: driving idea and illustration in the medical domain
Representation learning of documents
Representation learning driven by relational semantics: enhance the readability of the learning outcomes
Relational constraints:
C1: Constrain the distributional learning model towards better revealing paradigmatic word-word relations, based on the word-concept relations established in the knowledge resource.
C2: Favour the learning of syntagmatic similarity relations between words linked to related concepts in the knowledge resource through concept-concept relations.
Goal: bring the vector representations of words/concepts that are related in the knowledge resource close to each other.
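To make the idea of pulling related vectors together concrete, the following is a minimal sketch of a retrofitting update in the spirit of Faruqui et al. (2014), which is cited earlier among the offline approaches. It operates on word vectors only and is not the document-level model presented in this talk. The function name, the toy vectors, the neighbour lists, and the weighting (alpha = 1, beta = 1/number of neighbours) are assumptions chosen for illustration.

```python
import math


def retrofit(vectors, neighbours, n_iters=10, alpha=1.0):
    """Return new vectors pulled towards their knowledge-resource neighbours.

    vectors:    dict word -> list of floats (the pre-trained embeddings)
    neighbours: dict word -> list of related words in the knowledge resource
    Each update is a weighted average of the original vector (weight alpha)
    and the current neighbour vectors (weight 1/len(neighbours) each).
    """
    new = {w: list(v) for w, v in vectors.items()}
    for _ in range(n_iters):
        for w, nbrs in neighbours.items():
            nbrs = [n for n in nbrs if n in new]
            if w not in new or not nbrs:
                continue  # words without usable neighbours keep their vector
            beta = 1.0 / len(nbrs)
            total = [alpha * x for x in vectors[w]]
            for n in nbrs:
                for i, x in enumerate(new[n]):
                    total[i] += beta * x
            new[w] = [t / (alpha + beta * len(nbrs)) for t in total]
    return new


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical 2-dimensional embeddings and one synonymy link.
vectors = {
    "aspirin": [1.0, 0.0],
    "acetylsalicylic_acid": [0.0, 1.0],
    "bass": [-1.0, 0.0],
}
neighbours = {
    "aspirin": ["acetylsalicylic_acid"],
    "acetylsalicylic_acid": ["aspirin"],
}

retrofitted = retrofit(vectors, neighbours)
before = euclidean(vectors["aspirin"], vectors["acetylsalicylic_acid"])
after = euclidean(retrofitted["aspirin"], retrofitted["acetylsalicylic_acid"])
print(before, after)
```

The two linked terms end up closer than they started, while "bass", which has no link in this toy resource, keeps its original vector. The alpha term keeps each vector anchored to its distributional starting point, so the knowledge resource adjusts the space rather than replacing it.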
Representation learning of documents
Representation learning driven by relational semantics: enhance the readability of learning outcomes
- Retrofitting document embeddings (offline learning)
- Joint learning of word, concept, document embeddings (online learning)
Structured knowledge resources: UMLS, YAGO, WordNet, DBPedia, ...
Experimental validation
Experimental setup
Experimental validation
Results: comparing the quality of document embeddings learned with offline vs. online learning strategies
Main general observations and trends:
O1. Offline models generally achieve better results on NLP similarity tasks than on NLP classification tasks
O2. Offline models are more effective on NLP tasks in general domains, while online models are more effective in the medical domain
O3. Offline and online models behave similarly on IR tasks, and both are more effective on medical search tasks
Experimental validation
Results: Focus on query expansion in offline learning
Main observation:
O4. Query expansion in offline models is more effective for medical search tasks
Ohsumed 07
Original query: young woman with lactase deficiency
Expanded query: young woman with lactase deficiency fibrosis abscess
Ohsumed 35
Original query: 26 yo female with bulimia
Expanded query: 26 yo female with bulimia hypertension failure
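The slide does not state how expansion terms are selected. As a purely illustrative sketch, one common embedding-based strategy is to add the vocabulary terms whose vectors are closest (by cosine similarity) to the centroid of the query terms. Everything below is an assumption: the function names, the selection rule, and the three-dimensional toy vectors, which were hand-picked so that the output mirrors the Ohsumed 07 example and carry no empirical meaning.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_query(query, embeddings, k=2):
    """Append the k vocabulary terms closest to the centroid of the query terms."""
    known = [embeddings[w] for w in query if w in embeddings]
    if not known:
        return list(query)  # nothing to build a centroid from
    dim = len(known[0])
    centroid = [sum(v[i] for v in known) / len(known) for i in range(dim)]
    candidates = [w for w in embeddings if w not in query]
    ranked = sorted(candidates,
                    key=lambda w: cosine(centroid, embeddings[w]),
                    reverse=True)
    return list(query) + ranked[:k]


# Hypothetical toy embeddings (hand-made values, not learned).
embeddings = {
    "lactase": [0.9, 0.1, 0.0],
    "deficiency": [0.8, 0.2, 0.0],
    "fibrosis": [0.7, 0.3, 0.1],
    "abscess": [0.6, 0.4, 0.1],
    "guitar": [0.0, 0.1, 0.9],
}

query = ["young", "woman", "with", "lactase", "deficiency"]
expanded = expand_query(query, embeddings, k=2)
print(" ".join(expanded))
```

Query words without an embedding ("young", "woman", "with" in this toy vocabulary) are simply ignored when the centroid is computed, so the expansion is driven by the terms the embedding space actually knows.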
Experimental validation
Results: measuring the impact of considering relational constraints in both offline vs. online learning
Main observations and trends:
O5. Considering relational constraints improves effectiveness in both NLP and IR tasks
O6. Considering both word and concept relations is more effective than considering a single type of relation
[Figure panels: document re-ranking; query expansion]
Conclusion and Implications
Lessons learned
• The performance of both online and offline learning models is
task- and domain-dependent
o Online models are more effective in identifying similarity signals
o Both online and offline models are effective in identifying relevance
signals
o Both offline and online models are more effective in medical IR tasks
than in general-domain search tasks. This is reversed in the case of NLP
tasks
• Relational knowledge is useful for driving the distributional
learning in both NLP and IR tasks
o Constraining the learning with relational knowledge is effective in both
NLP and IR tasks. The learning benefits from both word-word relations
and concept-concept relations
Conclusion and Implications
Pending issues and perspectives
• (Main) Pending issues
o Robustness of the models: significant performance variation depending
on multiple factors (knowledge resource, task, annotation quality, etc.)
o Transfer the learning to new senses: particularly challenging in
specialized domains and/or with low-resource languages. Cross-domain
performance is important in IR
• Perspectives
o Consider the relation types in the learning objective to better align the
vector representations with the knowledge resource
o Constrain the learning with multiple (heterogeneous) structured
knowledge resources
Lynda Tamine, IRIT, UPS
Laure Soulier, LIP6, Sorbonne Université
Gia Nguyen, IRIT, UPS
Nathalie Souf, IRIT, UPS
Thank You