WP3 : entity-fishing service - Presented by Tanti Kristanti (INRIA - Paris) For the HIRMEOS Final Workshop 2 June 2019 Marseille, France
WP3 : entity-fishing service
Presented by Tanti Kristanti (INRIA – Paris)
For the HIRMEOS Final Workshop
2 June 2019
Marseille, France
entity-fishing (1)
• An open source tool composed of services that automate entity recognition and
disambiguation against Wikidata 1
• It is not restricted to particular domains, classes of entities or
usages 2
• Initially developed within the FP7 CENDARI (Collaborative European Digital
Archive Infrastructure) project 3
• Development continued within the H2020 HIRMEOS (High Integration of
Research Monographs in the European Open Science Infrastructure) project, to
enrich open access digital monographs published on five digital platforms 4
• Deployed as part of the national infrastructure Huma-Num in France
• A stable online service within the DARIAH-EU infrastructure, the European digital
research infrastructure for the arts and humanities
• Distributed under Apache 2.0 license
1 Science-Miner, Entity disambiguation, http://science-miner.com/entity-disambiguation/, (accessed 6 May 2019)
2 Patrice Lopez, Overview: Motivation, 2019, https://nerd.readthedocs.io/en/latest/overview.html, (accessed 6 May 2019)
3 Patrice Lopez, Alexander Meyer, Laurent Romary. CENDARI Virtual Research Environment & Named Entity Recognition techniques. Grenzen überschreiten – Digitale Geisteswissenschaft heute
und morgen, Feb 2014, Berlin, Germany, https://hal.inria.fr/hal-01577975, (accessed 6 May 2019)
4 OAPEN, End user services: Named Entity Recognition and Disambiguation, http://www.oapen.org/content/services-end-user-services, (accessed 6 May 2019)
entity-fishing (2)
• Current version (0.0.3) supports English, French, German, Italian and Spanish
• Based on machine-learning techniques (Gradient Tree Boosting, CRF, word and
entity embeddings)
• For English and French, Named Entity Recognition based on the CRF tool Grobid-NER is used in combination
with the disambiguation
• Machine learning is implemented with the SMILE library
• Knowledge base contains:
• 37 million entities and 154 million statements from Wikidata
• 15 million word and entity embeddings
• Project repositories: https://github.com/kermitt2/entity-fishing
• Demo: http://nerd.huma-num.fr/nerd/
• Documentation: https://nerd.readthedocs.io/en/latest/
How to use entity-fishing services?
[Figures: query parameter to be sent to the service; response of the service]
• Through REST API
• Service can be applied on 4 types of input 1:
• text
• search query
• weighted vector of terms
• PDF document
• REST query
• POST /disambiguate
• POST /language
• POST /segmentation
• POST /customisations
• GET /kb/concept/{id}
• GET /kb/term/{term}
• GET /language?text={text}
• GET /segmentation?text={text}
• GET /customisations
• GET /customisation/{name}
• PUT /customisation/{profile}
• DELETE /customisation/{profile}
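As a minimal sketch of how a client might address the endpoints listed above, the Python below assembles a text query for POST /disambiguate and builds URLs from the path templates. The base URL, the `/service` prefix, the JSON field names (`text`, `language`, `lang`) and both helper names are assumptions made for illustration; the REST API documentation cited on this slide remains the authority.

```python
import json
from urllib.parse import quote

# Assumed base URL of a locally running instance (hypothetical, not from the slides).
BASE_URL = "http://localhost:8090/service"

# Languages listed on the slides for the current version.
SUPPORTED_LANGUAGES = {"en", "fr", "de", "it", "es"}


def build_text_query(text, lang):
    """Return the JSON string for a text query (field names are assumptions)."""
    if lang not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {lang}")
    return json.dumps({"text": text, "language": {"lang": lang}}, ensure_ascii=False)


def endpoint(path, **params):
    """Fill a path template such as '/kb/concept/{id}' and prepend the base URL."""
    filled = path.format(**{k: quote(str(v), safe="") for k, v in params.items()})
    return BASE_URL + filled


if __name__ == "__main__":
    print(endpoint("/disambiguate"))
    print(build_text_query("Marseille is in France.", "en"))
    print(endpoint("/kb/concept/{id}", id="Q42"))
```

In this sketch the JSON string would be sent as the body of the POST request (assumed here to be a form field named `query`); the network call is left out so the example runs without a server.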
1 Patrice Lopez, entity-fishing REST API, 2019, https://nerd.readthedocs.io/en/latest/restAPI.html, (accessed 13 May 2019)
WP3 Works
• Deployment and integration of entity-fishing services in the
partners’ open access platforms.
• Approach: reusability and code sharing
• Data processed:
• 4000 books in English and French from OpenEdition
• 2000 titles in English and German from OAPEN
• 162 books in English from Ubiquity Press
• 765 books (606 in German, 159 in English) from UGOE
• Results (entity-fishing clients in Java, Python and PHP), released under the Apache 2.0 licence:
• entity-fishing-client-python: Python client for the entity-fishing service
• entity-fishing client-php-oe: PHP client for the entity-fishing service, by OpenEdition
• entity-fishing-client-php: PHP client for the entity-fishing service, by EKT
• entity-fishing-client-oapen: integration scripts for the OAPEN infrastructure, by OAPEN
• For validation purposes:
• Use of a CC-BY gold-standard HIRMEOS corpus
• It contains thousands of manually corrected Named Entity Recognition and Disambiguation entities with
Wikidata identifiers (not present in any existing corpus, e.g. IITB, AQUAINT)
1 High Integration of Research Monographs in the European Open Science Infrastructure (HIRMEOS), WP3 NERD Work Package Validation, (accessed 6 May 2019)
2 Hirmeos Github, https://github.com/Hirmeos
The OpenEdition Books publishing
• An entity-fishing PHP client was created and
integrated into the Core data processing used for
enrichments
• Entities are fetched for each chapter by querying
the entity-fishing API services
• Entities are classified as PERSON and LOCATION
• Entity results are aggregated at book level
• Location and Person entities at book and chapter
level are stored in the Solr index
• Two facets, for Persons and Locations, are added to
the front-end interface
UGOE-SUB
• entity-fishing is integrated into the
publishing workflow of Göttingen
University Press (GUP) to enable the
semi-automatic indexing of its
monographs
• Titles, abstracts and metadata of the
monographs are processed by the
entity-fishing API to identify and categorize the
named entities
• Named entities are classified
into three classes: PERSON,
LOCATION and ORGANIZATION
• The number of occurrences of each
individual entity is shown
• The indexed data are displayed as facets,
which are made available to users as
« Keywords »; this allows users to quickly
find monographs by the entities
that appear in them
EKT / National Documentation Center
• The current release of OMP does not support any annotation service, so EKT has
extended OMP with entity-fishing support
• The entity-fishing API service is integrated into the landing page of Open Monograph Press (OMP)
monographs to annotate the abstract
• Two phases of implementation:
• Create a PHP client that acts as a wrapper around the entity-fishing service, hiding its complexity from the
user:
• The complexity of the HTTP protocol is hidden
• The JSON result of the entity-fishing service is wrapped in high-level class objects
• Integrate the client into the OMP software.
Ubiquity Press (UB)
• Developed an internal service to receive notifications from the existing company platform when a new article has been published and POSTs its content to the entity-fishing API to retrieve all the entities and store them locally.
• The entities are shown to the reader as clickable links referring to the Wikipedia entry.
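The two ideas above, wrapping the JSON result in class objects and showing entities as clickable Wikipedia links, could be sketched in Python roughly as follows. This is an illustration only, not the code of any partner: the response field names (`entities`, `rawName`, `offsetStart`, `offsetEnd`, `wikipediaExternalRef`), the `Entity` class, both function names and the `?curid=` URL pattern are assumptions.

```python
import html
from dataclasses import dataclass


@dataclass
class Entity:
    """One disambiguated mention (hypothetical wrapper class)."""
    raw_name: str
    start: int
    end: int
    wikipedia_id: int


def parse_entities(response):
    """Wrap entities of a decoded JSON response; skip those without a Wikipedia reference."""
    entities = []
    for e in response.get("entities", []):
        if "wikipediaExternalRef" in e:
            entities.append(
                Entity(e["rawName"], e["offsetStart"], e["offsetEnd"], e["wikipediaExternalRef"])
            )
    return entities


def link_entities(text, entities, lang="en"):
    """Return HTML in which each entity mention links to its Wikipedia page."""
    parts, cursor = [], 0
    for ent in sorted(entities, key=lambda e: e.start):
        if ent.start < cursor:  # skip a mention that overlaps an earlier one
            continue
        parts.append(html.escape(text[cursor:ent.start]))
        url = f"https://{lang}.wikipedia.org/?curid={ent.wikipedia_id}"
        parts.append(f'<a href="{url}">{html.escape(text[ent.start:ent.end])}</a>')
        cursor = ent.end
    parts.append(html.escape(text[cursor:]))
    return "".join(parts)
```

Storing the parsed `Entity` objects locally, as the Ubiquity Press service does with its results, would let the links be rendered without calling the API again on each page view.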
OAPEN
• Scripts were created to:
• Call the entity-fishing service with 1) the path to a PDF and 2) the API URL as arguments
• Store the entity-fishing response locally
• Combine the entity-fishing results with the unique identifier of the book or chapter
in the OAPEN Library
• Export the database to CSV
• OAPEN plans to make the data available as a CC0 licensed file, which will be
published on the OAPEN Library metadata page
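The combine-and-export steps described on this slide could look roughly like the Python sketch below. It is not the OAPEN scripts themselves: the column names, the response field names (`entities`, `rawName`, `wikidataId`) and the identifier format are assumptions made for illustration.

```python
import csv
import io

FIELDNAMES = ["book_id", "raw_name", "wikidata_id"]  # assumed CSV columns


def rows_for_book(book_id, response):
    """Pair each entity of a stored response with the book or chapter identifier."""
    return [
        {
            "book_id": book_id,
            "raw_name": e.get("rawName", ""),
            "wikidata_id": e.get("wikidataId", ""),
        }
        for e in response.get("entities", [])
    ]


def export_csv(rows, out):
    """Write the combined rows as CSV to an open text file object."""
    writer = csv.DictWriter(out, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    stored = {"entities": [{"rawName": "Berlin", "wikidataId": "Q64"}]}
    buffer = io.StringIO()
    export_csv(rows_for_book("example-book-id", stored), buffer)
    print(buffer.getvalue())
```

Keeping the identifier in every row means the exported file can be joined back to the library's metadata without any other lookup.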