Papillon Lexical Database Project: Monolingual Dictionaries & Interlingual Links

                               Gilles Sérasset, Mathieu Mangeot
                                       GETA-CLIPS IMAG
                              University Joseph Fourier — Grenoble 1
                             BP 53, 38041 Grenoble cedex 9, FRANCE
                    Gilles.Serasset@imag.fr, Mathieu.Mangeot@imag.fr

Abstract

This paper presents a new research and development project called Papillon. It started as a French-Japanese cooperation between the laboratories GETA/CLIPS (Grenoble, France) and NII (Tokyo, Japan). Its goal is to build a multilingual lexical database and to extract digital bilingual dictionaries from it.
The database is based on monolingual dictionaries, one for each language of the database, linked to an interlingual dictionary.
From the lexical database, it is planned to derive user-customized bilingual dictionaries in multiple target formats. It will be possible to generate dictionaries for human use as well as specialized dictionaries for machine translation software. These dictionaries will be available under the terms of an open source license.
This project, initiated by computational linguists, aims to be useful and open to all those who are interested in Japanese and French. It is also open to any other language. Moreover, the pivot architecture of the database will facilitate the addition of new languages and save translation effort.

1. Introduction

There are few French-Japanese usage dictionaries that are really usable and useful for French speakers. The main problem is that the original Japanese script and the rômaji phonetic transcription are present together only in very small dictionaries. Also, dictionaries never contain numeric specifiers, which are as important in Japanese as gender and number in French. Moreover, the information available in paper dictionaries does not exist in machine-readable form, or is not accessible online.
The lack of bilingual resources is also an obstacle to the development of linguistic software applications, which need adapted dictionaries. For example, Nippon Telegraph and Telephone in Japan and Lexiquest in France have to develop their own dictionaries in separate, time-consuming efforts. In the academic world, this means that applications created for French and Japanese offer only a reduced scope, while good English-Japanese software is available.
Nevertheless, Japan is very interested in the French language. Conversely, a growing number of French people invest much energy in learning Japanese. There is a vacuum to be filled.
The communication facilities offered by the Internet suggest that a convenient digital dictionary could be produced through a general cooperation between linguists, translators, computer scientists, etc., working together over the Internet.
A similar project between English and Japanese has been active for about a decade. It has led to the effective building of a free Japanese-English dictionary, available through an Internet server. This Edict project was created and is supported by Prof. Jim Breen from Monash University, Australia (see bookmark 2). The current JMDict dictionary now comprises 70,000 entries of common vocabulary, a specific kanji dictionary, and around twenty specialized dictionaries (biology, law, etc.).
A different project, fed by volunteers, is supported by NEC Corporation. Its aim is to
extend the dictionaries used by the NEC translation tool (see bookmark 3) and to bring in new entries on a continuous basis.
We should also mention the SAIKAM project [1] (see bookmark 5), a cooperation between NII (Tokyo, Japan) and NECTEC (Bangkok, Thailand) that has been active for about five years, in which Thai students working or having worked in Japan have built a sizable Japanese-Thai online dictionary through the Internet.
In such a context, the GETA/CLIPS laboratory (Grenoble, France) and the National Institute of [...] database. Described here are the architecture of the database, the structure of the entries and the methodology adopted for the project.

2. General View of the Database

The lexical database is built on the one hand by integrating existing resources and on the other hand by writing and correcting new entries (see Figure 1).
Once the database is homogeneous, users will be able to extract their own customized dictionaries dynamically from the database and to interact with them.

Figure 1. General architecture [diagram: resources feed the lexical database through the integration of existing resources; dictionaries are extracted from the database; users interact with the dictionaries]

3. Internal Architecture of the Database

The database will be built using a pivot architecture based on Dr. G. Sérasset's Ph.D. thesis (Sérasset 1994) and experimented with by Dr. E. Blanc in PARAX (Blanc 1999). The monolingual dictionaries will be linked only through a pivot dictionary of interlingual links called acceptions. These acceptions will also be linked together by refinement links. They may also be translated into the UNL language (UNL 1996) (see bookmark 6).

[Figure: example of an interlingual acception encoded in XML]

Each sense or meaning of each entry of a monolingual dictionary is linked to one or more acceptions of the pivot dictionary. For example, in French "carte" has two meanings: "carte à jouer (card)" and "carte géographique (map)". The entry "carte" will consequently be linked to two "lexies" (corresponding to 2 word senses) in the French monolingual dictionary, which in turn will be linked to 2 acceptions in the pivot dictionary: in the example, the first has number 343, with the corresponding UNL "UW" (universal word) "card(icl>play)", and the second one has number 345, with UW "map(fld>geography)" (see Figure 2).

Figure 2. Lexical architecture of the Papillon database

4. Structure of the monolingual dictionaries

The structure of the entries, or microstructure, of the monolingual dictionaries is based on the structure used for the formal lexical database DiCo (Polguère 1998) of the OLST laboratory at the Université de Montréal. The encoding methodology is directly borrowed from explanatory and combinatorial lexicology, which is part of the meaning-text theory (Mel'cuk 1997).
1. Name of the lexical unit: MEURTRE
2. Grammatical properties: nom, masc
3. Semantic formula: action de tuer: _ PAR L'individu X DE L'individu Y
4. Government pattern: X = I = de N, A-poss; Y = II = de N, A-poss
5. (Quasi-)synonyms: {QSyn} assassinat, homicide#1; crime
6. Semantic derivations and collocations: {V0} tuer
        {A0} meurtrier-adj
        {S1} auteur [de ART _] //meurtrier-n /*Nom pour X*/
7. Examples: La mésentente pourrait être le mobile du meurtre.
8. Full idioms: _appel au meurtre_
                _crier au meurtre_

We chose to encode this dictionary in XML (see Annex). With this choice, we are able to manipulate the dictionary structure using open-source tools, namely XSLT processors and DOM parsers (such as xalan and xerces from the « apache » project, see bookmark 7).

5. Building methodology

The building methodology of the lexical database relies on the one hand on the reuse of existing data, namely the French-English-Malay dictionary (Gut et al. 1996) (see bookmark 1) and the Japanese-English dictionary of Jim Breen (see bookmark 2), and on the other hand on the contribution of volunteers working through the Internet.
Different steps are planned. The first step is the integration of existing resources. It consists in preparing a "lexical soup" by merging the two dictionaries, which is possible because both contain English. This merging operation will produce correct as well as incorrect acceptions (interlingual links). The wrong acceptions will be corrected or deleted by lexicologists.
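
As a rough illustration of this merging step, the following Python sketch joins a French-English word list and a Japanese-English word list on their shared English glosses. The sample pairs, the Acception class and the function name build_lexical_soup are assumptions made for this sketch; they are not the project's data or code, and real dictionaries would need gloss normalization well beyond lowercasing.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Acception:
    """A candidate interlingual link produced by the merge (still to be reviewed)."""
    ident: int
    english: str                                 # English gloss used as the join key
    lexies: dict = field(default_factory=dict)   # language code -> list of headwords


def build_lexical_soup(fr_en, ja_en):
    """Join two bilingual word lists on their shared English glosses.

    fr_en and ja_en are lists of (headword, english_gloss) pairs.
    Each gloss found in both lists yields one candidate acception.
    """
    by_gloss = defaultdict(lambda: {"fra": [], "jpn": []})
    for word, gloss in fr_en:
        by_gloss[gloss.lower()]["fra"].append(word)
    for word, gloss in ja_en:
        by_gloss[gloss.lower()]["jpn"].append(word)

    soup = []
    for gloss in sorted(by_gloss):
        links = by_gloss[gloss]
        if links["fra"] and links["jpn"]:        # keep glosses shared by both sides
            soup.append(Acception(len(soup) + 1, gloss, links))
    return soup


# Toy data, invented for illustration only.
FR_EN = [("carte", "card"), ("carte", "map"), ("banque", "bank"), ("rive", "bank")]
JA_EN = [("カード", "card"), ("地図", "map"), ("銀行", "bank"), ("土手", "bank")]

soup = build_lexical_soup(FR_EN, JA_EN)
for acception in soup:
    print(acception.ident, acception.english, acception.lexies)
```

With this toy data, the gloss "bank" groups banque and rive with 銀行 and 土手 in a single candidate acception, which conflates a financial sense and a riverside sense. This is the kind of incorrect link that the lexicologists would have to split or delete.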
Then the voluntary contributors will index new entries and the lexicologists will correct and integrate them into the database. This will create a cycle of edition/correction/modification of the entries between the lexicographers/contributors and the lexicologists. Different kinds of contributors can work on the database:
    • specialists of one language will write the monolingual entries;
    • people with good knowledge of French and Japanese, like translators, will work on the links between the monolingual entries and the acceptions;
    • people with good knowledge of UNL will translate the acceptions into UNL (UNL 1996).

6. Encourage voluntary contributions

One of the main ideas of the Papillon project is to encourage voluntary contributions. These contributions will be accepted via a "community web site" where any user should be able to:
    • Communicate and discuss the available material
    • Consult dictionaries
    • Correct dictionaries
    • Contribute
        - By giving personal dictionaries
        - By providing entries to other dictionaries
A mockup of this web site is currently being developed in Java; it integrates web services with XML processing.
About 300 detailed French entries and some Japanese entries are currently integrated in the database (in XML form).

Figure 3. Web page dynamically generated for French entry "meurtre" (murder)

The web site gives access to these entries by providing the users with a dynamically generated form obtained via XSL
transformations.
Different XSL transformations are provided that give access to different views. Up to now, we have only developed a complete view inspired by the Explanatory and Combinatory Dictionary (ECD) developed by Igor Mel'cuk (see Figure 3).

7. Dictionaries produced

Several monolingual or bilingual dictionaries can then be extracted from the database. Different types are needed: for human use, via database and plug-in functionalities or via usual dictionary formats, and for machine use.

7.1   For human use, via database and plug-in functionalities

People who interact in foreign languages often have access to computers. One of the aims of this dictionary is therefore to provide them with direct help within their editor, their browser, or the personal digital assistant they use daily.

7.2   For human use, via usual dictionary formats

We plan to automatically derive from the database digital presentations for web consultation and paper edition. The FeM (see bookmark 1) and JMDict (see bookmark 2) formats are the first targeted formats.

7.3   For machine use

The terminology resources available for building lingware (linguistic software) are almost nonexistent between Japanese and French. The rare available ones have to be radically restructured and augmented. The orientation of the Papillon lexical database towards possible use by machines will encourage the realization of lingware including both languages, by providing a first support for such projects.

8. Conclusion

The pivot architecture allows an easy integration of new languages because the reuse of existing links will save a lot of time-consuming effort. The Thai language is already about to join the project through a cooperation with Kasetsart University (KU/Thailand) and the National Electronics and Computer Technology Center (NECTEC/Thailand).
The open source license makes all the data available to anyone. Furthermore, we will be able to generate multiple formats from the lexical database.
Finally, it should be stressed that such an endeavor will need not only the dedication of as many volunteer contributors as possible, but also some stable support, in the form of a server and, more difficult, of a central team of experts in charge of "refining the raw ore" of individual contributions.
That team does not have to be in a single place, but convenient groupware tools should be developed for it.

References

Vuthichai Ampornaramveth, Akiko Aizawa, Keizo Oyama, Tasanee Methapisit (2000) Implementation of an Internet-Based Dictionary Development Environment: SAIKAM (in Japanese). Research Bulletin of the National Center for Science Information Systems, vol. 12, p. 101-109.
Blanc Étienne (1999) PARAX-UNL: A large scale hypertextual multilingual lexical database. Proceedings 5th Natural Language Processing Pacific Rim Symposium 1999, Tsinghua University Press, Beijing, 1999, p. 507-510.
Gut Yvan, Puteri Rashida Megat Ramli, Zaharin Yusoff, Chuah Choy Kim, Salina A. Samat, Christian Boitet, Nicolas Nédobejkine, Mathieu Lafourcade et al. (1996) Kamus Perancis-Melayu Dewan, dictionnaire français-malais. Dewan Bahasa Dan Pustaka, Kuala Lumpur, 667 p.
Mel'cuk Igor A. (1997) Vers une linguistique Sens-Texte. Leçon inaugurale, Collège de France, Chaire internationale, 43 pages. http://www.fas.umontreal.ca/LING/olst/FrEng/melcukColldeFr.pdf
Polguère Alain (1998) La théorie Sens-Texte. Dialangue, Vol. 8-9, Université du Québec à Chicoutimi, pp. 9-30. http://www.fas.umontreal.ca/LING/olst/FrEng/PolgIntroTST.pdf
Sérasset Gilles (1994) Interlingual Lexical Organisation for Multilingual Lexical Databases in NADIA, COLING-94, 5-9 August 1994, vol. 1/2: pp. 278-282.
Tomokiyo Mutsuko, Mathieu Mangeot & Emmanuel Planas (2000) Papillon : a Project
of Lexical Database for English, French and Japanese, using Interlingual Links. Journées Science et Technologie de l'ambassade de France au Japon, 13 Novembre 2001, Tokyo, Japon, 3 p.
UNL (1996) Universal Networking Language. UNL center, Institute of Advanced Studies, The UN University, 1996, 74 p.

Bookmarks

[b1] FeM Dictionary: http://www-clips.imag.fr/geta/services/fem
[b2] JMDict Japanese->English: http://meshplus.mesh.ne.jp/CRV2/dic/club/down.html
[b3] NEC project: http://meshplus.mesh.ne.jp/CRV2/dic/club/down.html
[b4] Papillon Project: http://vulab.ias.unu.edu/papillon/index.html
[b5] SAIKAM Project: http://saikam.nii.ac.jp
[b6] UNL Project: http://www.unl.ias.unu.edu
[b7] Apache project tools: http://xml.apache.org/

Annex
Excerpt of the XML structure that encodes the French entry « meurtre » (murder)

meurtre
  meu+rtr(e)
  n.m.
  action de tuer: ~ PAR L'
    individu
    X DE L'
    individu
    Y
  
      XI
        de N
                       A-poss
      
        YII
        de N
                       A-poss
  
         assassinat
         homicide$2
         crime
       
         tuer 
      
    …
    
      Très choquant
        atroce
        affreux
        brutal
        horrible
        inqualifiable
        odieux
      
    …
C'est ici que le double meurtre a été commis.
    Soupçonné du meurtre de son épouse, il a été arrêté par les gendarmes
mercredi.
    Il devrait comparaître aux assises dans trois semaines comme auteur
présumé du meurtre d'un quinquagénaire.
    La mésentente pourrait être le mobile du meurtre.
  
    _appel au meurtre_
    _crier au meurtre_
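
To suggest how such an XML entry might be manipulated by a program, here is a minimal Python sketch. The element names (lexie, headword, pos, syn, idiom, and so on) are invented for this example, since the markup of the excerpt above is not reproduced here and the project's actual schema may differ. The sketch uses Python's standard xml.etree.ElementTree module rather than the xalan and xerces tools mentioned in Section 4.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: every element name below is an assumption of this sketch,
# not the Papillon schema. The text values are taken from the excerpt above.
ENTRY_XML = """
<lexie>
  <headword>meurtre</headword>
  <pos>n.m.</pos>
  <synonyms>
    <syn>assassinat</syn>
    <syn>homicide$2</syn>
    <syn>crime</syn>
  </synonyms>
  <examples>
    <example>La mésentente pourrait être le mobile du meurtre.</example>
  </examples>
  <idioms>
    <idiom>appel au meurtre</idiom>
    <idiom>crier au meurtre</idiom>
  </idioms>
</lexie>
"""


def summarize(xml_text):
    """Parse one entry and collect a few of its fields into a dictionary."""
    root = ET.fromstring(xml_text)
    return {
        "headword": root.findtext("headword"),
        "pos": root.findtext("pos"),
        "synonyms": [syn.text for syn in root.findall("synonyms/syn")],
        "examples": [ex.text for ex in root.findall("examples/example")],
        "idioms": [idiom.text for idiom in root.findall("idioms/idiom")],
    }


summary = summarize(ENTRY_XML)
for key, value in summary.items():
    print(key, "->", value)
```

A view such as the one in Figure 3 could in principle be produced from the same parsed fields, although the project itself obtains its views through XSL transformations.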