Applying Oxford-PWN English-Polish Dictionary to Machine Translation

                                          Jassem Krzysztof

        Adam Mickiewicz University, Faculty of Mathematics and Computer Science, Poznań

The paper reports on the process of converting a large human-readable English-Polish dictionary, further abbreviated as OPEP (Oxford-PWN English-Polish), to a format applicable to machine translation. The report concludes with an evaluation of the Translatica system, which uses the converted data in transfer-based translation.

1. Translatica system
The Translatica system originates from the PolEng project developed in 1996-2002 at Adam Mickiewicz University (AMU) in Poznań. PolEng developed a transfer-based system translating texts from Polish into English, with the lexicon based on Internet texts. Translatica expands PolEng capabilities by adding translation in the reverse direction and by using a broader lexicon obtained from the contents of OPEP.

The main features of PolEng inherited by Translatica are:
    • bottom-up parsing based on the CYK algorithm
    • phrasal structure representation
    • Perl-like formalism of transfer and synthesis rules.
It is worth noting that the same formalism for description is used for both directions.

The formalism for the description of grammars is a kind of CFG. The authors tried to use available resources to describe the English grammar, e.g. AGFL (2002), but it turned out that they would hardly comply with the elaborated translation engine (see Graliński (2002) for details on the engine). In order to use the engine it was necessary for us to compose grammar rules ourselves.

According to the agreement between the authors of PolEng and the authors of OPEP, the lexicon for the English-to-Polish direction should be based on the OPEP contents. The paper reports the work that was done in 2003 in order to adapt OPEP contents to Translatica, the PolEng successor.

2. Building a lexicon for an MT system with the Polish language
Two approaches for building an MT lexicon are mainly discussed in the literature. In Pinkham and Smets (2002) the authors distinguish between "HanC systems", based on a "Hand-crafted Dictionary", and "Lead systems", based on a "Learned Dictionary". Systems of the first type use traditional bilingual dictionaries to determine the transfer between word senses, whereas in "Lead systems" the transfer part of the dictionary is trained on bilingual text corpora. The authors of the publication show the advantages of the "Lead approach", particularly for new pairs of languages and limited time for development.

By contrast, the experiments of Ilaraza, Mayor, and Sarasola (2001) demonstrate the advantages of building an MT lexicon on the basis of large traditional dictionaries. The authors compare the translation produced by a system that uses a raw bilingual dictionary to that given by a system whose dictionary is a merge of a Basque lexical database and the Morris bilingual English-Basque dictionary. The authors state that the output of the latter system is distinctly better.

Another dichotomy is mentioned by Baldwin, Hutchinson and Bond (1999). On the basis of English-Japanese translation the authors compare the systems that store entries as source/target language pairs to those which consider both languages separately. In the source/target approach "a word has as many senses as it has translation equivalents". In the latter approach sense distinctions are specific to each language, which in the authors' opinion is "more cognitively justifiable". The main argument for the "monolingual" approach in MT is that the decision on the sense disambiguation may be postponed until the process of generation, whereas in the "bilingual approach" the semantic constraints on the source side are used for disambiguation in both source analysis and transfer. Another argument

against "bilingual" dictionaries is their uni-directionality.

One of the main criteria for choosing the type of lexicon for an MT application is the availability of resources. Because of the scarcity of aligned Polish-English bi-texts, the "Lead approach" has lost its main benefit for our purposes, rapid deployment: in order to use the approach it would be necessary to build aligned corpora (the same reason determined the choice of the transfer method for the translation, rather than the corpus-based one). On the other hand, our group had at its free disposal quite large traditional dictionaries: the OPEP English-Polish dictionary and a few Polish dictionaries mentioned in section 4. The situation called for the "HanC approach".

The decision whether the entries should be bilingual or monolingual was to a large extent determined by the conditions of the agreement between the PolEng group and PWN. The dictionary publishers (as well as the authors of the system) wanted the system to use the linguistic knowledge included in the dictionary material to the maximum extent. The system dictionary should mirror the OPEP material as closely as possible. This called for the bilingual description. The uni-directionality was not a counterargument either, as the Polish-to-English part had already been developed.

3. Main goals
The main goals set for the conversion process were:
    • to lose as little information as possible from OPEP
    • to extract and formalize syntactic and semantic information given in OPEP
    • to supplement the data with all information necessary for transfer-based machine translation, in accordance with the Translatica algorithm.

4. Resources
Before the start of the work, the group consolidated the following resources:

  1. Lexical resources at free disposal:
    • OPEP in the electronic form, XML
    • PolEng lexicon (Polish-to-English)
    • Polish lexicon of inflected forms delivered by PWN: the lexicon developed by the organization of Polish Scrabble players, further referred to as the scrabble dictionary
    • other dictionaries of Polish published by PWN, e.g. Bańko (2000)
    • lists of entries (e.g. proper nouns) from PWN encyclopedias.
  The PolEng lexicon comprises some syntactical information that could prove useful for the other direction; the scrabble dictionary handles Polish inflection exhaustively; Bańko (2000) includes some information on syntactical features of entries, in a form well suited for computer processing.

  2. Lexical resources at limited disposal (via Internet), e.g.
    • Merriam-Webster Dictionary
    • Internet English-Polish dictionary
  These and other dictionaries available on-line helped lexicographers understand the meaning of some entries or suggest alternative equivalents.

  3. Text corpora concordancers:
    • British National Corpus
    • Collins Cobuild Corpus
    • WordCorp
    • PolEng Internet corpus: the corpus of Polish Internet texts collected while working with the Polish-English translation.
  Consulting such tools helped to verify syntactic features of words; such information is not given exhaustively in OPEP.
  WordNet has proved to play a key role in assigning semantic values. The semantic hierarchy used in Translatica is a subtree of the WordNet lattice.

  4. Translation tools:
    • grammar description
    • syntactic-semantic parser
    • tools for transfer and synthesis.
  Translation tools impose specific constraints on the type of information that should be stored in the dictionary.

5. Translatica dictionary formalism

The dictionary formalism of the Translatica system assumes one lexicon for each direction (source/target approach). For example, the description of an entry in the English-to-Polish direction gives the constraints under which a word (or a phrase) is translated into appropriate equivalents.

5. Basic problem – time limitations
As is shown in section 6, full and detailed conversion of a single entry consumes a lot of human work (apart from computer work). The group could afford 27 man-months to accomplish the task of conversion (3 lexicographers, 9 months). The available time was not sufficient to manually elaborate each entry of OPEP (even after automatic pre-processing). The group adopted the following approach:
    • function words should be described almost from scratch
    • out of 55 900 entries in OPEP, the ca 20 000 most frequent ones (according to BNC) should be converted automatically and then elaborated manually
    • the rest of the dictionary should be converted only automatically
    • errors resulting from manual and automatic conversion should be corrected semi-automatically.

6. Processing of the dictionary
The process of dictionary conversion involved the following stages:
    • automatic conversion of dictionary information
    • automatic morphological description
    • manual description/verification of the 20 000 most frequent lexemes
    • semi-automatic correction of errors
    • manual correction of errors found while testing the translation system.

6.1. Automatic conversion of dictionary information
In Mayfield and McNamee (2002) the authors present an interesting idea that aims at simplifying the conversion of bilingual dictionaries from human-readable to computer-readable form. They have created a language called ABET (APL Bidict Extraction Tool) that allows for automating "the processes that are the same across most extraction tasks". At the time the paper was written the authors had converted 50 MB of on-line "bidicts" of varying formats, and the longest ABET script they needed consisted of a mere twenty-four lines.

Although we think that a language like ABET may prove beneficial in dealing with more than one dictionary of a simple format, we do not think that the idea would work for traditional off-line dictionaries that include deep linguistic knowledge, rarely given in a systematic way. In our attempt to convert the dictionary we have come across so many specific "little problems" that it is hard for us to imagine that any generalizing language could be of much help.

The script used for the conversion job was written in Perl. This section lists the major problems in the conversion (and does not mention those "little nuisances" which are particularly not amenable to description in a language of a higher level).

The task consisted of the following steps:

Separating entries
In the approach suggested in Mayfield and McNamee (2002) the macro HEADER-FIND is responsible for separating entries. The macro identifies headers according to the specification of the dictionary. Such an approach would not solve the problems that we encountered in separating entries in OPEP:

• Some entries have references to other entries. The entry upon is described in OPEP only with the reference to the entry on (i.e. upon = on). In such a situation the reference was forwarded (the idea of reference is used in the Translatica dictionary as well). Quite a few entries have references only in one of the senses. An example may be brainstorm, whose informal meaning is described as equal to brainwave (whereas other senses are described separately). Such a situation requires different treatment ("copy and paste inside"). Another type of reference concerns entries which are word forms of other entries. The entry bidden has a reference to bid as its whole description, whereas the entry built has a reference (to build) in one sense as well as its own set of equivalents for other senses. Our decision was to discard entries like bidden from the dictionary and to discard only the referenced senses in entries like built.
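
The treatment of references described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the Perl script used in the project: the `Sense` and `Entry` classes, the `ref_kind` labels "equal" and "form", and the placeholder equivalents such as "EQ-on" are all hypothetical. In particular, the sketch assumes that "copy and paste inside" means copying the equivalents of all ordinary senses of the referenced entry.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Sense:
    equivalents: List[str] = field(default_factory=list)
    ref: Optional[str] = None   # headword this sense points to, if any
    ref_kind: str = ""          # hypothetical labels: "equal" (upon = on), "form" (bidden -> bid)


@dataclass
class Entry:
    headword: str
    senses: List[Sense]


def resolve_references(entries: List[Entry]) -> List[Entry]:
    """Apply the three reference rules; entries left without senses are dropped."""
    by_head: Dict[str, Entry] = {e.headword: e for e in entries}
    result = []
    for entry in entries:
        only_refs = all(s.ref is not None for s in entry.senses)
        kept = []
        for sense in entry.senses:
            if sense.ref is None:
                kept.append(sense)              # ordinary sense, kept as is
            elif sense.ref_kind == "form":
                continue                        # word-form reference: discard the sense
            elif only_refs:
                kept.append(sense)              # whole entry is a reference: forward it
            else:
                # reference in one sense only: "copy and paste inside"
                target = by_head.get(sense.ref)
                copied = [eq
                          for s in (target.senses if target else [])
                          if s.ref is None
                          for eq in s.equivalents]
                kept.append(Sense(equivalents=copied))
        if kept:
            result.append(Entry(entry.headword, kept))
    return result


# Placeholder data: the "EQ-..." strings stand in for Polish equivalents.
entries = [
    Entry("on", [Sense(["EQ-on"])]),
    Entry("upon", [Sense(ref="on", ref_kind="equal")]),
    Entry("bid", [Sense(["EQ-bid"])]),
    Entry("bidden", [Sense(ref="bid", ref_kind="form")]),
    Entry("build", [Sense(["EQ-build"])]),
    Entry("built", [Sense(ref="build", ref_kind="form"), Sense(["EQ-built"])]),
    Entry("brainwave", [Sense(["EQ-brainwave"])]),
    Entry("brainstorm", [Sense(["EQ-brainstorm"]),
                         Sense(ref="brainwave", ref_kind="equal")]),
]
resolved = {e.headword: e for e in resolve_references(entries)}
```

With this data, upon keeps its forwarded reference to on, bidden disappears, built keeps only its own sense, and the second sense of brainstorm receives the equivalents of brainwave.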

• Entries may have graphical variants. We decided to separate such variants into multiple entries (the other variants received references to the first variant). There are two ways of denoting graphical variants in OPEP: one with the usage of parentheses inside words, e.g. bias(s)ed, the other by means of a comma, e.g. baldachin, baldaquin. The comma is not used consistently, however: in the entry births, marriages and deaths the comma obviously does not separate variants. To resolve that ambiguity automatically we assumed that variants of the same word must share the first letter. On the other hand, graphical variants that differ only in the capitalization of the first letter (e.g. balkanization, Balkanization) should be merged into one entry only. The entry backwards has a "graphical alternative" backward, but backward itself constitutes a separate entry in OPEP (giving a reference to backwards!). In cases like this the entries should not be duplicated.

• Some words are indexed and treated as separate entries in OPEP (e.g. billet as billet1 /an order/ and billet2 /of wood/) although they are the same part of speech. We merge such entries into one (with separate equivalents).

• There are entries that lack direct equivalents (e.g. behalf): the equivalents are given only for phrases that include them. Since we require each entry to have at least one non-empty equivalent, such entries must be automatically tagged and individually decided: either the entry should be removed (only lexical phrases should remain) or an artificial equivalent should be found for the entry.

• Phrasal verbs need special treatment. We decided to separate entries that are described in OPEP as phrasal verbs from simple verbs. This approach requires considerable caution. For example, the phrasal verb get out of is described in OPEP as a separate phrasal verb, but some important uses of the multiword get out of are also mentioned in the description of the phrasal verb get out, and in the description of the simple verb get. To make things worse, these uses partially overlap.

As will be seen in the sections to follow, the process of separating entries also takes place in the next phases of the conversion.

Automatic acquisition of attribute values
This phase consists in extracting from OPEP the values of attributes needed by the Translatica translation algorithm.

• The automatic procedure converts the OPEP inflectional description into Translatica encoding. Some verbs in English have different inflected forms for different senses (e.g. speed, sped, sped as contrasted to speed, speeded, speeded). In such a case we decided to divide an entry into two.

• The syntactical information on the complements should be derived from various parts of the OPEP description. Transitive verbs (coded as vt in OPEP) are treated in our approach as having an equivalent with a noun complement in the accusative in Polish. The complementation may partially be derived from PREP elements, which in the OPEP XML describe prepositional complements. Direct complements are listed in OPEP as OBJ elements for verbs and INDIC elements for nouns. Complementation may and should also be derived from human-readable descriptions, e.g. from strings like on or about smth.

• Assigning semantic attributes from various tags in OPEP has been partially done automatically. For example, the values given in the element COLL, which describes the sense (e.g. for the word speech, the senses are: oration, faculty, language, subject), are automatically converted into the Translatica semantic hierarchy by means of WordNet.

• The attribute of context, which comprises "domain", "style" and "dialect" in Translatica, should be derived from various tags in OPEP, mainly from qualifiers.

Merging senses
This phase aims at reducing the number of equivalents for entries in order to simplify computations in the translation process.

• Some "second-best" equivalents are replaced: the sense they represent is discarded and the equivalent is inserted as a synonym of a "better" equivalent. A sense is converted into a synonym only if strict conditions are fulfilled: all attributes must have the same value (this automatic procedure may be undone during manual verification).
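
A minimal Python sketch of the "second-best" rule may make the strict condition concrete. It is an illustration only: the dictionary layout, the `second_best` flag, the attribute names and the placeholder equivalents "EQ-1" to "EQ-3" are assumptions, since the way second-best equivalents are recognized is not specified here.

```python
from typing import Dict, List


def fold_second_best(senses: List[Dict]) -> List[Dict]:
    """Turn a flagged sense into a synonym of an earlier sense with identical attributes."""
    kept: List[Dict] = []
    for sense in senses:
        host = None
        if sense.get("second_best"):
            # Strict condition: every attribute must have the same value.
            host = next((k for k in kept if k["attrs"] == sense["attrs"]), None)
        if host is not None:
            host["synonyms"].append(sense["equivalent"])
        else:
            kept.append({
                "equivalent": sense["equivalent"],
                "attrs": dict(sense["attrs"]),
                "synonyms": [],
            })
    return kept


# Hypothetical senses of one entry; "EQ-n" stands in for a Polish equivalent.
senses = [
    {"equivalent": "EQ-1", "attrs": {"pos": "noun", "sem": "human"}},
    {"equivalent": "EQ-2", "attrs": {"pos": "noun", "sem": "human"}, "second_best": True},
    {"equivalent": "EQ-3", "attrs": {"pos": "noun", "sem": "object"}, "second_best": True},
]
merged = fold_second_best(senses)
```

Here EQ-2 becomes a synonym of EQ-1 because all attributes agree, while EQ-3 stays a separate sense because its semantic value differs.
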
• It often happens that different senses have the same equivalents. For example, all of the first five senses of the verb get in OPEP include the Polish equivalent dostać. In such a case we would like to merge the senses into one (according to the paradigm that in the Translatica dictionary a word has as many senses as it has equivalents). In order to carry out the merging it was necessary to process the values of attributes cautiously, e.g. the value of the complementation attribute for the merged entry should include the complementation patterns of all merged senses.

Automatic conversion of lexical phrases
For most word-senses the OPEP dictionary gives examples of usage as well as idioms that include the word-sense. Both types of phrases were automatically copied into the list of idioms in the Translatica database. It was up to the lexicographers to distinguish between the two types and make appropriate decisions. Idioms in the Translatica dictionary are described with the same set of attributes as single words. Some syntactical and semantic values could be obtained automatically from OPEP in the same way as for single words (e.g. the human semantic value for an object denoted as sb). Still, there was more for lexicographers to do with idioms than with single words.

Some idioms and their equivalents are denoted in OPEP by slash marks (e.g. a banking/an educational ~ system bankowy/ edukacji). Such idioms needed to be separated automatically.

Cleaning up and adjusting
OPEP uses its own metalanguage, which should be parsed during conversion. For example, the word "or" serves to indicate either alternative uses, alternative translations, or alternative complements. We have decided to treat "or" in the following way in the conversion process: alternative uses should be divided into separate idioms; from the alternative equivalents all except the first one should be discarded; alternative complements should be merged. Similar treatment is used for the metaword "also". The word beacon, in its third sense, has the following form in OPEP: (also: radio ~). This should be converted into a new entry: radio beacon.

The usage of words in parentheses should also be disambiguated in this phase. Usually the parentheses denote optional occurrence, as in the entry bulgur (wheat). If the parentheses appear on the source side, two entries are generated (e.g. bulgur and bulgur wheat); if they appear on the target side, the second alternative is omitted.

6.2. Automatic morphological description
The morphological information for English words is given in OPEP. The main resources that contributed to determining the Polish inflection were the PolEng Polish-to-English dictionary and the PWN scrabble lexicon of inflected forms.

6.3. Manual verification/modification of data
The data produced by automatic conversion needed manual verification and modification. Verification revealed errors in the conversion process, which could be corrected by merely adjusting conversion procedures. Some linguistic aspects required human linguistic knowledge as well as consulting sources other than OPEP.

An important factor was the organization of labor. As the lexicographic group consisted of three (occasionally four) lexicographers who were supervised by a co-ordinator, it was necessary to find a way in which linguistic data could be modified more than once and the labor could be distributed between persons. We decided to convert the data into a classical SQL database. This idea made it possible to keep the lexical database in order even if some of the work overlapped.

The main aspect (in view of transfer translation) which is not well handled by automatic conversion is complementation. OPEP delivers almost no explicit information on left-hand complements (specifiers) like prepositions (and their translations) that tend to precede specific nouns. This information had to be encoded manually on the basis of other resources (e.g. the PolEng dictionary). The right-hand complements are treated in OPEP more exhaustively, but many of them are given implicitly in examples of usage, e.g. we'll never get by without him is an example given in OPEP that should, for the translation needs, be treated as a phrasal verb get by complemented by a PP without sb.

The lexicographers were asked to query English corpora in order to check for complements that are not listed in OPEP. This was a hard task that required good knowledge

of English and could not be executed (semi)automatically.

Word senses in the Translatica dictionary are described semantically by means of a subtree from the WordNet semantic hierarchy. OPEP delivers some hints on semantic values of words and semantic values of their complements, but it was up to the lexicographer to choose the most appropriate ones.

The lexicographers found processing idioms and phrases the most challenging. It was up to lexicographers:
     • to convert idioms and phrases into a canonical form
     • to distinguish phrases from word usage examples so that only the former are included in the dictionary
     • to determine complementation of idioms (based on lexicographers' intuition and on-line corpora)
     • to describe admissible gaps in phrases
     • to determine the syntactic role of a phrase (e.g. to determine that this morning should be treated as an adverb rather than a noun).

6.4. Semi-automatic correction
Before that stage several types of errors were present in the lexicon. The errors might have resulted from erroneous automatic conversion, mistakes of lexicographers or the development of the description formalism during the project. Semi-automatic correction consisted in automatic search for errors (spelling errors and syntactical errors, i.e. descriptions inconsistent with the formalism) and manual correction.

7. The dictionary status
The status of the dictionary after the conversion process is the following:
    • English to Polish: 73294 lexemes; 101878 inflected forms; 120805 Polish equivalents; 79971 lexical phrases
    • Polish to English: 49989 lexemes; 974989 inflected forms; 54952 equivalents; 42596 lexical phrases.
It is worth noting that the OPEP dictionary was automatically reversed to enrich the Polish-to-English lexicon with one-to-one equivalents.

8. Evaluation
The evaluation of the Polish-to-English translation (PolEng) was carried out at the Allied Irish Bank (Dublin) in November 2003. At that time the English-into-Polish translation was not available yet. The lexicon included only entries developed manually from various traditional dictionaries (before OPEP enrichment).

As the system provides add-ins to MS-Office applications as well as a plug-in to Internet Explorer, the evaluation dealt with types of documents specific to those applications. For each type the tester selected 7-15 documents from the banking domain and tried to evaluate the speed, coverage of words/phrases and accuracy. The testing was of a "black-box" type: the tester was not capable of estimating which component of the translation was responsible for erroneous translations. The tester used both relative and absolute evaluation methods. PolEng was compared to PolTrans, a translation system available on-line. Not surprisingly, PolTrans passed the test better as far as speed is concerned, while PolEng had higher numbers in accuracy. The attempt to find the absolute estimation was made in the following way: for each document the number of translated words was divided by the number of all words in the text, giving the word completeness. The tester then divided texts into "phrases": whole sentences or parts of sentences. The number of phrases translated completely (i.e. all words translated) was divided by the number of all phrases, giving the phrase completeness coefficient. For each phrase the accuracy of translation was estimated in terms of "accurate", "good", "moderate", "illegible". The number of accurate and good translations divided by the number of all translated phrases resulted in the accuracy coefficient.

An extract of the document is presented below. Each range goes from the worst to the best translated document.

Type of document       Word completeness (%)    Phrase completeness (%)    Accuracy (%)
Word 97                95.87 – 99.07            72.45 – 98.84              68.37 – 93.02
Word 2000              95.05 – 99.63            72.45 – 98.84              59.56 – 93.02
Internet Explorer      73.78 – 98.09            46.15 – 94.23              33.85 – 81.69
PowerPoint             94.71 – 99.7             62.64 – 97.65              54.9 – 81.95
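
The three coefficients reported in the table can be computed for a single document as in the following Python sketch. It is a sketch under assumptions rather than the tester's actual procedure: the `Phrase` record and the sample values are hypothetical, and "translated phrases" in the accuracy coefficient is read here as phrases with at least one translated word.

```python
from dataclasses import dataclass
from typing import Dict, List

GOOD_RATINGS = {"accurate", "good"}


@dataclass
class Phrase:
    words_total: int        # number of words in the source phrase
    words_translated: int   # number of those words that were translated
    rating: str             # "accurate", "good", "moderate" or "illegible"


def coefficients(phrases: List[Phrase]) -> Dict[str, float]:
    """Return the three coefficients as percentages for one non-empty document."""
    total_words = sum(p.words_total for p in phrases)
    translated_words = sum(p.words_translated for p in phrases)
    complete = sum(1 for p in phrases if p.words_translated == p.words_total)
    # Assumption: a phrase counts as translated if any of its words was translated.
    translated = [p for p in phrases if p.words_translated > 0]
    good = sum(1 for p in translated if p.rating in GOOD_RATINGS)
    return {
        "word_completeness": 100.0 * translated_words / total_words,
        "phrase_completeness": 100.0 * complete / len(phrases),
        "accuracy": 100.0 * good / len(translated),
    }


# Made-up document of four phrases, for illustration only.
document = [
    Phrase(5, 5, "accurate"),
    Phrase(4, 3, "good"),
    Phrase(6, 6, "moderate"),
    Phrase(5, 0, "illegible"),
]
scores = coefficients(document)
```

For this made-up document, 14 of 20 words are translated, 2 of 4 phrases are complete, and 2 of the 3 translated phrases are rated accurate or good.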

Similar tests have not yet been done for English-to-Polish translation. At the time of writing, the quality of translation in this direction "looks" distinctly lower. The main reason is that the transfer rules for that direction are less elaborated, because less time was spent on them. Other possible reasons are discussed in Conclusions.

9. Conclusions
The Translatica group has reached the following conclusions:
     1. Only a small part (smaller than expected) of the conversion of a human-readable dictionary into a machine-readable lexicon could be properly carried out fully automatically.
     2. The quality of translation is not proportional to the completeness of the description in the dictionary. A larger number of equivalents (if they are not constrained strongly) often lowers rather than raises the standard of translation.
     3. The dictionary description should be limited: the more information it contains, the slower the translation.
     4. WordNet as an entry point for semantic description proved helpful, but it has two drawbacks: 1) the lattice structure assumed in WordNet is hard to deal with computationally (as opposed to a tree structure); 2) the semantic hierarchy needed for disambiguation between English and Polish rarely subsumes the WordNet hierarchy.
     5. It proved very beneficial for the organization of labor to store the lexicon in the form of a standard SQL

Another conclusion concerns the translation algorithm itself. Transfer translation in the two directions differs strongly in one specific aspect: the problem of homography. In Polish, homography is quite rare because of rich inflexion. Consider the homographic sentence Przesłał (sent) mi (me) długi (long, debts) list (letter). It should be translated as He sent me a long letter (the subject is not obligatory in a Polish sentence), but it might as well be interpreted as A letter sent me debts (given the free order of components in a Polish sentence). This problem may be easily handled by semantic disambiguation: a letter is more likely to be an object than a subject of a send action. In translating from English, homography should be disambiguated as soon as possible (e.g. by statistical POS-tagging) in order to achieve good and robust translation.

References

AGFL (2002): usenix02/tech/freenix/full_papers/koster/koster_html/node4.html
Baldwin, T., Hutchinson, B. and Bond, F. (1999): A Valency Dictionary Architecture for Machine Translation. In: Eighth International Conference on Theoretical and Methodological Issues in Machine Translation: TMI-99, Chester, UK, pp. 207-217
Bańko, M. (2000): Inny słownik języka polskiego (A Different Dictionary of Polish), Wydawnictwo Naukowe PWN
Graliński, F. (2002): Wstępujący parser języka polskiego na potrzeby systemu POLENG (Bottom-up Parser of Polish designed for the POLENG System). In: Speech and Language Technology. Volume 6, Poznań 2002, [http:// html]
Ilaraza, A. Diaz de, Mayor, A. and Sarasola, K. (2001): Building a lexicon for an English-Basque MT system from heterogeneous wide-coverage dictionaries. In: Proceedings of MT 2000, Machine Translation and Multilingual Applications in the New Millennium, University of Exeter, United Kingdom, 20-22 November
Mayfield, J. and McNamee, P. (2002): Converting on-line bilingual dictionaries from human-readable to machine-readable form. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, August 11-15, 2002, Tampere, Finland

OPEP (2002): Wielki Słownik Angielsko-Polski (English-Polish Dictionary), Wydawnictwo Naukowe PWN
Pinkham, J. and Smets, M. (2002): Modular MT with a learned bilingual dictionary: rapid deployment of a new language pair. In: Proceedings of the 19th International Conference on Computational Linguistics, Taipei, Taiwan, pp. 800-806.
