MaTra: A Practical Approach to Fully-Automatic Indicative English-Hindi Machine Translation

Ananthakrishnan R, Kavitha M, Jayprasad J Hegde,
Chandra Shekhar, Ritesh Shah, Sawani Bade, Sasikumar M
Centre for Development of Advanced Computing (formerly NCST),
Juhu, Mumbai-400049, India
{anand,kavitham,jjhegde,shekhar,ritesh,sawani,sasi}@cdacmumbai.in

Abstract

MaTra is a fully automatic system for indicative English-Hindi Machine Translation (MT) of general-purpose texts. This paper discusses the strengths of the MaTra approach, especially focusing on the robust strategy for parsing and the intuitive intermediate representation used by the system. This approach allows convenient enhancement of the linguistic capabilities of the translation system, while making it possible for us to produce acceptable translations even as the system evolves. This paper also presents encouraging results of automatic evaluation using BLEU/NIST and subjective evaluation by human judges.

1    Introduction and Background

MaTra is a fully automatic system for indicative English-Hindi Machine Translation (MT) of general-purpose texts. This paper discusses the strengths of the MaTra approach, especially focusing on the robust strategy for parsing and the intuitive intermediate representation used by the system.

It is well accepted that Fully-Automatic, High-Quality, General-Purpose MT is not achievable in the foreseeable future – more so for widely divergent languages like English and Hindi. Existing MT systems work by relaxing one or more of these three dimensions: (i) focusing on a fairly small subset of the language (in domains such as weather forecasting and official gazettes), (ii) using human assistance during or after translation, or (iii) producing indicative rather than perfect translations.

MaTra is designed to be fully automatic, and is geared towards translating real-world, general-purpose texts. As a tradeoff, MaTra makes the last of the aforementioned approximations.

The primary testing ground for MaTra is the Web, where many sentences are grammatically incorrect or incomplete (fragments), and contain unknown words and abbreviations. In such a noisy environment, with incomplete knowledge, it is impractical to work with perfect models and grammars. In this scenario, the design of MaTra represents a pragmatic approach to engineering a usable system, which aims to produce understandable output for wide coverage, rather than perfect output for a limited range of sentences.

MaTra achieves this by using a judicious mix of (i) corpus-based or statistical tools and techniques for shallow parsing, word-sense disambiguation, abbreviation handling, and transliteration, and (ii) rule-based techniques for lexical and structural transfer.

The structural transfer component has at its core a relatively simple and intuitive intermediate representation (which we call MSIR – MaTra Structured Intermediate Representation) that can accommodate most types of sentences found in real-world texts. The simplicity and generality of the MSIR is an important aspect of our approach. The level of detail, both syntactic and semantic, is just enough to capture the divergence between English and Hindi.
In keeping with our engineering outlook, MaTra does not attempt either a deep parse or an elaborate semantic analysis of the English sentence. The parsing strategy is hybrid – a combination of statistical and rule-based techniques. The parsing component is modular by design, with each stage working on a logically separate aspect of structural analysis. One of the primary goals of the design is graceful degradation from full sentence structures all the way down to word-by-word structures.

The design of the MSIR and the parsing algorithm is the focus of this paper.

This paper is organized as follows: the next section presents the overall architecture of MaTra and describes the various components briefly. Section 3 discusses the intermediate representation used by MaTra (MSIR). The parsing algorithm used to obtain the MSIR is presented in section 4. Section 5 provides evaluation details. Section 6 concludes the paper.

Fig 1: MaTra Architecture
Sentence     → IndepClause | Conjunct | Fragment | MixedBag
Fragment     → FragClause | Conjunct
Conjunct     → Word, Sentence+
IndepClause  → VerbFrame, Modifier*
VerbFrame    → VerbGroup, Subject*, Object*, Modifier*
VerbGroup    → Word+
FragClause   → FragFrame, Modifier*
FragFrame    → VerbGroup*, Subject*, Object*, Modifier*
Subject      → NounPhrase | Sentence, Modifier*
Object       → NounPhrase | Sentence, Modifier*
Modifier     → PrepPhrase | DepClause
DepClause    → Word*, FragFrame, Modifier*
NounPhrase   → Word+
PrepPhrase   → Preposition, NounPhrase
MixedBag     → (Word | NounPhrase | PrepPhrase | DepClause | Fragment | IndepClause)+

               Fig 2: Grammar for the MaTra Structured Intermediate Representation (MSIR)
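To make the grammar in Figure 2 concrete, the productions can be transcribed into simple data types. The following is our own illustrative sketch (the paper does not specify how MSIR nodes are implemented); only a few node types are shown, and the tree built is for the simple sentence "Salim was sent to jail":

```python
from dataclasses import dataclass, field

# Hypothetical transcription of a few MSIR productions; names follow Figure 2.
@dataclass
class NounPhrase:
    words: list[str]                 # NounPhrase -> Word+

@dataclass
class PrepPhrase:
    preposition: str                 # PrepPhrase -> Preposition, NounPhrase
    noun_phrase: NounPhrase

@dataclass
class VerbFrame:                     # VerbFrame -> VerbGroup, Subject*, Object*, Modifier*
    verb_group: list[str]            # VerbGroup -> Word+
    subjects: list[NounPhrase] = field(default_factory=list)
    objects: list[NounPhrase] = field(default_factory=list)
    modifiers: list[PrepPhrase] = field(default_factory=list)

@dataclass
class IndepClause:                   # IndepClause -> VerbFrame, Modifier*
    verb_frame: VerbFrame
    modifiers: list[PrepPhrase] = field(default_factory=list)

# "Salim was sent to jail": the subject occurs before the verb group,
# and a prepositional modifier follows it.
sentence = IndepClause(
    VerbFrame(
        verb_group=["was", "sent"],
        subjects=[NounPhrase(["Salim"])],
        modifiers=[PrepPhrase("to", NounPhrase(["jail"]))],
    )
)
```

Note how the starred items in the grammar naturally become lists, and the recursion (a Subject may itself be a Sentence) would be expressed by widening the type of `subjects`.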
2    MaTra Architecture

Essentially, MaTra follows a structural and lexical transfer approach, using semantic information only when required. Figure 1 shows the overall architecture of the system.

The preprocessing component splits the input text into sentences. Abbreviations, acronyms, dates, numeric expressions, etc. are identified at this stage. Part-of-speech tagging and chunking are done using the fast Transformation Based Learning tool, fnTBL [Ngai and Florian, 2001]. The word sense disambiguation component then chooses the appropriate Hindi mapping for English words and phrases. Next, the sentence-structuring component converts the chunked input into the MSIR, which facilitates Hindi generation. A rule-based generation engine [Rao et al., 1998; Rao et al., 2000; Mehta and Rao, 2003] generates the Hindi translation from the MSIR using rule-bases for preposition-mapping, noun-inflection, verb-inflection, and structural transformation. The transliteration component handles unknown words. This component uses Genetic Algorithms to learn transliteration rules from a pronunciation dictionary [Mishra, 2004].

3    MaTra Structured Intermediate Representation (MSIR)

The core part of our transfer-based framework for translation deals with the transfer of a single clause, which is adequate for handling mono-finite sentences [Rao et al., 2000]. This has then been systematically extended to handle multi-finite sentences and non-finite clauses, thus covering compound and compound-complex sentences [Mehta and Rao, 2003], and further, to fragments and incomplete sentences. However, the framework, as of now, does not handle imperative and interrogative sentences. Figure 2 shows the grammar for the MSIR based on this framework.

In this section, we first discuss how various types of clauses are represented in the MSIR. Then, we look at the MSIR interpretations of Subjects, Objects, and Modifiers, which are simplifications of the traditional definitions. Finally, we look at how incomplete sentences and fragments are represented in the MSIR.

3.1    Clauses

A clause is the basic unit of predication in any language, and forms the basis of the MSIR too. A clause consists of a single verb group, which represents an action, event or state change. The verb group may consist of one or more verbs, including auxiliaries and pre-modifying adverbs, and may be finite or non-finite. Every verb has certain sub-categorization features, which define the number and nature of the other constituents that attach to the verb to form the clause, namely, the subject, objects, complements, and modifiers. Each clause modifier may itself be a separate clause. This recursion can be used to create sentences with a hierarchy of clauses.

Clauses may be classified, based on the types of verbs that they contain, into the following three categories:

    •  Finite clauses: clauses containing a finite verb phrase. E.g., Jayshankar has visited Delhi
    •  Non-finite clauses: clauses containing a non-finite verb phrase but no finite verb phrase. E.g., Having visited Delhi, Jayshankar …
    •  Verbless clauses: clauses with no verb phrase. E.g., Jayshankar, then at Delhi, …

Clauses are represented in the MSIR as either independent or dependent clauses. In our representation, independent clauses are always finite, whereas dependent clauses may be finite, non-finite or verbless.

Sentences are classified as Simple, Complex and Compound based on the types of clauses that they contain. Simple sentences contain a single independent clause; complex sentences contain one independent and at least one dependent clause connected by subordinators; and a compound sentence contains more than one independent clause connected by coordinating conjunctions.

In the following three subsections, we look at how the MSIR represents simple, complex and compound sentences respectively.

3.1.1    Representing an Independent Clause – Simple Sentences

The fundamental unit that represents a single independent clause is a frame (VerbFrame) that captures:

    (i)   the action that is conveyed by the sentence (VerbGroup, which includes a single verb group and auxiliaries),
    (ii)  the entities that are involved in the action (Subject and Object – see subsection 3.2), and
    (iii) any adverbials (Modifier – see subsection 3.3).

Simple sentences, which have a single independent clause, are represented by an IndepClause, which in turn contains a VerbFrame.

3.1.2    Representing Dependent Clauses – Complex Sentences

The DepClause modifier, when combined with an IndepClause, allows us to represent complex sentences (see Figure 3).

DepClause differs from IndepClause in the following ways: (i) VerbFrame is replaced by FragFrame to indicate the fact that the verb group is optional, and (ii) it allows a subordinator (Word) in front of the FragFrame – subordinators are words such as wh-words, although, because, etc.

3.1.3    Representing Compound Sentences

Compound sentences are represented by connecting clauses with a Conjunct. Conjuncts include and, but, or, etc., and also correlative conjunctions such as not only-but also and if-then (see Figure 4).

3.2    Subject and Object

Subject and Object in our representation have a more general interpretation than usual. Subjects are phrases (NounPhrase) or clauses with nominal function, which occur before the verb group (VerbGroup) in the sentence. Similarly, Objects are phrases or clauses with nominal function, which occur after the VerbGroup in the sentence. Thus, complements are also represented as Subjects or Objects. These are simply syntactic entities; we do not attempt to determine semantic roles here.

Subjects and Objects may themselves be clauses. Figure 5 shows such an example.

3.3    Modifiers

Modifiers can be attached to any component of a clause (Modifier within Subject, Object, VerbFrame, and FragFrame) or to a complete clause as a whole (Modifier within IndepClause and DepClause). This recursive nesting of clauses can be used to represent complexity of any arbitrary level.

For simplicity, we club all modifiers except prepositional phrases as DepClause, in which phrases are represented as verbless clauses. Prepositional phrases are represented using PrepPhrase.

3.4    Representing Incomplete Sentences and Fragments

We define Fragment to represent incomplete sentences which end at chunk boundaries. This is used to handle captions, titles, news headlines, etc., which are usually not full sentences.

Incomplete sentences that do not end at chunk boundaries usually cannot be parsed in full by the sentence-structuring algorithm. MixedBag is a fallback option that allows us to represent such structures. Any identified clause will be structured in the usual manner, while other parts of the sentence will be represented as chunks (see Figure 6).
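The MixedBag fallback can be sketched as follows. This is our own illustration (the paper does not show the implementation): clauses that parsed successfully stay structured, and everything else survives only as raw chunks, in original order.

```python
# Hypothetical sketch of the MixedBag fallback: structured clauses
# where parsing succeeded, raw chunks elsewhere.

def to_mixed_bag(parsed_clauses, leftover_chunks):
    """Combine structured clauses with unparsed chunks, preserving order.

    parsed_clauses: list of (position, clause_text) pairs
    leftover_chunks: list of (position, chunk_text) pairs
    """
    items = [(pos, ("IndepClause", c)) for pos, c in parsed_clauses]
    items += [(pos, ("Chunk", ch)) for pos, ch in leftover_chunks]
    return [tagged for _, tagged in sorted(items)]

# For the Figure 6 example "Companies are using the Globus Toolkit to
# develop.", the finite clause is structured in the usual manner, while
# the trailing incomplete part is kept as a chunk.
bag = to_mixed_bag(
    parsed_clauses=[(0, "Companies are using the Globus Toolkit")],
    leftover_chunks=[(1, "to develop")],
)
```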

Fig 3: MSIR representation for a sentence containing a dependent clause: Salim, who was the prince, was sent to jail

Fig 4: MSIR representation for a compound sentence: If you want a healthy baby, you must eat well

Fig 5: MSIR representation for a sentence containing a clause as a subject: To be the greatest batsman was his dream

Fig 6: MSIR representation for a fragment not ending at a chunk boundary: Companies are using the Globus Toolkit to develop

The MSIR is a simple representation, which is reasonably easy to attain for most sentence types, using a straightforward parsing algorithm as discussed in the next section. This representation is admittedly rather coarse, and supplies much less syntactic and semantic information to the generation component than most existing grammars. As described, many categories of sentence elements have been clubbed together for simplicity of representation and parsing. For instance, except for prepositional phrases, modifiers are not distinguished, and complements are clubbed with Subject or Object. Semantic roles are largely ignored. However, based on a preliminary evaluation of MaTra, it is our contention that most of these details are not essential to English-Hindi indicative MT (see section 5 for evaluation details). We are working on larger scale and more thorough evaluation to substantiate this claim.

Hindi generation from the MSIR (structural and lexical transfer) is described in detail in [Rao et al., 1998; Rao et al., 2000; Mehta and Rao, 2003].
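Before turning to the parsing algorithm, the component flow of section 2 can be summarized as a simple pipeline. This is a sketch with stand-in stage names of our own choosing; only the ordering of stages follows the paper.

```python
# Hypothetical sketch of the MaTra pipeline from Figure 1:
# each stage transforms the running representation in order.

def translate(text, stages):
    """Run the input through each processing stage in order."""
    result = text
    for stage in stages:
        result = stage(result)
    return result

# Stand-in stages (the real ones are: preprocessing, fnTBL
# tagging/chunking, word sense disambiguation, sentence structuring
# into the MSIR, rule-based Hindi generation, and transliteration
# of unknown words). Here each stage just records that it ran.
log = []

def make_stage(name):
    def stage(data):
        log.append(name)
        return data
    return stage

stages = [make_stage(n) for n in (
    "preprocess", "tag_and_chunk", "disambiguate",
    "build_msir", "generate_hindi", "transliterate_unknowns",
)]
output = translate("Salim was sent to jail", stages)
```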
4    The Parsing Algorithm

The source sentence is first processed for punctuation, numerals, contractions, acronyms, and other abbreviations. The sentence is then chunked. Since the clause is the basis of the MSIR, the various clauses are then identified; this is done by the clause boundary detection component. Each clause is then structured independently, as discussed in section 3.1. The MSIR for the whole sentence is finally obtained by putting these structured clauses together in a manner that adheres to the MSIR.

Thus, the parsing stage can be divided into three major components: (i) the POS tagging and chunking component, (ii) the clause boundary detection component, and (iii) the sentence-structuring component.

4.1    POS Tagging & Chunking

POS tagging and chunking are done using the fnTBL POS tagger and chunker [Ngai and Florian, 2001], which is based on rules learnt using Transformation Based Learning (TBL) on the Wall Street Journal corpus. The POS tagger chooses the most probable part-of-speech tag (from the UPenn tagset) for each word. The text chunking component of fnTBL groups the words into one of Noun, Verb, Adjective, and Preposition groups.

The following is an example of fnTBL output:

[Salim/NNP NP] [,/, -] [who/WP NP] [was/VBD VP] [the/DT prince/NN NP] [,/, -] [was/VBD sent/VBN VP] [to/TO PP] [jail/NN NP]

4.2    Clause Boundary Detection

The clause boundary detection component works on the chunked sentence. It identifies the various clauses present in a well-formed sentence based on verbs and clause boundary markers called sentinels (who, whom, which, etc.) [Narayana Murthy, 1996], as described below:

Begin
    If verb not present
        label as Fragment
    End If
    If sentence is well-formed
        For each coordinating conjunction
            label left clause of coordinating conjunction as IndepClause
        End For
        label the rest of the sentence as IndepClause
        For each sentinel
            extract subordinate clause and label as DepClause
            mark the position where the subordinate clause was
              originally present
        End For
    End If
End

The following is an example of clause boundary output:

{[Salim/NNP NP] [,/, -] DepClause#1 [,/, -] [was/VBD sent/VBN VP] [to/TO PP] [jail/NN NP]}
DepClause#1 = {[who/WP NP] [was/VBD VP] [the/DT prince/NN NP]}

4.3    Sentence Structuring

The last stage of parsing is the sentence-structuring component. This takes the clause boundaries and builds the MSIR from them.

The algorithm for structuring the clauses and putting these structured clauses together to obtain the MSIR is described below.

SentStruct()
Begin
    For each clause
        ClauseStruct()
    End For
    If no well-formed combination of clauses and phrases
        Create a tree with an unstructured collection of clauses
          and independent phrases
    Else
        If Conjunct present
            Root ← Conjunct
        End If
        Attach each structured Fragment or IndepClause to Root/Conjunct
        Attach dependent clauses to their respective IndepClauses
        Attach clause Modifiers to their respective clauses
    End If
End

Each clause (with a single verb group) is then structured in the following way:

ClauseStruct()
Begin
    If not Fragment
        Attach the verb group to the root
        Search for and attach the noun group occurring before the
          verb group as Subject
        Search for and attach prepositional/noun groups as modifiers
          to the Subject
        Search for and attach noun groups occurring after the verb
          group as Object; the ordering determines whether the object
          is direct or indirect
        For each Object, identify and attach prepositional groups as
          modifiers to the Object
        The remaining groups are attached as modifiers to the verb group
    Else
        If verb group present
            Root ← verb group
            Search for and attach verb group modifiers
            Search for and attach noun group as Object
            Search for and attach prepositional groups as modifiers
        Else
            Root ← object
            Search for and attach noun group as Object
            Search for and attach prepositional groups as modifiers
        End If
    End If
End

The parsing algorithm currently does not support imperative and interrogative sentences. Also, the implementation for non-finite clauses is not complete.

5    Evaluation Strategy & Results

A preliminary test set of 315 sentences has been created by selecting sentences from news archives and other sources. This test set is in two parts.

The first part of the test set contains declarative sentences and fragments exhibiting certain basic grammatical phenomena, for example, sentences with: (i) different types of clauses in various roles, (ii) different types of phrases, and (iii) all tense-aspect-modality combinations. Evaluation shows that the MSIR and parsing algorithm can handle such sentences well.

The second part includes sentences and fragments with (i) interrogative, subjunctive, and imperative mood, (ii) phrasal verbs, (iii) elliptical clauses, (iv) idioms, etc. This part is intuitively more difficult for MT. We are currently working on extending the MSIR and parsing algorithm to accommodate these phenomena.

The complete test set of 315 sentences was used to evaluate the translations produced by MaTra. This serves as an indication of the effectiveness of the MSIR, parsing algorithm, and Hindi generation component.

BLEU [Papineni et al., 2002] and NIST [Doddington, 2002] evaluations were done using the NIST evaluation toolkit (http://www.nist.gov/speech/tests/mt/resources/scoring.htm). One reference translation was used for each sentence. N-grams (up to 4-grams) were matched after stemming.

Table 1 shows the scores for the current version of MaTra (called MaTra2) and the previous version. The previous version, in addition to imperative and interrogative sentences, also did not support fragments and incomplete sentences. Non-finite clauses were also not supported. The scores suggest that substantial improvement has been made in MaTra2 over the previous version.

         MaTra (Feb 05)    MaTra2 (Mar 06)
BLEU     0.0377            0.0534
NIST     2.1261            3.1494
          Table 1: BLEU and NIST scores

Manual inspection indicates that there are differences in lexical choice between the reference and system translations. Structural differences also exist due to the free word-order of Hindi. These are partly the reason for the low absolute scores. To improve the automatic evaluation process, we are increasing the number of reference translations for BLEU/NIST. We are also looking at other automatic evaluation strategies that match linguistic phrases rather than n-grams. Such a strategy would perhaps be better suited to a free word-order language like Hindi.

Subjective evaluation was performed by two native Hindi speakers, who are also proficient in English. These judges were able to identify most transliterated words, which may not be true for a person who does not know English. They graded translations on the following 4-point scale [Sumita et al., 1999]. Translations marked as one of A, B, and C can be considered acceptable:

A) Perfect: no problems in either information or grammar
B) Fair: easy to understand, with some unimportant information missing or flawed grammar
C) Acceptable: broken but understandable with effort
D) Nonsense: important information has been translated incorrectly

Grade          % of sentences
A              12.7 %
(A + B)        37.1 %
(A + B + C)    65.4 %
     Table 2: Subjective evaluation scores

Table 2 shows the results of the subjective evaluation. Though the number of perfect translations is low (12.7%), it is highly encouraging that more than 65% of the translations were rated as acceptable.

6    Conclusion

In this paper, we have described a practical strategy for designing an English-Hindi machine translation system. The system is geared towards translating real-world, general-purpose texts with a high level of noise, such as those on the Web. The design of the system represents a pragmatic approach to engineering a usable system, which aims to produce understandable output for wide coverage, rather than perfect output for a limited range of sentences. The design is based on an intuitive intermediate representation (MSIR) that especially simplifies the parsing algorithm, as described in the paper. The approach allows convenient enhancement of the linguistic capabilities of the translation system, while making it possible for us to produce acceptable translations even as the system evolves.

Though the system is at an early stage of implementation, preliminary evaluation is very encouraging – more than 65% of translations were rated as acceptable by human judges. The paper also reports automatic evaluation results using BLEU and NIST.

Future work will look at handling interrogative, subjunctive and imperative moods, phrasal verbs, idiomatic usages, etc. Larger scale and component-wise evaluation of the system is also being planned.

References

Doddington G., Automatic Evaluation of Machine Translation Quality Using N-Gram Co-Occurrence Statistics, Proceedings of the Second International Conference on Human Language Technology, 2002.

Mehta V. and Rao D., Natural Language Generation of Compound-Complex Sentences for English-Hindi Machine Aided Translation, Proceedings of the Symposium on Translation Support Systems (STRANS), 2003.

Mishra A., Discovering Rules for Transliteration from English to Hindi: A Genetic Algorithms Approach, MCA thesis, SSIT Orissa, 2004.

Mohanraj K., Hegde J., Dogra N., and Ananthakrishnan R., The MaTra Lexicon, Technical Report, CDAC Mumbai, 2003.

Narayana Murthy K., Universal Clause Structure Grammar, Ph.D. thesis, Dept. of Computer and Information Sciences, University of Hyderabad, 1996.

Ngai G. and Florian R., Transformation-Based Learning in the Fast Lane, Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), 2001.

Papineni K., Roukos S., Ward T., and Zhu W., BLEU: A Method for Automatic Evaluation of Machine Translation, Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), 2002.

Rao D., Bhattacharya P., and Mamidi R., Natural Language Generation for English to Hindi Human-Aided Machine Translation, Proceedings of the International Conference on Knowledge Based Computer Systems (KBCS), 1998.

Rao D., Mohanraj K., Hegde J., Mehta V., and Mahadane P., A Practical Framework for Syntactic Transfer of Compound-Complex Sentences for English-Hindi Machine Translation, Proceedings of the International Conference on Knowledge Based Computer Systems (KBCS), 2000.

Sumita E., Yamada S., Yamamoto K., Paul M., Kashioka H., Ishikawa K., and Shirai S., Solutions to Problems Inherent in Spoken-Language Translation: the ATR-MATRIX Approach, Proceedings of MT Summit VII, 1999.