Automatic Translation of Noun Compounds from English to Hindi

Page created by Sally Mack
 
CONTINUE READING
Automatic Translation of Noun Compounds from English to Hindi
Automatic Translation of Noun Compounds from English to Hindi

                 Thesis submitted in partial fulfillment
                 of the requirements for the degree of

                           MS by Research
                                  in
              Computer Science with specialization in NLP

                                   by

                            Prashant Mathur
                              200502016
                    mathur@research.iiit.ac.in

                 Language Technology Research Center
            International Institute of Information Technology
                      Hyderabad - 500032, INDIA
                               October, 2011
Automatic Translation of Noun Compounds from English to Hindi
Copyright c Prashant Mathur, 2010
       All Rights Reserved
Automatic Translation of Noun Compounds from English to Hindi
International Institute of Information Technology
                                   Hyderabad, India

                                        CERTIFICATE

It is certified that the work contained in this thesis, titled “Automatic Translation of Noun Compounds
from English to Hindi” by Prashant Mathur, has been carried out under my supervision and is not
submitted elsewhere for a degree.

          Date                                                   Advisor: Dr. Soma Paul
Automatic Translation of Noun Compounds from English to Hindi
To Papa and Mummy
For being the world’s best parents
Acknowledgments

   I would like to express my deepest of gratitude for my advisor Dr. Soma Paul, without whom this
thesis would’nt have been possible. She always inspired me and had faith in my ability to rise to the
occasion and deliver the best work. She equally worked with me on the thesis as I did. She taught me
the course Computational Linguistics and since then I am working with her. I am also thankful to Prof.
Rajeev Sangal for his valuable comments on thesis. I started my journey in LTRC under the supervision
of Dr. V. Sriram on Machine Translation. Everytime I asked for his help he was there clearing my
doubts. Thanks for all the proof reading Sir and the inspiration you gave me that I ended up pursuing
Post-Graduation studies on Machine Translation.

   The only unsinkable ship is FRIENDSHIP. I would also like to thank my Lab mates and my friends
Vipul Mittal, Sambhav Jain, Bharat Ram, Himani, Karthik, Siva Reddy, Avinesh, Ravikiran, Samar
Hussain for creating such a wonderful working environment in LTRC. I would also like to thank my
batchmates Abhijeet, Raman, Piyush, Himank, Shrikant, Aditya, Manish, Subhashis, Karan, Chirag,
Maruti, Prashant K., Vibhav for being the friends they were. It would be disrespectful of me if I didn’t
mention my 5th year buddies Mahaveer, Kulbir Chacha, Saurabh, Rahul, Abhishek.
Last year wasn’t fun if there were no playing cards and computer games. I would really like to mention
my CS buddies and my clan IM M ORT ALS.
Above all I would like to mention my cousins and my dear ones Krishna Mathur, Devika, Chandni
Mathur, Deepika Mathur for being there at the time I needed them and sometimes on the other side of
the outburst. Love you all.

                                                   v
Abstract

    The present work attempts to build an automatic translation system of nominal compound (NC) from
English to Hindi. A noun compound is a sequence of nouns acting as a single noun, e.g., colon cancer,
suppressor protein, colon cancer tumor suppressor protein. They comprise 3.9% and 2.6% of all tokens
in the Reuters corpus and the British National Corpus (BNC), respectively. As of today, no good system
exists for the translation of multi-word expressions from English to any Indian languages. We have
evaluated two state-of-the-art systems, Moses and Google Translation system, to check the Noun Com-
pound translation accuracy from English to Hindi. Google translation system results in an accuracy of
57% while Moses, a statistical machine translation system, returns an accuracy of 48% on a test data of
300 Noun Compounds. The above figures indicate that automatic NC translation from English to Hindi
is an important subtask of machine translation system. We build a Noun Compound Translation system
(NCT) which returns an accuracy of 64% on the same set of test data.
This thesis examines two approaches for translation of Noun Compounds from English to Hindi. We
have done a manual study on 50K parallel sentences from English to Hindi and have found out that Noun
Compounds in English are translated into Noun Compound in Hindi in over 40% of the cases. In other
cases they are translated into varied syntactic constructs. Among them the most frequent construction
type is “Modifier + Post-Position + Head” which occurs in 35% of all the cases. Some examples are
“cow milk” → “gAya kA dUXa”, “wax work” → “mOMa para ciwroM”. This observation motivates
both the approaches for translation in the present thesis. The approaches are called in this work as: a)
Translation of NC by paraphrasing on source side and mapping the paraphrase to target construct and
b) Context based translation by searching and ranking translation candidate on target side.
In the first approach English nominal compounds are automatically paraphrased and the paraphrases are
translated into Hindi constructions. The paraphrasing is done with prepositions following [Lauer 1995]
approach of paraphrasing of nominal compound. For example, “cow milk” is paraphrased as ‘milk from
cow’, “blood sugar” is paraphrased as ‘sugar in blood’. Since English prepositions have one-to-one
mapping to post-position in Hindi, English paraphrases are easily translated into Hindi using the map-
ping schema. Assuming that lexical substitution for component nouns of the compound is correct, this
method examines how paraphrasing of English nominal compound acts as an aid for translation.
In the second approach, we, at first, generate translation templates for the target language. These tem-
plates are all possible Hindi construction types that English nominal compounds can be translated into.
Context based translation system take context into consideration while translating. We translate noun

                                                   vi
vii

compound by taking the sentence in which the compound occurs as the context. For example, the expres-
sion “finance minister” is the nominal compound to be translated in the sentence “The finance minister
declared the financial budget for this year”. Other content words in the sentence such as ‘declared’,
‘financial’, ‘budget’, ‘year’ form the context. We apply a Word-sense-disambiguation tool for selecting
the correct sense of the component nouns of NC in the given context. We use a bilingual dictionary to
get the Hindi translation of the component nouns in the sense selected by WSD tool. Thus context based
lexical substitution is accomplished for the target language. The output of lexical substitution is placed
in the translation templates and the resulted construction is searched on a Hindi indexed corpus of 28
million words. For ranking, a reference ranking based on the frequency of occurrence of the translate
candidates in full in the TL corpora is taken as baseline. To improve on the baseline, a stronger ranking
measure is borrowed from [Tanaka & Baldwin 2003b].
The context based translation system approach is adopted in the present work for building the noun
compound translation system (NCT) which is integrated to Moses. The outputs of Moses and Moses
integrated with NCT are compared. Evaluation of the system is carried out at two levels: by automatic
evaluation metric BLEU and by manual evaluation technique. The issue of automatic evaluation is
discussed in detail which motivates manual evaluation under the given circumstance.
Contents

Chapter                                                                                                     Page

1   Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
    1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    1
    1.2 Approaches to Machine Translation . . . . . . . . . . . . . . . . . . . . . . . . . . .         2
    1.3 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .       3
    1.4 Noun Compound in English and its Translation . . . . . . . . . . . . . . . . . . . . .          3
    1.5 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    5
    1.6 Various approaches for translation of Noun Compounds . . . . . . . . . . . . . . . . .          6
          1.6.1 Translation of NC by paraphrasing on source side and mapping the paraphrase
                 to target construct . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  6
          1.6.2 Context Based Machine Translation . . . . . . . . . . . . . . . . . . . . . . .         7
    1.7 Developing an Integrated MT System and its Evaluation . . . . . . . . . . . . . . . .           8
    1.8 Contribution of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .      9
    1.9 Chapterization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   10

2   Translation of Noun Compounds using paraphrasing on source side . . . . . . . . . .          .    .    . . 11
    2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .      .    .   .   11
    2.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    .    .   .   11
    2.3 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .     .    .   .   13
    2.4 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    .    .   .   17
    2.5 Paraphrasing of Noun Compounds . . . . . . . . . . . . . . . . . . . . . . . . .          .    .   .   17
    2.6 Experiments & Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .       .    .   .   19
    2.7 Translation of Noun Compounds . . . . . . . . . . . . . . . . . . . . . . . . . .         .    .   .   20
         2.7.1 Mapping English Preposition to Hindi Post-position . . . . . . . . . . .           .    .   .   20
    2.8 Translation of Noun Compounds: Experiments and Result . . . . . . . . . . . .             .    .   .   21
    2.9 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . .        .    .   .   23

3   Context Based Translation of Noun Compounds . . . . . . . . . . . . . . . . . . . .          .    .    . . 24
    3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .      .    .   .   24
    3.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    .    .   .   24
    3.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .      .    .   .   26
    3.4 Preparation of Data and Approach . . . . . . . . . . . . . . . . . . . . . . . . .        .    .   .   31
         3.4.1 Preparation of Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . .      .    .   .   31
         3.4.2 Generation of Translation Templates . . . . . . . . . . . . . . . . . . . .        .    .   .   31
         3.4.3 Sense Selection for components of Noun Compound . . . . . . . . . . .              .    .   .   32
         3.4.4 Corpus Search and Ranking . . . . . . . . . . . . . . . . . . . . . . . .          .    .   .   34

                                                   viii
CONTENTS                                                                                                        ix

    3.5   Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .           34
    3.6   Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .           36

4   Integration of Noun Compound Translator with Moses and its Evaluation . . . . . . .           .    .    . . 37
    4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .       .    .   .   37
    4.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .     .    .   .   37
    4.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .       .    .   .   38
    4.4 Data Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .       .    .   .   39
    4.5 Moses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .      .    .   .   40
    4.6 Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    .    .   .   43
    4.7 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .     .    .   .   45
    4.8 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . .         .    .   .   48

5   Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

    Appendix A: Templates for candidate generation . . . . . . . . . . . . . . . . . . . . . . . . 52

Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
List of Figures

Figure                                                                                                 Page

   1.1   Translation of Noun Compounds through Paraphrasing . . . . . . . . . . . . . . . . .            7
   1.2   Noun Compound Translator (NCT) . . . . . . . . . . . . . . . . . . . . . . . . . . .            8
   1.3   Moses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .     9

   4.1   Phrase based Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .       40
   4.2   Distance based Reordering : Reordering distance is measured on the foreign input side.
         In the illustration each foreign phrase is annotated with a dashed arrow indicating the
         extent of reordering. For instance the 2nd English phrase translates the foreign word 6,
         skipping over the words 4-5, a distance of +2. . . . . . . . . . . . . . . . . . . . . . .     42
   4.3   Lexicalized Reordering (Y-Axis : Source Phrase, X-Axis : Target Phrase) . . . . . . .          42

                                                    x
List of Tables

Table                                                                                                       Page

   2.1  Frequency of Paraphrases for “finance minister” resulted from Web search. . . . . . .                18
   2.2  Frequency of Paraphrases for “welfare agencies” resulted from Web search. . . . . . .                18
   2.3  Frequency of Paraphrases for antelope species after Web search. . . . . . . . . . . . .              19
   2.4  Paraphrasing Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .            19
   2.5  Distribution of Preposition on Lauer test data of 218 NC . . . . . . . . . . . . . . . .             20
   2.6  Comparison of our approach with Lauer’s Approach . . . . . . . . . . . . . . . . . .                 20
   2.7  Mapping of English Preposition to Hindi postposition from aligned English-Hindi par-
        allel corpora. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .         21
   2.8 Preposition-Postposition Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . .              22
   2.9 Translation Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .            22
   2.10 Translation Accuracy for some individual prepositions . . . . . . . . . . . . . . . . .              22

   3.1   Distribution of translations of English NC from English Hindi parallel corpora. . . . .             25
   3.2   Number of Senses Listed in Wordnet . . . . . . . . . . . . . . . . . . . . . . . . . . .            32
   3.3   Synset selected by WSD tool . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .           33
   3.4   Translation using bilingual dictionary . . . . . . . . . . . . . . . . . . . . . . . . . .          33
   3.5   Ranking using baseline frequency model . . . . . . . . . . . . . . . . . . . . . . . . .            35
   3.6   Ranking using CTQ Metric Model . . . . . . . . . . . . . . . . . . . . . . . . . . . .              35
   3.7   Ranking after inclusion of default genitive translation i.e.. X kA Y, X ke Y, X kI Y as
         templates. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .        36

   4.1   Corpus Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    .   .    39
   4.2   NC translation accuracy(Surface Level) on the test data. . . . . . . . . . . . . . . .     .   .    45
   4.3   BLEU scores on the test data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    .   .    46
   4.4   BLEU scores on the development set . . . . . . . . . . . . . . . . . . . . . . . . .       .   .    46
   4.5   5 point scale for Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   .   .    47
   4.6   Human Judgment score of translation of sentences and the NC Translation accuracy           .   .    48
   4.7   Performance of systems on top 3 constructions for NC Translation . . . . . . . . .         .   .    48

                                                    xi
Chapter 1

                                           Introduction

1.1    Introduction
    Machine Translation is a sub-field of Computational Linguistics that makes use of computers to
translate text/speech from one language to another. Efforts to build machine translation systems started
almost as soon as electronic computers came into existence. Computers were used in Britain to crack the
German Enigma code in World War II and decoding language codes is what we call machine translation
in today’s world. Warren Weaver, one of the pioneering minds in machine translation, wrote in 1947:

         When I look at an article in Russian, I say: “This is really written in English, but it has
      been coded in some strange symbols. I will now proceed to decode”.

Hindi is a widely spoken language and it is the principal official language of the Republic of India. On
the other hand, English is internationally popular language. In India, English as a language has played
a major role in administration, legal and education sector since British period. Presently, an awareness
has been developed in this country for using regional languages for government document writing, for
primary and higher education and every other domain of public life. In this context, it has become very
important to build system which can translate English to various Indian languages. With the existence
of huge text resources in internet and India being one of the most prominent users of web, even com-
mercial companies are finding it necessary to venture out for building machine translation.
With the emergence of India in the global market as one of the superior powers, and Hindi being its of-
ficial language, the language has reached to outside world too. English to Hindi translation has become
of great importance because the country has good trade relations with English spoken countries such
as USA, Australia, England and other European nations. English is an international industrial language
and companies from all around the world are investing in Indian market. English to Hindi translation has
become a necessity if the companies want to promote or pass information to those people who cannot
understand English. The coming of industries from abroad such as Quillpad, AIAIOO labs, MSR labs
has boosted the translation service industry in India, but as we know everyone wants the world to be
more automaticised and so the development of such tool is what one needs.

                                                    1
Whenever there is a language barrier between two individuals every time an intervention by a bi-lingual
speaker is required which is a costly task. We are striving to make the world an automated state and
so are we. An automated machine translation system can reduce the human task to quite an extent.
A typical Machine Translation system can be used anywhere where there are two different individuals
trying to communicate but non of them knows a common language which creates a communication gap.
Some of the popular translation systems that are being developed are Google Translate, Yahoo! Babel
fish, Bing-Translator, Moses [Koehn et al. 2007]. In India two of the major projects on machine trans-
lation are Anusaaraka, English to Indian Language Machine Translation Project(EILMT). Anusaaraka
is an English to Indian language accessing (translation) software, which employs algorithms derived
from Panini’s Ashtadhyayi (Grammar rules). The EILMT system aims to design and deploy a Machine
Translation System from English to Indian Languages in Tourism and Healthcare Domains. The project
is funded by Department of Information Technology, MCIT, Government of India. Shakti is also a ma-
chine translation system from English to Indian languages. It is currently being developed at Language
Technology Research Centre.
These applications on translation from one language to another has become so popular and widely used
that people are developing Mobile Applications such as Speech-to-Speech translator, Speech-to-text,
Text-to-Text on different OS platforms like Android, iOs, Symbian.

1.2    Approaches to Machine Translation
   Translation can be defined as a process in two steps :

   1. Decoding the meaning of source text.

   2. Re-Encoding the meaning in the target language.

There are 4 basic types of Machine Translation systems

   1. Rule Based Machine Translation : Translation process that uses the linguistic information of
      both source and target language such as morphological information, difference in syntactic be-
      haviour.

       (a) Transfer based Machine Translation : Translation process which builds intermediate repre-
           sentation of source text that captures its ”meaning” in order to generate the correct transla-
           tion. [Tsuji & Fujita 1991]
       (b) Interlingual Machine Translation : source text is transformed into an interlingual, i.e. source-
           /target-language-independent representation. The target language is then generated out of
           the interlingua. [Lampert 2004]
       (c) Dictionary based Machine Translation : Word by word translation using a bi-lingual dictio-
           nary. [Muegge 2006]

                                                    2
2. Statistical Machine Translation : Statistical Machine Translation is a typical example of Ma-
      chine Translation where we need statistical models which come from analyzing a parallel text
      corpora to predict translation of a sentence from one language to another. [Brown et. al. 1993]

   3. Example Based Machine Translation : In this translation is accomplished by decomposing a
      sentence into certain phrases, then by translating these phrases, and finally by properly composing
      these fragments into one long sentence. Phrasal translations are translated by analogy to previous
      translations. [Brown 1996]

   4. Hybrid Machine Translation : A combination of statistical, example-based and rule-based ap-
      proach that leverages the best use out of each system.

1.3    Problem Statement

    The thesis aims at automatic translation of English Noun Compound into Hindi within a sentence.
Here is an example:
English : The ‘coast guard’ was constituted by a ‘parliament act’ in the year 1978.
Hindi : 1978 meM bhArawa sarakAra ne ‘saMsada dvArA pArita adhiniyama’ ke aMtagarta ‘taTarak-
Saka dala’ kA gaThana kiyA .
The expression parliament act is translated to saMsada dvArA pArita adhiniyama while coast guard
translates to tatarakRaka dala. In the first case, the source language compound is translated as a phrasal
construction in Hindi; while in the second case, the noun compound remains as a noun compound in
Hindi.
We attempt to develop a system that automatically translate English nominal compound into Hindi. The
tool is named as Noun Compound Translation system (NCT). We also integrate the NCT tool with a
state-of-the-art translation system, Moses and evaluate the integrated System.

1.4    Noun Compound in English and its Translation

    Compounds have regularly received attention of linguists. Noun compounds are abundant in English
and pose an important challenge for the automatic analysis of English written text. A two word noun
compound (henceforth NC) is a construct of two nouns, the rightmost noun being the head (H) and the
preceding noun the modifier (M). The noun constituents together act as a single noun [Downing 1977].
Some examples are ‘cow milk’, ‘road condition’, ‘machine translation’, ‘colon cancer’, ‘suppressor
protein’ and so on. A noun compound can have a more complex structure as illustrated in ‘customer
satisfaction indices’, ‘wrought iron office chair’ ‘routine health check up’, ‘fire and rescue service de-
partment’ and so on. The analysis of the aforementioned noun compounds can be done as follows:

                                                    3
1. ‘Customer satisfaction indices’: [[Customer satisfaction] indices]

   2. ‘wrought iron office chair’: [[wrought iron] [office] chair]

   3. ‘routine health check up’: [routine [health [check up]]]

   4. ‘fire and rescue service department’: [[[[fire] and [rescue]] service] department]

The complex structure of noun compound establishes the fact that analysis and translation of these com-
pounds from one language to another is not an easy task. The translation becomes significantly difficult
in those cases when the source language NC is represented in a varied manner in the target language
as is the case with English - Hindi language pair. We have done a manual study on the BNC corpus in
which we have found that English noun compounds can be translated in Hindi in following varied ways:

   1. Noun Compound

       (a) Hindu Texts → hiMdU SAstroM

       (b) Milk Production → dugdha utpAdana

   2. Genetive Construction

       (a) Rice Husks → cAval kI bhUsI

       (b) Room Temperature → kamare kA tApamAna

   3. Adjective Noun Construction

       (a) Nature Care → prAkritika cikitsA

       (b) Hill Camel → pahARI UMTa

   4. Other Syntactic Phrase

       (a) Wax Work → mom par citroM

       (b) Body Pain → SarIr meM dard

   5. One Word

                                                    4
(a) Cow Dung → gobara

   6. Others

         • Hand Luggage → hATa meM le jAye jAne vala sAmAn.

When the compound is made up of more than two words, the translation issues also become more acute
sometimes. For example, let us study following cases of translation:

   1. ‘body pain cure’ → ‘SarIra meM darda kA ilAja’,

   2. ‘state forest department’ → ‘rAjya vana vibhAga’

   3. ‘Hindi language alphabet’ → ‘hindI bhASa kI varNamAlA’.

In first case “E1 E2 E3 ” is translated into “H1 post-position H2 post-position H3 ”, in second case the
compound “E1 E2 E3 ” remains a compound “H1 H2 H3 ”; while in the third case “E1 E2 E3 ” is translated
as “H1 H2 post-position H3 ”. In the present work, we will be handling only bigram noun compound
for translating them into Hindi.
Furthermore, compounding is an extremely productive process in English. The frequency spectrum of
compound types follows a Zipfian or power-law distribution, so in effect many compound tokens en-
countered belong to a “long tail” of low-frequency types. Over half of the two-noun compound types
in the British National Corpus occur just once. Taken together, the factors of low frequency and high
productivity means that achieving a robust on noun-compound interpretation is an important goal for
broad-coverage in semantic processing.

1.5    Motivation
    Noun Compounds constitutes an important part of English running texts. Translation of Noun Com-
pounds is a very widely researched topic. No work specific to noun compound translation from English
to Hindi has been attempted till now. We have given some sentences with NC for translation to the
EILMT system. The output that the system returns is the following:
English : ‘Keep the room temperature to 30 degrees, I am suffering from body pain .’
Hindi : ‘30 digrI ko kakSa tApamAna ko rakheM, meM SarIr dard se kaSta bhugawa rahA hUM’
In this example the translations of the Noun Compounds ‘room temperature’ and ‘body pain’ is not
correct. The performance of existing translation system makes the point clear that there exists no sat-
isfactorily efficient Noun Compound translation tool from English to Hindi although the need of one is
unprecedented in the context of machine translation. This observation motivates the present work to de-
velop English-Hindi noun compound translator tool. Similar algorithm will work for Indian languages
which are closely related to Hindi.

                                                   5
1.6     Various approaches for translation of Noun Compounds

   There are various approaches for translation of Noun Compounds such as Transfer based machine
translation, Memory Based Machine Translation, Word-to-word compositional MT etc. The review
of translation of noun compound as discussed in the previous section motivates us to implement two
approaches both of which are corpus driven statistical approach. The two approaches can be described
as a) Translation by paraphrasing on source side and mapping the paraphrase to target construct and
b) Context based translation by searching and ranking translation candidate on target side. The two
approaches are introduced here and their implementation will be described in chapter 2 and 3.

1.6.1   Translation of NC by paraphrasing on source side and mapping the paraphrase to
        target construct

    We develop a system that uses paraphrase on source side as an aid to translate a Noun Compound
from English to Hindi. In this work we present a way of using paraphrasal interpretation of English
nominal compound for translating them into Hindi. Input Nominal compound is first paraphrased auto-
matically with the 8 prepositions as proposed by [Lauer 1995] for the task. The detail of the process is
described in Chapter 2. English prepositions have one-to-one mapping to post-positions in Hindi. We
obtain an accuracy of 71% over a set of gold data of 250 Nominal Compound. The translation-strategy
is motivated by the following observation: It is only 50% of the cases that English nominal compound
is translated into nominal compound in Hindi. In other cases, they are translated into varied syntac-
tic constructs. Among them the most frequent construction type is “Modifier + Postposition + Head”.
The translation module also attempts to determine when a compound is translated using paraphrase and
when it is translated into a Nominal compound.
The following flow chart describes the approach: In this method we assume the translation of compo-
nents nouns to be correct because our objective of conducting the experiment is to evaluate the accuracy
of translation of paraphrase construct. We implement another method for translation of Noun Com-
pounds which is described in chapter 3.

                                                   6
Figure 1.1 Translation of Noun Compounds through Paraphrasing

1.6.2   Context Based Machine Translation

   In Context Based Machine Translation we translate a Noun Compound taking the sentence as the
context. The translation is carried out in two steps:

   1. Context information is utilized for correct lexical substitution of components nouns of English
      NC.

   2. Hindi templates for potential translation candidates are generated which are searched in the target
      language data.

Let us consider the following example, Soil on the river bank eroded due to the flood. The expression
river bank is the noun compound and the content words make the context (in this case: soil, river, bank,
eroded, flood). We use a Word-sense-disambiguation tool [Patwardhan et. al. 2005] to derive the sense
of the component nouns used in the compound in the given context. We will see in Chapter 3 that cor-
rect sense selection of component nouns significantly improves the translation of the compound. Once
correct sense is selected for component nouns and the nouns are substituted in Hindi for that sense, the
substituent words are fit in translation template to generate the translation candidates. These transla-
tion candidates are searched in Web and the best translation candidate selected by ranking method (see
chapter 3) is taken to be the translation of the noun compound.
In this work we present an architecture of a “Context Based Machine Translation - Noun Compound
Translator” that has been able to give an accuracy of 57% accuracy over a test set of 200 Noun Com-

                                                   7
pounds. Figure 1.2 represents functionality of Noun Compound Translator tool. We integrated the NCT
tool with state-of-the-art translation system, Moses as discussed in next section.

                           Figure 1.2 Noun Compound Translator (NCT)

1.7    Developing an Integrated MT System and its Evaluation

   We present a translation system by combining the state-of-the-art machine translation tool, Moses
and Noun Compound Translator. The need to do this integration arises from our urge to know whether
the output of NCT brings in any improvement to the overall quality of sentence translation when the
system is plugged in to a full-fledged machine translation system. NCT is a phrase based system and
hence integrating it with another phrase based system ,which Moses is, makes the integration an easier
task than integrating it with a Syntax based SMT or any other SMT systems such as Example-based MT,
Tree based MT. Integrated system is built by combining Moses phrase based model and the translations
from NCT system. Moses decoder uses the enhanced model for translation of sentences.
The integrated system is also evaluated both manually and automatically. We will evaluate the systems
on test data of 300 parallel sentences containing Noun Compounds(NCs) and its translation.
Figure 1.3 represents the functionality of the Moses system combined with NCT tool.

                                                  8
Figure 1.3 Moses

1.8   Contribution of the Thesis
  1. Two approaches for translation of Noun Compounds are presented and compared with state of the
     art system.

      (a) Proposed a new approach for translation of Noun Compounds by paraphrasing on source
          side.
      (b) Developed a context based machine translation system (Noun Compound Translator) for
          English-Hindi language pair.

  2. The principle mechanism of two approaches are different

      (a) Paraphrases of NCs are searched on English web corpus i.e. search source language corpus.
      (b) Hindi translated candidates are searched on indexed hindi corpus i.e. search on target lan-
          guage corpus.

  3. NCT tool is built as a module that can easily be integrated with any working machine translation
     system.

  4. The system is integrated with state-of-the-art Statistical machine translation system, Moses and
     evaluated.

                                                 9
5. Evaluation of the system is carried out at two levels: by automatic evaluation metric BLEU and
      by manual evaluation technique.

1.9    Chapterization
   This thesis is divided in a number of Chapters. Chapter 2 explains the approach of translating para-
phrase of English noun compound into Hindi. Chapter 3 describes Context based MT system that we
have integrated with a full-fledged MT system, Moses. The integration is described in chapter 4. The
evaluation report of translation after integration is also presented in this chapter. Finally the conclusion
chapter summarizes our work and discusses the future task.

                                                    10
Chapter 2

      Translation of Noun Compounds using paraphrasing on source side

2.1     Overview
    This chapter argues that paraphrasing of source language noun compound can be used as an aid in
translation of English noun compounds in Hindi. We have discussed and implemented paraphrase by
preposition method for interpreting noun compound in this chapter. The result of the implementation is
also reported. For example, the meaning of nominal compound ‘carbon deposit’ and ‘wax work’ can be
conveyed as ‘deposit of carbon’ and ‘work on wax’.
In this method, we have automatically paraphrased English noun compounds with 8 prepositions as
proposed by [Lauer 1995]. We have used the paraphrase frequency found on the web as the base for
scoring that paraphrase. We have developed one-to-one mapping of English preposition to post-position
in Hindi. Finally we have translated the top 3 paraphrases into Hindi using a mapping schema and
bilingual dictionary. We have got an accuracy of 71% over a set of gold data of 250 Noun Compounds.
We have assumed that lexical substitution of constituent nouns of the compound is correct during eval-
uation because our objective of conducting the experiment is to evaluate the accuracy of translation of
paraphrase construct. The translation-strategy is motivated by the following observation: It is only 50%
of the cases that English noun compound is translated into noun compound in Hindi. In other cases,
they are translated into varied syntactic constructs. Among them the most frequent construction type is
“Modifier + Postposition + Head”. The translation module also attempts to determine when a compound
is translated using paraphrase and when it is translated into a Noun compound.

2.2     Introduction
    Noun compounds are abundant in English and compounding is a highly productive phenomenon.
[Baldwin & Tanaka 2004] has calculated that noun compounds comprise 3.9% and 2.6% of all tokens
in the Reuters corpus and the British National Corpus (BNC), respectively. The frequency spectrum of
NCs follows a Zipfian or a power-law distribution, so in effect many compound tokens encountered in a
text belong to a “long tail” of low frequency types as described in [Seaghdha 2008]. It is difficult to list

                                                    11
in a dictionary all compounds that are likely to be encountered. [Tanaka & Baldwin 2003b] has reported
that even for relatively frequent noun compounds that occur ten or more times in BNC, static English
dictionaries provide only 27% coverage.
Understanding the syntax and semantics of noun compounds is difficult but important for many natural
language applications including machine translation. Rackow [Rackow et al. 1992] has observed two
main issues in translating a noun compound from source language to target language (a) correctness in
the choice of the appropriate target lexeme during lexical substitution and (b) correctness in the selec-
tion of the right target construct type. The issue stated in (a) involves correct selection of sense of the
component words of NCs followed by substitution of source language word with that of target language
that best fits for the selected sense [Mathur & Paul 2009]. From the perspective of machine translation,
the issue of selecting the right construct of target language becomes very significant because English
NCs are translated into varied construct types in Hindi.
We have done a manual study on 50K parallel sentences from English to Hindi and have found out that
Noun Compounds in English are translated into Noun Compound in Hindi in over 40% of the cases.
In other cases they are translated into varied syntactic constructs. Among them the most frequent con-
struction type is “Modifier + Post-Position + Head” which occurs in 35% of all the cases (for illustration
see Chapter 1). The paraphrases can facilitate the translation in those cases. For example, let us take
the compound ‘wax work’ and ‘carbon deposit’. The paraphrase of these compounds is ‘work on wax’
and ‘deposit of carbon’. Since English preposition can mostly be mapped to Hindi postposition in one-
to-one manner, the paraphrases can efficiently be translated to correct Hindi constructions as follows:
‘work on wax’ → ‘mOm para citra’ (wax on work) and ‘deposit of carbon’ → ‘kOyle kA bhandAra’
(coal of deposit).
  Currently there exist two different approaches in Computational Linguistics for interpretation of
Noun Compounds. They are:

   1. Labeling the semantics of compound with a set of abstract relations [Girju et. al. 2003]
           Eg. Chocolate Bar → bar made of chocolate (MAKE - relation)

   2. Paraphrasing the compound in terms of syntactic constructs.

Paraphrasing, again, is done in three ways:

   1. With prepositions (“war story” → “story about war”) [Lauer 1995].

   2. With verb+preposition nexus (“war story” → “story pertaining to war”, “noise pollution” → “pol-
      lution caused by noise”) [Finin 1980], [Nakov & Hearst 2005], [Nastase & Szpakowicz 2003].

   3. With Copula (“tuna fish” → “fish that is tuna”) [Vanderwende 1995].

We have implemented the first approach for paraphrasing.
This chapter motivates the advantage of expanding English noun compounds into “paraphrases with

                                                    12
prepositions” for translating them into Hindi. “Paraphrasing with Preposition” has the following ad-
vantages: (a) Annotation is simpler; (b) Learning is easier and (c) Data sparseness is less; (d) Most
importantly, English prepositions have one to one Hindi postposition correspondents most of the times.
Therefore we have chosen the strategy of “paraphrasing with prepositions” over “Paraphrasing with
Verbs+Preposition” approach for the task of translation. The chapter explores the possibility of main-
taining one to one correspondence of English-Hindi preposition-postpositions and examines the accu-
racy of translation. The English NCs are paraphrased using 8 prepositions as mentioned in [Lauer 1995].
The task of translating English NCs into Hindi syntactic constructs is divided in two levels

   1. Paraphrasing by Preposition

       (a) Paraphrase the NC using 8 prepositions.
       (b) Paraphrasing candidates are searched in Web corpus.
       (c) An algorithm is devised to determine when the paraphrase is to be ignored and the source
           language NC to be translated as NC or transliterated in NC.

   2. Translation

       (a) English preposition is replaced by Hindi corresponding postposition.
       (b) Bi-lingual dictionary is used to translate the Noun components.

We have manually created a data set of 250 Noun Compounds extracted from BNC corpus for the pur-
pose of evaluation of paraphrasing with preposition method & subsequent translation. The gold standard
data consists of paraphrases of each noun compound and their translation.

   The chapter is divided into following sections. The next section presents a review of related works,
the attempts made for automatic paraphrasing of noun compound. Section 2.4 discusses the method of
preparing gold data for evaluation. Section 2.5 describes the implementation of paraphrasing approach.
The result of paraphrasing is presented in Section 2.6. The schema of mapping English preposition to
Hindi post-position is described in Section 2.7.1. This schema is used for translating English paraphrases
into Hindi constructions. Finally an evaluation result of translation is reported in section 2.7.

2.3    Related Works
   This section surveys various methods of paraphrasing noun compounds as listed in previous section.
Paraphrasing English NCs is a widely studied issue and this section will predominantly study those
works. Scholars (such as [Levi 1978], [Finin 1980]) agree there is a limited number of relations that oc-
cur with high frequency among the constituents of noun compounds. However, the number and the level
of abstraction of these frequently used semantic categories are not agreed upon. They can vary from a

                                                   13
few prepositional paraphrases [Lauer 1995] to hundreds and even thousands more specific semantic re-
lations [Finin 1980]. [Lauer 1995], uses eight prepositions for paraphrasing nominal compounds. They
are: of, for, with, in, on, at, about, and from. For example, the noun compound “bird sanctuary”, can be
interpreted both as “sanctuary of bird” and “sanctuary for bird”.
The automatic interpretation of noun compounds is a difficult task for both unsupervised and supervised
approaches. Currently, the best-performing NC interpretation methods in computational linguistics fo-
cus only on two-word noun compounds and rely either on rather ad-hoc, domain-specific, hand-coded
semantic taxonomies, or on statistical models on large collections of unlabeled data.
The majority of corpus based statistical approaches to noun compound interpretation collects statis-
tics on the occurrence frequency of the noun constituents and uses them in a probabilistic model
([Resnik 1993]; [Lapata & Keller 2004]).

   [Lauer 1995] is the first scholar to devise and test an unsupervised probabilistic model for noun com-
pound interpretation on Grolier encyclopedia, an 8 million word corpus, based on a set of 8 prepositional
paraphrases. His probabilistic model computes the probability of a preposition p given a noun-noun pair
n1-n2 and finds the most likely prepositional paraphrase

                                       p∗ = argmaxP (p|n1, n2)                                        (2.1)

    The ultimate goal of [Lauer 1995] work is to perform semantic analysis of arbitrary noun com-
pounds, although the experiments were done on two word noun compounds that can have prepositional
paraphrase. A probabilistic model for noun compound paraphrasing is developed based on meaning dis-
tributions theory. The model combines information from both head and modifier in order to determine
the semantic relation between them. Compounds can be paraphrased with three possible groups :

   1. Copula : fish that is a tuna

   2. Verbal-nexus compounds or nominalization : construction of buildings

   3. Prepositional Compounds : bars of steel, stories about war

Lauer has experimented with Prepositional Compounds only. He has used Warren’s(1978) study to
construct the list of possible prepositions. The study yielded 7 prepositions ‘of’, ‘for’, ‘in’, ‘at’, ‘on’,
‘from’, ‘with’, but Lauer has included one more preposition about because these preposition constitutes
about 3.6% of total cases. Thus, there were 8 prepositions that Lauer has used in this experiment for
paraphrasing.
    Lauer has addressed the following issue in his work: “Given a Noun Compound which preposition
among the list of 8 prepositions is most likely to occur as a paraphrase of the compound.” To build a
statistical learner for this problem he has given a probabilistic model that compute the probability of 8

                                                    14
prepositions and gives the most likely one. The function is defined as

                      (∀n1 , n2 ∈ N )(A(n1 , n2 ) = argmaxp∈P P r(p|n1 , n2 ))                              (2.2)
With assumptions such as:

   1. Probability of modifier is independent of the head.

   2. Each role/preposition is equally likely.

We get the following equation.
                                                                X
      (∀n1 , n2 ∈ N )(A(n1 , n2 ) = argmaxp∈P                                      P r(c1 |p)P r(c2 |p)))   (2.3)
                                                       c1 ∈φ1 (n1 );c2 ∈φ1 (n2 )

where φ is the syntactic mapping from concepts (c1 , c2 ∈C)toN ouns(n1 , n2 ). The steps to build the
paraphrasing technique are the following:

   1. Build a gold data set of two word noun compounds with their correct paraphrase.

   2. Define a pair of patterns for extracting examples of the nouns modified by, and the nouns governed
      by, eight prepositions.

   3. Estimate P r(c1 |p) and P r(c2 |p) for each preposition.

   4. Use the distribution to estimate the best preposition for paraphrasing.

This method has resulted in an accuracy of 40% on the test set of 400 Noun Compounds developed by
Lauer. We have used the same 8 prepositions for paraphrasing on our test data of Noun Compounds.

[Lapata & Keller 2004] has demonstrated that web counts can be used to approximate bigram frequen-
cies. They have applied this method in Interpretation of Compounds Nouns. For majority of tasks,
they have found that n-gram frequencies obtained from web are better than that of obtained from large
corpus. Keller and Lapata has shown in their previous work (’03) that the web frequencies

   1. correlate with the frequencies obtained from edited BNC corpus(100M words).

   2. correlate with the frequencies recreated using smoothing methods.

   They have used the model proposed by Lauer
                                             X
                      p∗ = argmaxp                                     P r(t1 |p)P r(t2 |p)                 (2.4)
                                         t1 ∈cats(n1 );t2 ∈cats(n2 )

where t1 and t2 represents concepts. This model is tested by Lauer on both concept level and Lexical
level, however, in this chapter they worked on lexical version. Unlike using normal queries, they have
preferred using augmented queries and thus generated three types of queries :

                                                      15
1. Literal Queries : use the quoted n-gram directly

   2. Near Queries : Use of a NEAR b type queries to expand the n-gram where NEAR stands for a
      window of 10 words.

   3. Inflected Queries : that uses all morphological forms in an expanded query. e.g. “history change”
      can be expanded as “histories change”, “history changed” ....
Lapata and Keller has shown that usage of Web as a target corpus and of inflected queries can certainly
improve the performance of the system. Their system return an accuracy of 55.71% when search for
the queries is done on Altavista search engine. Since this method has shown a great improvement in
paraphrasing accuracy and has established a correlation between web counts and BNC corpus, we have
borrowed [Lapata & Keller 2004] method for paraphrasing with some additional features.
[Nakov & Hearst 2005] has shown that a hidden noun-noun semantic relation can be expressed explic-
itly with the help of Web search engine queries. Queries are those construction that expresses a Noun
Compound as a relative clause containing wild card characters to be filled with a verb. For example,
a compound like tear gas can be interpreted as gas that causes tears, gas that brings tears or gas that
produces tears. For extracting verbs from these type of paraphrases they have devised the following
queries and have used Google to extract the snippets:
gas THAT * tear where, THAT stands for ‘that’, ‘which’, ‘who’.
The query can be generalized into: n2 that * n1 where n1 n2 are the noun components and ‘*’ stands for
one to eight words. Issuing this query they have retrieved first 1000 snippets and have considered only
those snippets for which the sequence of words following n1 is non-empty and contains at least one
non-noun. Then they have shallow parsed the snippets to extract the verbs and the following preposi-
tion. They have also ensured that there is only one verb phrase between verb and n1 to disallow complex
clauses.

    Another work that subsequently overlaps with paraphrasing is that of [Kim & Baldwin 2006]. They
have worked on interpretation of Noun Compound via Verb Semantics. They have developed an auto-
matic method for interpreting NC based on semantic relations. A set of seed verbs is used to represent a
single semantic relation; for example NC within the sentence The bald headed guy owns the Mercedes
is taken as an evidence of POSSESSOR relation. Their work can be divided in number of steps: First, 3
huge corpora (BNC, WSJ and Brown) are taken. Sentences are parsed using RASP parser. Second, sen-
tences are filtered by selecting only those one which have “H”(Head) and “M”(Modifier). Third, verbs
are extracted from the sentence and are mapped to the seed words using Wordnet::Similarity Measure
and thus generate the templates like

                               S(have, own, possessV , M SU BJ , H OBJ )                            (2.5)

where V is the set of seed words. Fourth, a set of semantic relations is retrieved from which they select
the best-fitting relation using TiMbl Classifier.

                                                   16
There exists no work which has attempted to translate nominal compound by paraphrasing them first
and translating the paraphrases in target language next. In this regard, the approach proposed is first of
its kind to be attempted. We will describe our method of paraphrasing in Section 2.5.

2.4    Data
   We have created two data sets for evaluating our algorithm. One set contains 218 nominal compounds
which originally occurs in Lauers test set of 400 noun compounds [Lauer 1995]. We have taken this data
to compare our system with that of Lauers using his data set. To evaluate the quality of paraphrasing
& translation of NC together we have created our own dataset since no reference data is available for
such task. The second test set consists of 250 bigram noun compounds which are extracted from BNC
corpus [Lou Burnard 2000]. This data is manually paraphrased on source side and translated into Hindi
for evaluating both paraphrasing and translation techniques. BNC has varied amount of text ranging
from news- paper article to letters, books etc.

2.5    Paraphrasing of Noun Compounds
   This section describes the procedure of paraphrasing by preposition that we have adopted. The
system is comprised of the following stages:

   1. Generating prepositional paraphrasing candidates for English NC using brute force.

   2. Giving paraphrase as a query in Web search

   3. Filtering the result using the heuristics

   4. Selection of the best paraphrase.

Each noun compound is paraphrased with all eight prepositions under consideration. Our examination
of parallel corpora (for details see Chapter 3) has revealed the fact that English NC remains a NC in
Hindi for 40% times. This observation motivates us to design an algorithm that will determine whether
an English NC is to be translated as an analytic construct or retained as an NC in Hindi. We have
used Yahoo! search engine for querying “Head Preposition Modifier” in the web for a given input NC
(“Modifier Head”) to get the frequency of the paraphrase. For example, paraphrases for the NC “finance
minister” and their frequency are given in Table 2.1:
    In the above table we notice that the distribution is widely varied. For some paraphrase the count
is very low (for example ‘minister about finance’ has count 2) while the highest count is 5420000 for
‘minister of finance’. The wide distribution is apparent even when the range is not that high as illustrated
in Table 2.2 :

                                                    17
Paraphrase                Web Frequency
                             minister about finance                2
                             minister from finance                16
                             minister on finance               34300
                             minister for finance            1370000
                             minister with finance                43
                             minister by finance                  20
                             minister to finance                 508
                             minister in finance                 335
                             minister at finance                  64
                             minister of finance             5420000

        Table 2.1 Frequency of Paraphrases for “finance minister” resulted from Web search.

                            Paraphrase                 Web Frequency
                            agencies about welfare                 1
                            agencies from welfare                 16
                            agencies on welfare                   64
                            agencies for welfare                 707
                            agencies with welfare                 34
                            agencies in welfare                  299
                            agencies at welfare                    0
                            agencies of welfare                   92

        Table 2.2 Frequency of Paraphrases for “welfare agencies” resulted from Web search.

   During our experiment we have come across three typical cases: (a) No paraphrase is available when
searched; (b) Frequency counts of some paraphrases for a given NC is very low and (c) Frequency of
a number of paraphrases cross a threshold limit. The threshold is set to be mean of all the frequencies
of paraphrases. Each of such cases signifies something about the data and we build our translation
heuristics based on these observations. When no paraphrase is found in web corpus for a given NC, we
consider such NCs very close-knit constructions and translate them as noun compound in Hindi. This
generally happens when the NC is a proper noun or a technical term. Similarly when there exists a
number of paraphrases each of those crossing the threshold limit, it indicates that the noun components
of such NCs can occur in various contexts and we select the first 3 paraphrase as probable paraphrase
of NCs for such cases. For example, the threshold value for the NC finance minister is: Threshold =
6825288/8 = 853161. The two paraphrases considered as probable paraphrase of this NC are therefore
“minister of finance” and “minister for finance”. The remaining paraphrases are ignored. When count
of a paraphrase is less than the threshold, they are removed from the data. We presume that such low
frequency does not convey any significance of paraphrase. On the contrary, they add to the noise for

                                                  18
probability distribution. For example, all paraphrases of “antelope species” except “species of antelope”
is very low as shown in Table 2.3. They are not therefore considered as probable paraphrases.

                             Paraphrase                 Web Frequency
                             species about antelope                 0
                             species from antelope                 44
                             species on antelope                   98
                             species for antelope                   8
                             species with antelope                 10
                             species in antelope                    9
                             species at antelope                    8
                             species of antelope                60600

              Table 2.3 Frequency of Paraphrases for antelope species after Web search.

2.6    Experiments & Results
   For a given NC we have used a brute force method to find the paraphrase structure. We have used
Lauer’s prepositions (of, in, about, for, with, at, on, from, to, by) for prepositional paraphrasing. Web
search is done on all paraphrases and frequency counts are retrieved. Mean frequency (F) is calculated
using all frequencies retrieved. All those paraphrases that give frequency more than F are selected. We
have first tested the algorithm on 250 test data of our selection. The result of the top three paraphrases
are given below :

                                   Selection Technique      Precision
                                   Top 1                       61.6%
                                   Top 2                      67.20%
                                   Top 3                       71.6%

                                   Table 2.4 Paraphrasing Accuracy

   We have also tested the algorithm on Lauer’s test data (first 218 compounds out 400 of NCs) and have
obtained the following results (see Table 2.5 ). Each of the test data is paraphrased with a preposition
which best explain the relationship between the two noun components. Lauer gives X for compounds
which cannot be paraphrased by using prepositions For eg. tuna fish.
   OLauer : Number of occurrence of each preposition in Lauer test data
OCI : Number of correctly identified preposition by our method
   In Table 2.6 we compare our result with that of Lauer’s on his data. We have given the results
maintaining the following criteria:

                                                   19
Prep    OLauer    OCI     Percentage
                                 Of      54        37      68.50%
                                 For     42        20      47.62%
                                 In      24        9       37.50%
                                 On      6         2       33.33%

                  Table 2.5 Distribution of Preposition on Lauer test data of 218 NC

   1. only “N prep N” is considered.

   2. Non-Prepositions (X) are also considered which are referred to as ‘All’ in the Table 2.6.

                                   Case       Our Method      Lauer’s
                                 N-prep-N       43.67%        39.87%
                                    All          42.2%         28.8%

                    Table 2.6 Comparison of our approach with Lauer’s Approach

2.7     Translation of Noun Compounds
    For translation of Noun Compounds we first paraphrase the compounds with preposition as described
in Section 2.5. The paraphrases are then translated into Hindi.
In many cases English prepositions can be semantically overloaded. For example, the NC “Hindu
law” can be paraphrased as “law of Hindu”. This paraphrase can mean “Law made by Hindu” (not
for Hindu people alone though) or “Law meant for Hindu” (law can be made by anyone, not by the
Hindus necessarily). Such resolution of meaning is not possible from “preposition paraphrase”. The
chapter argues that this is not an issue from the point of view of translation at least. It is because
the Hindi correspondent of ‘of’, which is ‘kA’, is equally ambiguous. The translation of “Hindu law”
is “hinduoM kA kAnUn” and the construction can have both aforementioned interpretations. Human
users can select the right interpretation in the given context. The next subsection describes the mapping
of English preposition to Hindi post-position in details.

2.7.1   Mapping English Preposition to Hindi Post-position
   The strategy of mapping English preposition to Hindi post-position is a crucial one for the present
task of translation. The decision is mainly motivated by a preliminary study of aligned parallel corpora
of English and Hindi in which we have observed the following distribution of translation probabilities
of Lauer’s 8 prepositions as shown in Table 2.7.

                                                   20
Prep     Post-Pos          Sense         Prob.
                                      kA                Possession    0.13
                                      ke                Possession    0.574
                             of
                                      kI                Possession    0.29
                                      se                Possession    0.002
                             from     se                Source        .999
                                      meM               Location      0.748
                             at
                                      par               Location      .219
                                      se                Instrument    0.628
                             with
                                      ke sAtha          Association   0.26
                                      par               Loc./Theme    0.987
                             on
                                      ko                Theme         0.007
                             about    ke bAre meM       Subj.Matter   0.68
                             in       meM               Location      .999
                                      ke lie            Beneficiary   0.72
                             for
                                      ke                Possession    0.27

Table 2.7 Mapping of English Preposition to Hindi postposition from aligned English-Hindi parallel
corpora.

    Table 2.7 shows that English prepositions are mostly translated into one Hindi postposition except
for a few cases such as “at”, “with” and “for”. The probability of “on” getting translating into “ko” and
“of” into “se” is very less and therefore we are ignoring them in our mapping schema. The preposition
“at” can be translated into “meM” and “para” and both postpositions in Hindi can refer to “location”.
However, the two prepositions “with” and “for” can be translated into two distinct relations as shown in
Table 2.7. From our parallel corpus data, we therefore find that these prepositions are semantically over-
loaded from Hindi language perspective. The right sense and thereafter the right Hindi correspondent
can be selected in the context. In the present task, we are selecting the mapping with higher probability.
English Prepositions are mapped to one Hindi Post-position for all cases except for “at” and “about”.
The final correspondence as used in the present work is given in Table 2.8.
    Hindi post-positions can be multi-word as in “ke bAre meM”, “ke liye” and so on as shown in Table
2.8. For the present study, lexical substitution of head noun and modifier noun are presumed to be cor-
rect.

2.8    Translation of Noun Compounds: Experiments and Result
   In this section we check for the accuracy of translation of English paraphrases of NCs into Hindi.
For this task we have used the gold standard paraphrase data of 250 noun compounds which we have
manually created. The same set of noun compound is given to Google translator for translation. We

                                                   21
You can also read