                A Brief View of the Google Translate Machine
                 Omid Karami, Student of Vienna University of Technology
                        (e-mail: e1129944@student.tuwien.ac.at)

  Abstract—This paper briefly describes the history of Google's machine translation service and some of the methods used in it. It covers RbMT and SMT, the two major machine translation techniques, presents one translation model in a naive way together with two decoding algorithms, and reviews some basic formulas that are widely used in the statistical method.

  Index Terms—Google MT, machine translation, MT, RbMT, SMT.

                    I. INTRODUCTION

   Machine translation (MT) is automated translation: the process by which computer software is used to translate a text from one natural language (such as English) into another (such as German).
   To produce any translation, human or automated, the meaning of a text in the original (source) language must be fully restored in the target language. Although this seems straightforward, it is really complex. Translation is not just word-for-word substitution. A translator must interpret and analyze all of the elements in the text and know how each word may influence another. This requires extensive expertise in grammar, syntax (sentence structure), and semantics (meaning) in both the source and target languages, as well as familiarity with each local region.
   The challenge for machine translation, compared with human translation, is to improve quality to the point where it produces publishable translations.
   There are three common machine translation technologies in commercial use today:

  A. Rule-Based MT
   Rule-Based Machine Translation (RbMT) systems use large collections of rules, manually developed over time by human experts, which map structures from the source to the target language.
   The software parses text and creates a transitional representation from which the text in the target language is generated (a toy sketch of this pipeline appears at the end of this subsection). This process requires extensive lexicons with morphological, syntactic, and semantic information, and large sets of rules. The software applies these complex rule sets to transfer the grammatical structure of the source language into the target language.
   Translations are built on gigantic dictionaries and sophisticated linguistic rules. Users can improve the out-of-the-box translation quality by adding their own terminology to the translation process: they create user-defined dictionaries which override the system's default entries.
   In most cases there are two steps: an initial investment that significantly increases the quality at a limited cost, and an ongoing investment to increase quality incrementally. While rule-based MT brings companies to the quality threshold and beyond, the quality improvement process may be long and expensive.
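   As a minimal illustration of the parse/transfer/generate pipeline described above, the following Python sketch translates a tiny fragment of English into French using a hand-written lexicon and a single reordering rule. The lexicon, the tags, and the rule are all invented for illustration; a real RbMT system performs far richer morphological and syntactic analysis.

    # Toy rule-based transfer: dictionary lookup plus one reordering rule.
    # The lexicon, tags, and rule are invented for illustration only.
    LEXICON = {
        "the": ("le", "DET"),
        "red": ("rouge", "ADJ"),
        "dog": ("chien", "NOUN"),
    }

    def translate_rule_based(sentence):
        # 1) "Parse": tokenize and tag each word from the lexicon.
        tagged = [LEXICON[w] for w in sentence.lower().split()]
        # 2) Transfer: English ADJ NOUN becomes French NOUN ADJ.
        #    (Real systems need many more rules, e.g. for adjectives
        #    that precede the noun in French, and for agreement.)
        i = 0
        while i < len(tagged) - 1:
            if tagged[i][1] == "ADJ" and tagged[i + 1][1] == "NOUN":
                tagged[i], tagged[i + 1] = tagged[i + 1], tagged[i]
                i += 2
            else:
                i += 1
        # 3) Generate: emit the target-language words.
        return " ".join(word for word, _tag in tagged)

    print(translate_rule_based("the red dog"))  # -> le chien rouge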
  B. Statistical MT
   Statistical Machine Translation (SMT) systems use computer algorithms that explore millions of possible ways of putting smaller pieces of text together in an effort to produce the translation that looks best. SMT relies on statistical translation models whose parameters stem from the analysis of monolingual and bilingual corpora. Building statistical translation models is a quick process, but the technology relies heavily on existing multilingual corpora: a minimum of about 2 million words is required for a specific domain, and even more for general language. In theory it is possible to reach the quality threshold, but most companies do not have such large amounts of existing multilingual corpora from which to build the necessary translation models. Additionally, statistical machine translation is CPU intensive and requires an extensive hardware configuration to run translation models at average performance levels.
  C. Hybrid MT
   In order to address both quality and time-to-market limitations, many rule-based machine translation developers are augmenting their core technology with statistical machine translation technology in what is referred to as 'hybrid' machine translation.
   Rule-based MT provides good out-of-domain quality and is by nature predictable. Dictionary-based customization guarantees improved quality and compliance with corporate terminology. But translation results may lack the fluency readers expect. In terms of investment, the customization cycle needed to reach the quality threshold can be long and costly. The performance is high even on standard hardware.
   Statistical MT provides good quality when large and qualified corpora are available. The translation is fluent, meaning it reads well and therefore meets user expectations. However, the translation is neither predictable nor consistent. Training from good corpora is automated and cheap. But training on general language corpora, meaning text outside the specified domain, gives poor results. Furthermore, statistical MT requires significant hardware to build and manage large translation models.

            II. GOOGLE TRANSLATE MACHINE

   Google Translate (GT) is a popular translation service provided by Google that translates a word, a phrase, a section of text, or an entire web page into one of 51 languages. Google Translate can not only translate words and sentences, but also pages, books, and even entire websites.
   The stated goal of Google Translate is to make information universally accessible and useful, regardless of the language in which it is written. When Google Translate generates a translation, it looks for patterns in hundreds of millions of documents to help decide on the best translation. By detecting patterns in documents that have already been translated by human translators, Google Translate can make intelligent guesses as to what an appropriate translation should be. This process of seeking patterns in large amounts of text is called "statistical machine translation". Since the translations are generated by machines, not every translation will be perfect. The more human-translated documents Google Translate can analyze in a specific language, the better the translation quality will be. This is why translation accuracy sometimes varies across languages [1].
   The history of Google Translate starts in 2001 with a rule-based MT engine. At first it supported just six languages: English, French, German, Italian, Portuguese, and Spanish (English to and from the others). From 2004, Chinese, Japanese, and Korean were added. Around 2006, Google decided to move to statistical MT and began adding new languages, starting with Arabic and Russian, under this model; since 2006, Google Translate has used proprietary, in-house technology based on statistical machine translation. From 2007, following the results of SMT modeling, Google decided to replace all of the rule-based engines with the statistical version, and now all languages use SMT.
   The core algorithm that makes Google Translate work is therefore statistical machine translation (SMT). SMT uses a statistical model to determine the word translation; this basic method does not follow any language-specific translation rules.
   To build a statistical model, we need a bilingual text corpus: a database of source sentences and target sentences. For example, if we want to build a statistical model for English-to-Spanish translation, we need a database of English sentences and their Spanish translations. The more sentences we have, the better the statistical model.
   The computer is then trained to calculate word-distribution probabilities from these sentence pairs. For example, if the word AAA has an 80% probability of being translated into BBB, then we can be confident that AAA can be translated into BBB.
   Since it does not rely on any linguistic rules, SMT can be used to translate between any pair of languages. Although it takes time to build bilingual corpora, the result is much better than rule-based translation.

   [Figure 1 - basics of creating an SMT language model]
   As Figure 1 shows, the first step is collecting many documents from many sources. The system then aligns sentences and creates a database of sentence pairs (a bilingual text corpus).
   The system is trained on that corpus: it analyzes the statistics of the word distribution in each sentence. The output of this training is a language model. Each translation pair has its own language model, and the model is updated each time the system learns from a new corpus.
   Using this language model, we can translate other sentences.

  A. Bilingual text corpus
   We know that Google Translate supports many language-pair translations. Google gathers bilingual text corpora from many documents. They scan original-version books together with their translated versions; they crawl websites that exist in two or more language versions; and sometimes they hire translators to translate from one language to another.
   Once they have bilingual documents, Google performs word alignment. They have software that can align source sentences with translated sentences. This software creates a database of pairs of source sentences and translated sentences. A minimal sketch of how translation probabilities can be estimated from such aligned pairs follows.
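   The following Python sketch shows, under strong simplifying assumptions, how word-translation probabilities like the 80% AAA-to-BBB example above could be estimated from aligned sentence pairs by simple co-occurrence counting. Real SMT training uses EM-based alignment (e.g. the IBM models) rather than raw co-occurrence, and the tiny English-Spanish corpus here is invented.

    from collections import Counter, defaultdict

    # Invented toy English-Spanish corpus of aligned sentence pairs.
    corpus = [
        ("the house", "la casa"),
        ("the green house", "la casa verde"),
        ("the car", "el coche"),
    ]

    # Count how often each (source word, target word) pair co-occurs
    # in aligned sentences, then normalize per source word.
    cooc = defaultdict(Counter)
    for src, tgt in corpus:
        for s in src.split():
            for t in tgt.split():
                cooc[s][t] += 1

    def t_prob(s, t):
        """Crude estimate of t(t|s) from co-occurrence counts."""
        total = sum(cooc[s].values())
        return cooc[s][t] / total if total else 0.0

    print(t_prob("house", "casa"))  # 2 of 5 co-occurrences -> 0.4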
  B. Benefits of SMT
   SMT has benefits over the traditional translation method (e.g. rule-based translation):

  • Generally, an SMT translator is not tailored to specific languages. It is built to support many language pairs, so SMT makes better use of resources. This means building an SMT translator is cheaper than the traditional method.
  • Depending on the size of the bilingual corpus, an SMT translator gives more natural translations. The more bilingual corpora it has, and the more the translator is trained with new bilingual corpora, the more natural its translations become.

   While there is much machine translation software on the internet, Google Translate is clearly at the front of the pack. One of Google's clear advantages is phonetic typing: it allows users to translate more than just Latin-based languages by enabling a web-based phonetic keyboard right in the translator. Many languages, such as Russian, Greek, Hindi, Serbian, Arabic, and Urdu, use different scripts than English, but their words may sound like certain terms in English.

         III. STATISTICAL MACHINE TRANSLATION

   Statistical machine translation is based on a channel model. Given a sentence T in one language (German) to be translated into another language (English), it considers T as the target of a communication channel, and its translation S as the source of the channel. The machine translation task then becomes to recover the source from the target. In principle, every English sentence is a possible source for a German target sentence. If we assign a probability P(S|T) to each pair of sentences (S, T), then the problem of translation is to find the source S for a given target T such that P(S|T) is maximal. According to Bayes' rule,

   P(S \mid T) = \frac{P(S)\, P(T \mid S)}{P(T)}    (1)

   Since the denominator is independent of S,

   \hat{S} = \arg\max_{S} P(S)\, P(T \mid S)    (2)

   Therefore a statistical machine translation system must deal with the following three problems (a small numerical illustration of (2) follows this list):

  • Modeling Problem: How do we depict the process of generating a sentence in a source language, and the process used by the channel to generate a target sentence upon receiving a source sentence? The former is the problem of language modeling, and the latter is the problem of translation modeling. They provide a framework for calculating P(S) and P(T|S) in (2) [4].
  • Learning Problem: Given a statistical language model P(S) and a statistical translation model P(T|S), how do we estimate the parameters of these models from a bilingual corpus of sentences?
  • Decoding Problem: With a fully specified (framework and parameters) language and translation model, and given a target sentence T, how do we efficiently search for the source sentence that satisfies (2)?
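   As a minimal sketch of the noisy-channel decision rule in (2), assume we already have a language model P(S) and a channel model P(T|S) as plain dictionaries over a tiny candidate set; decoding is then a one-line argmax. All probabilities below are invented for illustration.

    # Noisy-channel decision rule (2): pick S maximizing P(S) * P(T|S).
    target = "das Haus"
    candidates = {
        # candidate source S: (language model P(S), channel model P(T|S))
        "the house": (0.020, 0.50),
        "house the": (0.0001, 0.50),
        "the car":   (0.030, 0.01),
    }

    best = max(candidates, key=lambda s: candidates[s][0] * candidates[s][1])
    print(best)  # "the house": fluent (high P(S)) and faithful (high P(T|S))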
   Some of the most important modeling and learning issues used in a statistical machine translation system like Google Translate are the following: it starts with basic probability and continues with sums and products, the noisy channel, Bayesian reasoning, word reordering, word choice, language modeling, n-grams, smoothing, evaluating models, perplexity, log-probability arithmetic, translation modeling, translation as string rewriting, model 2 (language model), model 3, model parameters, word-to-word alignments, estimating parameter values for word-to-word alignments, bootstrapping, all possible alignments, collecting fractional counts, alignment probabilities, decoding, efficient model training, and so on. In this paper I have chosen a few of these methods and tried to describe them in a naive way.

  A. Model 2
   In this model, the channel receives a source English sentence e = e_1, ..., e_l and generates a German sentence g = g_1, ..., g_m at the target end in the following way:

1. With a distribution P(m|e), randomly choose the length m of the German translation g. In model 2, the distribution is independent of m and e:

   P(m \mid e) = \epsilon    (3)

   where \epsilon is a small, fixed number.

2. For each position j (0 < j ≤ m) in g, find the corresponding position a_j in e according to an alignment distribution P(a_j | j, a_1^{j-1}, m, e). In model 2, the distribution depends only on j, a_j, and the lengths of the English and German sentences:

   P(a_j \mid j, a_1^{j-1}, m, e) = a(a_j \mid j, m, l)    (4)

3. Generate the word g_j at position j of the German sentence from the English word e_{a_j} at the aligned position a_j, according to a translation distribution P(g_j | a_1^m, g_1^{j-1}, e) = t(g_j | e_{a_j}). The distribution here depends only on g_j and e_{a_j}.

   Therefore, P(g|e) is the sum of the probabilities of generating g from e over all possible alignments, in which position j in the target sentence g is aligned to position a_j in the source sentence e [4]:

   P(g \mid e) = \epsilon \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \prod_{j=1}^{m} t(g_j \mid e_{a_j})\, a(a_j \mid j, m, l)
               = \epsilon \prod_{j=1}^{m} \sum_{i=0}^{l} t(g_j \mid e_i)\, a(i \mid j, m, l)    (5)
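   A minimal sketch of how (5) could be evaluated, assuming the parameter tables t(g|e) and a(i|j, m, l) are given as dictionaries (all values invented). It exploits the factorization on the right-hand side of (5), so the cost is O(l·m) rather than a sum over all (l+1)^m alignments.

    # Invented toy parameter tables for IBM Model 2; the NULL source
    # position i = 0 is omitted for brevity.
    EPSILON = 1.0
    t = {("das", "the"): 0.7, ("haus", "house"): 0.8,
         ("das", "house"): 0.05, ("haus", "the"): 0.1}
    # a[(i, j, l, m)] holds the alignment probability a(i | j, m, l).
    a = {(1, 1, 2, 2): 0.9, (2, 1, 2, 2): 0.1,
         (1, 2, 2, 2): 0.2, (2, 2, 2, 2): 0.8}

    def model2_prob(g, e):
        """P(g|e) via the factored right-hand side of equation (5)."""
        l, m = len(e), len(g)
        prob = EPSILON
        for j, g_word in enumerate(g, start=1):         # target positions
            prob *= sum(t.get((g_word, e[i - 1]), 0.0)  # source positions
                        * a.get((i, j, l, m), 0.0)
                        for i in range(1, l + 1))
        return prob

    print(model2_prob(["das", "haus"], ["the", "house"]))  # 0.419...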
                     IV. DECODING

   The decoding algorithm is a crucial part of statistical machine translation. Its performance directly affects the quality and efficiency of translation. Without a good and efficient decoding algorithm, a statistical machine translation system may miss the best translation of an input sentence even if it is perfectly predicted by the model.

  A. Stack decoders
   Stack decoders are widely used in speech recognition systems. The basic algorithm can be described as follows (a small sketch in code follows this list):

   1) Initialize the stack with a null hypothesis.
   2) Pop the hypothesis with the highest score off the stack; call it the current hypothesis.
   3) If the current hypothesis is a complete sentence, output it and terminate.
   4) Extend the current hypothesis by appending a word from the lexicon to its end. Compute the score of the new hypothesis and insert it into the stack. Do this for all the words in the lexicon.
   5) Go to (2).
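   A minimal Python sketch of the stack-decoder loop above, using a priority queue as the "stack" of scored hypotheses. The scoring function, completion test, and tiny lexicon are invented placeholders; a real decoder scores hypotheses with the language and translation models, as described in the next subsection.

    import heapq

    LEXICON = ["the", "house", "is"]       # invented toy lexicon
    MAX_LEN = 3                            # toy completion criterion

    def score(hypothesis):
        """Placeholder score; a real decoder combines P(S) and P(T|S)."""
        return -len(hypothesis)            # dummy: prefers short hypotheses

    def is_complete(hypothesis):
        return len(hypothesis) == MAX_LEN  # toy completion test

    def stack_decode():
        # heapq is a min-heap, so push negated scores to pop the best first.
        stack = [(-score(()), ())]         # step 1: null hypothesis
        while stack:
            _, hyp = heapq.heappop(stack)  # step 2: best hypothesis
            if is_complete(hyp):           # step 3: output and terminate
                return hyp
            for word in LEXICON:           # step 4: extend with every word
                new_hyp = hyp + (word,)
                heapq.heappush(stack, (-score(new_hyp), new_hyp))
        return None

    print(stack_decode())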
  B. Scoring the hypotheses
   In stack search for statistical machine translation, a hypothesis H includes (a) the length l of the source sentence, and (b) the prefix words in the sentence. Thus a hypothesis can be written as H = l : e_1 e_2 ... e_k, which postulates a source sentence of length l and its first k words. The score of H, f_H, consists of two parts: the prefix score g_H for e_1 e_2 ... e_k, and the heuristic score h_H for the part e_{k+1} e_{k+2} ... e_l that is yet to be appended to H to complete the sentence. Equation (5) can be used to assess a hypothesis: although it was obtained from the alignment model, each word e_i in the hypothesis contributes probability mass to the target-sentence words. For each hypothesis, we use S_H(j) to denote the probability mass for the target word g_j contributed by the words in the hypothesis:
   S_H(j) = \epsilon \sum_{i=0}^{k} t(g_j \mid e_i)\, a(i \mid j, m, l)    (6)

   To guarantee an optimal search result, the heuristic function must be an upper bound of the score for all possible extensions e_{k+1} e_{k+2} ... e_l of a hypothesis. In other words, the benefit of extending a hypothesis should never be underestimated; otherwise the search algorithm will conclude prematurely with a non-optimal hypothesis.
   On the other hand, if the heuristic function overestimates the merit of extending a hypothesis too much, the search algorithm will waste a huge amount of time after it hits a correct result trying to safeguard optimality.
   Due to physical space limitations, we cannot keep all hypotheses alive. One possibility is to set a constant M and, whenever the number of hypotheses exceeds M, prune the hypotheses with the lowest scores. In one set of experiments, the authors set M = 20,000.
   There is a time limitation too: it is of little practical interest to keep a seemingly endless search alive for too long.
   Since the heuristic function overestimates the merit of extending a hypothesis, the decoder always prefers hypotheses of a long sentence, which have a better chance to maximize the likelihood of the target words. The decoder will extend the hypotheses with large l first, and their children will soon occupy the stack and push the hypotheses of a shorter source sentence out of the stack. If the source sentence is a short one, the decoder will never be able to find it, because the hypotheses leading to it have been pruned permanently.

                 V. PHRASE TRANSLATION

   The phrase translation model is based on the noisy channel model. We use Bayes' rule to reformulate the probability of translating a foreign sentence g into English e as

   \arg\max_{e} p(e \mid g) = \arg\max_{e} p(g \mid e)\, p(e)    (7)

   This allows for a language model p(e) and a separate translation model p(g|e).
   During decoding, the foreign input sentence g is segmented into a sequence of I phrases \bar{g}_1^I. A uniform probability distribution over all possible segmentations can be assumed.
   Each foreign phrase \bar{g}_i in \bar{g}_1^I is translated into an English phrase \bar{e}_i. The English phrases may be reordered. Phrase translation is modeled by a probability distribution \phi(\bar{g}_i \mid \bar{e}_i). Recall that, due to the Bayes rule, the translation direction is inverted from a modeling standpoint.
   Reordering of the English output phrases is modeled by a relative distortion probability distribution d(a_i - b_{i-1}), where a_i denotes the start position of the foreign phrase that was translated into the i-th English phrase, and b_{i-1} denotes the end position of the foreign phrase translated into the (i-1)-th English phrase.
   The distortion probability distribution d(.) can be trained using a joint probability model. Alternatively, it is also possible to use a simpler distortion model d(a_i - b_{i-1}) = \alpha^{|a_i - b_{i-1} - 1|} with an appropriate value for the parameter \alpha.
   In order to calibrate the output length, a factor \omega is introduced for each generated English word, in addition to the trigram language model p_{LM}. This is a simple means to optimize performance. Usually, this factor is larger than 1, biasing toward longer output.
   In summary, the best English output sentence e_best given a foreign input sentence g according to this model is

   e_{best} = \arg\max_{e} p(e \mid g) = \arg\max_{e} p(g \mid e)\, p_{LM}(e)\, \omega^{length(e)}

   where p(g|e) is decomposed into

   p(\bar{g}_1^I \mid \bar{e}_1^I) = \prod_{i=1}^{I} \phi(\bar{g}_i \mid \bar{e}_i)\, d(a_i - b_{i-1})    (8)
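   A minimal sketch of scoring one segmented, reordered translation candidate under this model: the product in (8), multiplied by a toy language model and the word-count factor \omega. The phrase table, the distortion parameter \alpha, and all probabilities are invented; a real system also searches over segmentations and orderings rather than scoring a single fixed one.

    # Score one candidate under the phrase model: equation (8) times
    # p_LM(e) and omega**length(e). All numbers are invented.
    ALPHA = 0.5          # simple distortion model d(x) = ALPHA**abs(x - 1)
    OMEGA = 1.1          # word-count factor, > 1 biases longer output

    phrase_table = {     # phi(g_phrase | e_phrase), invented
        ("das haus", "the house"): 0.6,
        ("ist klein", "is small"): 0.5,
    }

    def lm_prob(english):
        return 0.001     # placeholder for a real trigram LM p_LM(e)

    def score(pairs, starts, ends):
        """pairs: (foreign phrase, English phrase) in English order;
        starts[i]/ends[i]: foreign start/end positions of the i-th pair."""
        prob = 1.0
        prev_end = 0                     # b_0 = 0 by convention
        for i, (g, e) in enumerate(pairs):
            distortion = ALPHA ** abs(starts[i] - prev_end - 1)
            prob *= phrase_table[(g, e)] * distortion
            prev_end = ends[i]
        english = " ".join(e for _, e in pairs)
        return prob * lm_prob(english) * OMEGA ** len(english.split())

    print(score([("das haus", "the house"), ("ist klein", "is small")],
                starts=[1, 3], ends=[2, 4]))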
                  VI. PHRASE DECODER

   The phrase-based decoder employs a beam search algorithm, similar to the one by [6]. The English output sentence is generated left to right in the form of partial translations (or hypotheses).
   The search starts with an initial empty hypothesis. A new hypothesis is expanded from an existing hypothesis by the translation of a phrase as follows: a sequence of untranslated foreign words and a possible English phrase translation for them is selected; the English phrase is attached to the existing English output sequence; the foreign words are marked as translated; and the probability cost of the hypothesis is updated. (A small sketch of such a hypothesis expansion in code follows.)
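   A minimal sketch of the hypothesis record and one expansion step, using invented data structures: a hypothesis tracks the English output so far, a coverage set of translated foreign positions, and an accumulated probability cost.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Hypothesis:
        english: tuple = ()               # English output so far
        covered: frozenset = frozenset()  # translated foreign positions
        cost: float = 1.0                 # accumulated probability

    def expand(hyp, span, english_phrase, phrase_prob):
        """Translate the untranslated foreign span [start, end)."""
        positions = frozenset(range(*span))
        assert not positions & hyp.covered, "span already translated"
        return Hypothesis(
            english=hyp.english + tuple(english_phrase.split()),
            covered=hyp.covered | positions,
            cost=hyp.cost * phrase_prob,  # a real decoder also folds in
        )                                 # LM and distortion costs

    h1 = expand(Hypothesis(), (0, 2), "the house", 0.6)  # invented numbers
    print(h1.english, sorted(h1.covered), h1.cost)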
   The cheapest (highest probability) final hypothesis with no untranslated foreign words is the output of the search.
   The hypotheses are stored in stacks. The stack s_m contains all hypotheses in which m foreign words have been translated. Search hypotheses can then be recombined, as done by [7]. While this somewhat reduces the number of hypotheses stored in each stack, stack size is still exponential with respect to input sentence length, which makes an exhaustive search impractical.
   Thus, weak hypotheses are pruned based on the cost they have incurred so far plus a future cost estimate. For each stack, only a beam of the best n hypotheses is kept.
   Since the future cost estimate is not perfect, this leads to search errors. The future cost estimate takes into account the estimated phrase translation cost, but not the expected distortion cost.
   The estimate is computed as follows: for each possible phrase translation anywhere in the sentence, its phrase translation probability is multiplied by the language model probability of the generated English phrase. As the language model probability, one can use the unigram probability for the first word, the bigram probability for the second, and the trigram probability for all following words.
   Given the costs for the translation options, the estimated future cost for any sequence of consecutive foreign words can be computed by dynamic programming. During translation, future costs for uncovered foreign words can then be quickly obtained by consulting this table. If a hypothesis has broken sequences of untranslated foreign words, the cost for each sequence is looked up and the product of their costs is taken. (A small sketch of this dynamic program follows.)
   The beam size, i.e. the maximum number of hypotheses in each stack, is fixed to a certain number. The number of translation options is linear in the sentence length. Hence, the time complexity of the beam search is quadratic in sentence length and linear in the beam size.
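   A minimal sketch of the future-cost table described above, assuming each foreign span's best phrase-translation-times-LM probability has already been collected in an invented span_cost dictionary; the dynamic program then takes, for every span, the better of translating it as one phrase or splitting it in two.

    # Future cost estimation by dynamic programming over foreign spans.
    # span_cost[(i, j)]: best probability of translating words i..j-1 as
    # a single phrase (phrase prob * LM prob); values are invented.
    N = 3  # foreign sentence length
    span_cost = {(0, 1): 0.3, (1, 2): 0.4, (2, 3): 0.2,
                 (0, 2): 0.15, (1, 3): 0.09}

    future = {}
    for length in range(1, N + 1):
        for i in range(N - length + 1):
            j = i + length
            best = span_cost.get((i, j), 0.0)   # translate whole span
            for k in range(i + 1, j):           # or split at position k
                best = max(best, future[(i, k)] * future[(k, j)])
            future[(i, j)] = best

    print(future[(0, 3)])  # estimated future cost for the whole sentence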

                   VII. CONCLUSION
   I have tried to explore the topic of Google machine translation, which covers a large area of research in artificial intelligence. I considered the two major techniques used to implement an MT engine; each technique has its advantages and disadvantages. For Latin-script languages, or languages which are simple in their linguistics and rules, the rule-based technique is efficient and easier to implement. But in a case like Google MT, which supports a large variety of languages and has the ability to use large amounts of data as training data for the different languages, it is natural to use SMT as the engine method. Google research is very active in this area, and we can already see the acceptable results of Google MT. I think in the near future we will see even higher accuracy in Google MT.

                     REFERENCES
[1]  http://translate.google.com/about/intl/en_ALL/.
[2]  http://www.statmt.org/.
[3]  http://cseweb.ucsd.edu/~dkauchak/mt-tutorial/.
[4]  P. Brown et al., "The Mathematics of Statistical Machine Translation," Computational Linguistics, 1993.
[5]  http://www.systransoft.com.
[6]  F. Jelinek, "Statistical Methods for Speech Recognition," The MIT Press, 1998.
[7]  F. J. Och, N. Ueffing, and H. Ney, "An efficient A* search algorithm for statistical machine translation," in Data-Driven MT Workshop, 2001.
[8]  M. Collins, "Head-driven Statistical Models for Natural Language Parsing," Ph.D. thesis, University of Pennsylvania, Philadelphia, 1999.