IAAA / PSTALN
Machine translation

Benoit Favre
Aix-Marseille Université, LIS/CNRS

last generated on January 20, 2020
Definition

   What is machine translation (MT)?
     ▶   Write a translated version of a text from a source to a target language
     ▶   Word, sentence, paragraph, document-level translation
   Formalization
     ▶   x = x1 . . . xn : sequence of words in the source language (ex: Chinese)
     ▶   y = y1 . . . ym : sequence of words in the target language (ex: English)
     ▶   Objective: find f such that y = f(x)
   Why is it hard?
     ▶   Non-synchronous n to m symbol generation
     ▶   One-to-many / many-to-one word translation
     ▶   Things move around
     ▶   Some phrases do not translate
   Yet people are quite good at it
     ▶   But learning a new language takes a lot of effort

History of MT

   1950s: development of computers/AI in the West is driven by the idea of
   translating from Russian to English
     ▶   Link with cryptography
   1960-1980: Reduced domains
     ▶   Bilingual dictionaries + rules to order words
   1980-2000: Statistical approaches
     ▶   Translate from examples through statistical models
   2000-2010: Translate speech
     ▶   DARPA projects: high volume article/blog translation, dialogues with
         translation
   2010+: Neural machine translation
     ▶   Neural language model rescoring
     ▶   Sequence-to-sequence decoding
     ▶   Attention mechanisms

The translation pyramid

   [Figure: the translation pyramid (Vauquois triangle). Translation can operate at increasing
    levels of abstraction between source and target:
      - lexical: "I" / "to like" / "soup"  →  "Je" / "aimer" / "la soupe"
      - syntactic: subject-verb-object  →  sujet-verbe-complément
      - semantic: like(person(me), edible(soup))  →  aimer(personne(moi), comestible(soupe))
      - interlingua: a shared language-independent representation
    Source: the English sentence "I like soup"; target: the French sentence "J'aime la soupe".]

Machine translation (the legacy approach)

Definitions
    source: text in the source language (ex: Chinese)
    target: text in the target language (ex: English)
Phrase-based statistical translation
    Decouple word translation and word ordering

                  P(target|source) = P(source|target) × P(target) / P(source)

Model components
    P(source|target) = translation model
    P(target) = language model
    P(source) = ignored because constant for an input

Language model (LM)
  Objective
     ▶   Find function that ranks a word sequence according to its likelihood of
         being proper language
     ▶   Compute probability of text to originate from a corpus

          P(w1 … wn ) = P(wn | wn−1 … w1 ) P(wn−1 … w1 )
                      = P(wn | wn−1 … w1 ) P(wn−1 | wn−2 … w1 ) P(wn−2 … w1 )
                      = P(w1 ) ∏_{i>1} P(wi | wi−1 … w1 )

            P(le chat boit du lait) = P(le)
                                        × P(chat|le)
                                        × P(boit|le chat)
                                        × P(du|le chat boit)
                                        × P(lait|le chat boit du)
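
   As a minimal illustration of the chain-rule decomposition above, the sketch below scores a
   sentence with a toy conditional model (the probability table is made up for the example):

import math

# Toy conditional probabilities P(word | full history); the values are made up for illustration.
cond_prob = {
    ("le", ()): 0.1,
    ("chat", ("le",)): 0.05,
    ("boit", ("le", "chat")): 0.2,
    ("du", ("le", "chat", "boit")): 0.3,
    ("lait", ("le", "chat", "boit", "du")): 0.4,
}

def sentence_logprob(words):
    """Chain rule: log P(w1..wn) = sum_i log P(wi | w1..wi-1)."""
    total = 0.0
    for i, w in enumerate(words):
        total += math.log(cond_prob[(w, tuple(words[:i]))])
    return total

print(math.exp(sentence_logprob(["le", "chat", "boit", "du", "lait"])))  # P(le chat boit du lait)
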
N-gram LM
  Apply Markov chain limited-horizon approximation

       P(word(i) | history(1, i − 1)) ≃ P(word(i) | history(i − k, i − 1))
              P(wi | w1 … wi−1 ) ≃ P(wi | wi−k … wi−1 )

   For k = 2

       P(le chat boit du lait) ≃ P(le) × P(chat|le) × P(boit|le chat)
                                       × P(du|chat boit) × P(lait|boit du)

   Estimation

                        P(boit|le chat) = nb(le chat boit) / nb(le chat)

   An n-gram LM (n = k + 1) uses statistics over n words for estimation
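
   A count-based sketch of this estimation (maximum likelihood, no smoothing), assuming the
   corpus is a list of tokenized sentences:

from collections import Counter

def train_ngram_lm(sentences, k=2):
    """Estimate P(w | k previous words) by relative frequencies."""
    history_counts, ngram_counts = Counter(), Counter()
    for sent in sentences:
        words = ["<s>"] * k + sent + ["</s>"]
        for i in range(k, len(words)):
            history = tuple(words[i - k:i])
            history_counts[history] += 1
            ngram_counts[history + (words[i],)] += 1
    return lambda w, history: ngram_counts[tuple(history) + (w,)] / history_counts[tuple(history)]

prob = train_ngram_lm([["le", "chat", "boit", "du", "lait"]], k=2)
print(prob("boit", ["le", "chat"]))  # nb(le chat boit) / nb(le chat) = 1.0 on this toy corpus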

LM Smoothing
  Example, bigram model (2-gram) :

  P(la chaise boit du lait) = P(la) × P(chaise|la) × P(boit|chaise) × . . .

   How to deal with unseen events?
   Pseudo-count method (Laplace smoothing), with N the number of possible events
   (vocabulary size):

                       Ppseudo (boit|chaise) = (nb(chaise boit) + 1) / (nb(chaise) + N)

   Interpolation methods

   Pinterpol (boit|chaise) = λchaise P(boit|chaise) + (1 − λchaise ) P(boit)

   Backoff methods: like interpolation, but applied only when events are not
   observed
   Most popular approach: "modified Kneser-Ney" [James et al, 2000]
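
   A small sketch of these two smoothing schemes on top of raw counts (λ is fixed by hand here;
   in practice it is tuned on held-out data):

from collections import Counter

corpus = "le chat boit du lait".split()
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))
vocab_size = len(unigram_counts)

def laplace_prob(w, h):
    """Laplace smoothing: one pseudo-count for every possible continuation of h."""
    return (bigram_counts[(h, w)] + 1) / (unigram_counts[h] + vocab_size)

def interpolated_prob(w, h, lam=0.7):
    """Interpolate the bigram estimate with the unigram estimate of the predicted word."""
    p_bigram = bigram_counts[(h, w)] / unigram_counts[h] if unigram_counts[h] else 0.0
    p_unigram = unigram_counts[w] / len(corpus)
    return lam * p_bigram + (1 - lam) * p_unigram

print(laplace_prob("boit", "chaise"))       # unseen history still gets non-zero mass
print(interpolated_prob("boit", "chaise"))  # falls back on P(boit)
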
Neural language model

   Train a (potentially recurrent) classifier to predict the next word
                        [Figure: an RNN reads ⟨start⟩ w1 w2 … wn and is trained to predict w1 w2 … wn ⟨end⟩]

   In training, two possible regimes:
     ▶   Use true word to predict next word
     ▶   Use predicted word from previous slot
                        [Figure: the targets w1 … wn ⟨end⟩ are predicted either from the true previous
                         words ⟨start⟩ w1 … wn (teacher forcing) or from the model's own predictions]
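
   A minimal recurrent LM sketch in PyTorch, trained with teacher forcing (the true previous
   word is fed at every step); sizes and data are placeholders:

import torch
import torch.nn as nn

class RNNLM(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, inputs, hidden=None):
        states, hidden = self.rnn(self.embed(inputs), hidden)  # (batch, seq, hidden_dim)
        return self.out(states), hidden                        # logits over the next word

vocab_size = 1000
model = RNNLM(vocab_size)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())

# Teacher forcing: inputs are <start> w1 ... w(n-1), targets are w1 ... wn <end>.
batch = torch.randint(0, vocab_size, (8, 20))   # fake token ids for illustration
inputs, targets = batch[:, :-1], batch[:, 1:]
logits, _ = model(inputs)
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
optimizer.step()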

Softmax approximations
   When vocabulary is large (> 10000), the softmax layer gets too
   expensive
     ▶   Store a h × |V | matrix in GPU memory
     ▶   Training time gets very long
   Turn the problem into a sequence of decisions
     ▶   Hierarchical softmax

   Turn the problem into a small set of binary decisions
     ▶   Noise contrastive estimation, sampled softmax...
     ▶   → Score the target against a small set of randomly sampled words
   More here:
   http://sebastianruder.com/word-embeddings-softmax
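
   A rough sketch of the sampled-softmax idea in PyTorch (uniform negative sampling; the
   correction term for the proposal distribution is omitted for brevity), showing how the full
   |V|-way decision is replaced by a small one:

import torch
import torch.nn as nn

vocab_size, hidden_dim, num_neg = 50000, 256, 64
output_embed = nn.Embedding(vocab_size, hidden_dim)  # rows play the role of the softmax matrix

def sampled_softmax_loss(hidden, target_ids):
    """Score the true word against a few random negatives instead of the whole vocabulary."""
    batch = hidden.size(0)
    negatives = torch.randint(0, vocab_size, (batch, num_neg))
    candidates = torch.cat([target_ids.unsqueeze(1), negatives], dim=1)  # (batch, 1 + num_neg)
    logits = torch.bmm(output_embed(candidates), hidden.unsqueeze(2)).squeeze(2)
    labels = torch.zeros(batch, dtype=torch.long)                        # true word is at index 0
    return nn.functional.cross_entropy(logits, labels)

hidden_states = torch.randn(8, hidden_dim)            # fake RNN states for illustration
loss = sampled_softmax_loss(hidden_states, torch.randint(0, vocab_size, (8,)))
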
Perplexity
   How good is a language model?
      1   Intrinsic metric: compute the probability of a validation corpus
      2   Extrinsic metric: use it in a system and compute its performance
   Perplexity (PPL) is an intrinsic measure
      ▶   If you had a die with one word per face, how often would you get the
          correct next word for a validation context?
      ▶   Lower is better
      ▶   Only comparable for LM trained with the same vocabulary

                        PPL(w1 … wn ) = p(w1 … wn )^(−1/n)
                                      = ∏_{i=1..n} p(wi | wi−1 … w1 )^(−1/n)
                                      = exp2( −(1/n) ∑_{i=1..n} log2 p(wi | wi−1 … w1 ) )
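
   The last form is how perplexity is computed in practice, from per-word log probabilities;
   a small sketch:

import math

def perplexity(word_logprobs, base=2):
    """PPL = base^(-(1/n) * sum of log_base p(wi | history)); lower is better."""
    n = len(word_logprobs)
    return base ** (-sum(word_logprobs) / n)

# log2 probabilities of each word of a validation sentence under some LM (made-up values)
logprobs = [math.log2(p) for p in [0.1, 0.05, 0.2, 0.3, 0.4]]
print(perplexity(logprobs))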

Limits of language modeling

   Train a language model on the One Billion Word benchmark
      ▶   “Exploring the Limits of Language Modeling", Jozefowicz et al. 2016
      ▶   800k different words
      ▶   Best model → 3 weeks on 32 GPUs
      ▶   PPL: perplexity evaluation metric (lower is better)

  System                     PPL
  RNN-2048                   68.3
  Interpolated KN 5-GRAM     67.6
  LSTM-512                   32.2
  2-layer LSTM-2048          30.6
  + CNN inputs               30.0
  + CNN softmax              39.8

Byte-pair encoding (BPE)

     Word language models: large decision layer, unknown-word problem
     Character language models: don't know about words, require stability over a long history

   Word-piece models
     ▶   Split words in smaller pieces
     ▶   Frequent tokens are modeled as one piece
     ▶   Can factor morphology
   Byte pair encoding [Shibata et al, 1999]
     1   Start with alphabet containing all characters
            ⋆   Split words as characters
      2   Repeat until the desired alphabet size is reached (typically 10-30k)
            1   Compute most frequent 2-gram (a, b)
            2   Add to alphabet new symbol γ(a,b)
            3   Replace all occurrences of (a, b) with γ(a,b) in corpus
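
   A compact sketch of this procedure on a toy corpus (word frequencies are kept in a
   dictionary of space-separated symbol sequences):

from collections import Counter
import re

def learn_bpe(word_freqs, num_merges):
    """Iteratively merge the most frequent adjacent symbol pair."""
    vocab = {" ".join(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        pattern = re.compile(r"(?<!\S)" + re.escape(a + " " + b) + r"(?!\S)")
        vocab = {pattern.sub(a + b, word): freq for word, freq in vocab.items()}
    return merges

print(learn_bpe({"lower": 5, "lowest": 2, "newer": 6, "wider": 3}, num_merges=5))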

Generation from LM

Given a language model, how can we generate text?
    Start with input x = ⟨start⟩, hidden state h = 0
    Repeat until x = ⟨end⟩:
        1   Compute logits and new hidden state y, h ← model(h, x)
        2   Apply temperature y′ = y/θ
        3   Make a distribution p = softmax(y′)
        4   Draw a symbol from the multinomial distribution s̃ ∼ p
               1   Draw v ∼ Uniform(0, 1)
               2   Take s̃ = the smallest s such that ∑_{i=0..s} pi ≥ v
        5   x ← s̃
    Temperature θ modifies the distribution (θ = 0.7 is a good value)
       ▶   θ < 1 gives more conservative results
       ▶   θ > 1 leads to more variability
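
    A small sketch of this sampling loop, assuming a hypothetical model(h, x) that returns
    logits and a new hidden state (numpy handles the softmax and the draw):

import numpy as np

def generate(model, start_id, end_id, theta=0.7, max_len=100):
    """Sample a sequence from a language model with temperature theta."""
    x, h, output = start_id, None, []
    for _ in range(max_len):
        logits, h = model(h, x)                 # hypothetical LM interface: logits, new state
        p = np.exp(logits / theta)
        p /= p.sum()                            # softmax of the temperature-scaled logits
        x = int(np.random.choice(len(p), p=p))  # draw from the multinomial distribution
        if x == end_id:
            break
        output.append(x)
    return output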

Neural LM: conclusions

   Use (recurrent) classifier to predict next word given history
      ▶   Typically train on true history
   Evaluation
      ▶   Perplexity, but not really related to downstream usefulness
   Large decision layer for realistic vocabulary
      ▶   Softmax approximations
      ▶   Maybe words are not the best representation

Translation model

How to compute P(source|target) = P(s1 , …, sn | t1 , …, tm ) ?

             P(s1 , …, sn | t1 , …, tm ) = nb(s1 , …, sn → t1 , …, tm ) / ∑x nb(x → t1 , …, tm )

     Piecewise translation

     P(I am your father → Je suis ton père) =P(I → je) × P(am → suis)
                                                                      × P(your → ton)
                                                                      × P(father → père)

     To compute those probabilities
        ▶   Need for alignment between source and target words

Bitexts

French
     je déclare reprise la session du parlement européen qui avait été interrompue le
     vendredi 17 décembre dernier et je vous renouvelle tous mes voeux en espérant que
     vous avez passé de bonnes vacances .
     comme vous avez pu le constater , le grand " bogue de l' an 2000 " ne s' est pas
     produit . en revanche , les citoyens d' un certain nombre de nos pays ont été
     victimes de catastrophes naturelles qui ont vraiment été terribles .
     vous avez souhaité un débat à ce sujet dans les prochains jours , au cours de
     cette période de session .

English
     i declare resumed the session of the european parliament adjourned on friday
     17 december 1999 , and i would like once again to wish you a happy new year in the
     hope that you enjoyed a pleasant festive period .
     although , as you will have seen , the dreaded ' millennium bug ' failed to
     materialise , still the people in a number of countries suffered a series of
     natural disasters that truly were dreadful .
     you have requested a debate on this subject in the course of the next few days ,
     during this part-session .

IBM alignment model 1
   Let s = s1 … sn be the source sentence and t = t1 … tm the target sentence
   Let P(si → ta(i) ) be the probability that word si is aligned with (translated by) ta(i)
   We try to compute an alignment a :

                             P(a|s, t) = P(a, s|t) / P(s|t)

   We can write

                             P(s|t) = ∑a P(a, s|t)

   So everything depends on P(a, s|t).
   Definition of the IBM1 model :

                             P(a, s|t) = ∏i P(si → ta(i) )

Determine the alignment

                [Figure: candidate word alignments between "we do not know what is happening ."
                 and "nous ne savons pas ce qui se passe .", next to a learned dictionary]

                Dictionary (excerpt)
                we    nous          0.3695
                we    avons         0.3210
                we    devons        0.2824
                ...
                do    veuillez      0.2707
                do    pensez-vous   0.2317
                do    dis-je        0.2145
                do    ne            0.0425
                ...
                not   pas           0.4126
                not   non           0.3249
                not   ne            0.2721
                ...

   Chicken and egg problem
     ▶   If we had an alignment we could compute translation probabilities
     ▶   If we had translation probabilities, we could compute an alignment
     ▶   → use Expectation-Maximization (EM)

IBM1 alignment pseudo-code
def train_ibm1(bitext, source_words, target_words, num_iters=10):
    """IBM model 1: EM estimation of translation probabilities prob[(t, s)] = P(t|s)."""
    # initialize with uniform probabilities
    prob = {(t, s): 1 / len(target_words) for t in target_words for s in source_words}

    for _ in range(num_iters):  # a fixed number of EM iterations stands in for a convergence test
        count = {(t, s): 0.0 for t in target_words for s in source_words}  # expected counts
        total = {s: 0.0 for s in source_words}

        for target, source in bitext:  # E step: traverse the bitext and collect expected counts
            total_sent = {t: sum(prob[(t, s)] for s in source) for t in target}
            for t in target:
                for s in source:
                    c = prob[(t, s)] / total_sent[t]
                    count[(t, s)] += c
                    total[s] += c

        for s in source_words:  # M step: re-estimate the probabilities
            for t in target_words:
                prob[(t, s)] = count[(t, s)] / total[s] if total[s] else 0.0
    return prob
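
     For example, on a two-sentence toy bitext (word lists are illustrative):

bitext = [
    (["la", "maison"], ["the", "house"]),   # (target, source) pairs, as in the loops above
    (["la", "fleur"], ["the", "flower"]),
]
prob = train_ibm1(bitext, source_words={"the", "house", "flower"},
                  target_words={"la", "maison", "fleur"})
print(prob[("la", "the")])  # grows with each EM iteration and dominates the other options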

IBM models 2+

    Model 1 : lexical
    Model 2 : absolute reordering
    Model 3 : fertility
    Model 4 : relative reordering
    Models 5-6, HMM, learn to align ...
                   [Figure: fertility — one source word s2 generating several target words,
                    P(fertility|s2) — and relative reordering, P(distortion|s2, s3)]

We can hope for an alignment error rate < 30%
    Software: Giza++, berkeleyaligner

Which direction?

   [Figure: word alignment matrices between the English sentence "i would like your advice
    about rule 143 concerning inadmissibility ." and its French translation. Three panels:
    the alignment obtained in the English > French direction, the alignment obtained in the
    French > English direction, and the fusion (symmetrization) of the two.]

Phrase table

   [Figure: phrase pairs extracted from the word alignment between "we do not know what is
    happening ." and "nous ne savons pas ce qui se passe ."]

                                       "Phrase table"
                                       we > nous
                                       do not know > ne savons pas
                                       what > ce qui
                                       is happening > se passe

                                       we do not know > nous ne savons pas
                                       what is happening > ce qui se passe

   Compute translation probability for all known phrases (an extension of
   n-gram language models)
      ▶   Combine with LM and find best translation with decoding algorithm
Decoding problem
  Given source text and model, find best translation

                             t̂ = argmax_t P(t) P(s|t)

  Decoding process
     1   For each segment of the source, generate all possible translations
     2   Combine and reorder translated pieces
     3   Apply language model
     4   Score each complete translation

  Very large search space
     ▶   Requires lots of tricks and optimization
     ▶   Pruning of least probable translations
     ▶   Notable implementation: Moses decoder [Koehn et al, 2006]
Decoding (2)

   [Figure: decoding "tension rises in egypt 's capital" into "la tension augmente dans la
    capitale de l' égypte". The translation model (plus a distortion model) proposes translations
    for source segments, and the language model scores the target n-grams: la tension,
    tension augmente, augmente dans, dans la, la capitale, capitale de, de l', l' égypte.]

Stat-MT: conclusions

   Machine translation trainable from bi-texts
     ▶   Large quantities of translation memories available
     ▶   Use alignment to infer latent link between languages
   Split problem
     ▶   Segment translation (translation model)
     ▶   Segment ordering (language model)
   Search space is large
     ▶   Decoders are complex
     ▶   Require lots of pruning and approximations
   Estimation is hard
     ▶   Pointwise maximum likelihood probability estimation
     ▶   How to deal with unseen events?

Neural machine translation (NMT)

   Phrase-based translation
     ▶   Same coverage problem as with word n-grams
     ▶   Alignment still wrong in 30% of cases
     ▶   A lot of tricks to make it work
     ▶   Researchers have progressively introduced neural networks
            ⋆   Language model
            ⋆   Phrase translation probability estimation
     ▶   The Google Translate approach until mid-2016
   End-to-end approach to machine translation
     ▶   Can we directly input source words and generate target words?

Encoder-decoder framework
   Generalisation of the conditioned language model
     ▶   Build a representation, then generate sentence
     ▶   Also called the seq2seq framework

   But still limited for translation
     ▶   Bad for long sentences
     ▶   How to account for unknown words?
     ▶   How to make use of alignments?
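
   A bare-bones seq2seq sketch in PyTorch (no attention yet), just to make the encoder/decoder
   split concrete; all sizes are placeholders:

import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, embed_dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, embed_dim)
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src, tgt_in):
        # The encoder compresses the source into its final hidden state...
        _, h = self.encoder(self.src_embed(src))
        # ...which conditions a decoder language model over the target (teacher forcing).
        states, _ = self.decoder(self.tgt_embed(tgt_in), h)
        return self.out(states)                   # logits over the target vocabulary

model = Seq2Seq(src_vocab=1000, tgt_vocab=1200)
src = torch.randint(0, 1000, (4, 15))             # fake source token ids
tgt_in = torch.randint(0, 1200, (4, 12))          # target shifted right: <start> y1 ... y(m-1)
print(model(src, tgt_in).shape)                   # torch.Size([4, 12, 1200])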

Interlude: Pointer networks
     Decision is an offset in the input
        ▶   Number of classes dependent on the length of the input
        ▶   Decision depends on hidden state in input and hidden state in output
        ▶   Encoder state ej , decoder state di

                              yi = softmax(v ⊺ tanh(Wej + Udi ))

Oriol Vinyals, Meire Fortunato, Navdeep Jaitly, “Pointer Networks", arXiv:1506.03134

Attention mechanisms
    Loosely based on human visual attention mechanism
       ▶   Let neural network focus on aspects of the input to make its decision
       ▶   Learn what to attend based on what it has produced so far

     αi = softmaxj (falign (di , ej ))
     attni = ∑j αi,j ej
     yi = softmax(W [attni ⊕ di ] + b)

   Additive attention

     falign+ (di , ej ) = v⊺ tanh(W1 di + W2 ej )

   Multiplicative attention

     falign× (di , ej ) = di⊺ W3 ej
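
   A sketch of both scoring functions and of the attention step, following the equations above
   (batch dimension omitted for readability; dimensions are placeholders):

import torch
import torch.nn as nn

hidden = 128
W1 = nn.Linear(hidden, hidden, bias=False)
W2 = nn.Linear(hidden, hidden, bias=False)
W3 = nn.Linear(hidden, hidden, bias=False)
v = nn.Linear(hidden, 1, bias=False)

def additive_scores(d_i, E):
    """f_align+(d_i, e_j) = v^T tanh(W1 d_i + W2 e_j) for every encoder state e_j."""
    return v(torch.tanh(W1(d_i) + W2(E))).squeeze(-1)      # (src_len,)

def multiplicative_scores(d_i, E):
    """f_alignx(d_i, e_j) = d_i^T W3 e_j for every encoder state e_j."""
    return W3(E) @ d_i                                     # (src_len,)

def attend(d_i, E, score_fn=additive_scores):
    alpha = torch.softmax(score_fn(d_i, E), dim=0)         # attention weights over source positions
    return alpha @ E, alpha                                # weighted sum of encoder states

E = torch.randn(15, hidden)     # encoder states e_1 ... e_n
d_i = torch.randn(hidden)       # current decoder state
attn, alpha = attend(d_i, E)
print(attn.shape, alpha.shape)  # torch.Size([128]) torch.Size([15])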

Machine translation with attention

   Learns the word-to-word alignment

How to deal with unknown words

   If you don’t have attention
     ▶   Introduce unk symbols for low frequency words
     ▶   Realign them to the input a posteriori
     ▶   Use large translation dictionary or copy if proper name
   Use attention-based MT and extract α as the alignment
     ▶   Then translate input word directly
   What about morphologically rich languages?
     ▶   Reduce vocabulary size by translating word factors
            ⋆   Byte pair encoding algorithm
      ▶   Use a character-level RNN to transliterate the word

Zero-shot machine translation

     How to deal with the quadratic need for parallel data?
        ▶   n languages → n² pairs
        ▶   So far, people have been using a pivot language (x → English → y)
     Parameter sharing across language pairs
        ▶   Many to one → share the target weights
        ▶   One to many → share the source weights
        ▶   Many to many → train a single system for all pairs
     Zero-shot learning
        ▶   Use a token to identify the target language (ex: a <2fr>-style tag prepended to the source)
        ▶   Let model learn to recognize source language
        ▶   Can process pairs never seen in training!
        ▶   The model learns the “interlingua"
        ▶   Can also handle code switching
"Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot
Translation", Johnson et al., arXiv:1611.04558

Attention is all you need
   Attention treats words as a bag
      ▶   Need RNN to convey word order
   Maybe we can encode position information as embeddings
      ▶   Absolute position
      ▶   Relative position
      ▶   Absolute and relative position?
             ⋆   → use sinusoids of different frequencies and phases (see the sketch below)
   Multiple attention heads
      ▶   Allow network to focus on multiple phenomena
   Multiple layers of attention
      ▶   Encode variables conditioned on subsets of inputs
   Transformer networks [Vaswani et al, 2017, arXiv:1706.03762]
      ▶   Encoder-decoder with multiple layers of multi-head attention
      ▶   http://jalammar.github.io/illustrated-transformer/
   BERT / GPT-2
      ▶   Transformer encoders/decoders pre-trained with language-modeling objectives
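
   A sketch of the sinusoidal position embeddings mentioned above, following the standard
   formulation (even dimensions get a sine, odd dimensions the matching cosine):

import numpy as np

def sinusoidal_positions(max_len, dim):
    """Position p, dimension 2i -> sin(p / 10000^(2i/dim)); dimension 2i+1 -> cos of the same."""
    positions = np.arange(max_len)[:, None]                    # (max_len, 1)
    rates = 1.0 / np.power(10000, np.arange(0, dim, 2) / dim)  # one frequency per pair of dims
    pe = np.zeros((max_len, dim))
    pe[:, 0::2] = np.sin(positions * rates)
    pe[:, 1::2] = np.cos(positions * rates)
    return pe                                                  # added to the word embeddings

print(sinusoidal_positions(max_len=50, dim=8).shape)           # (50, 8)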

NMT: conclusions

  Machine translation
     ▶   Transform source to target language
     ▶   Sequence to sequence (encoder-decoder) framework
  Attention mechanisms
     ▶   Learn to align inputs and outputs
     ▶   Can look at all words from input
  Self-attention
     ▶   Transformer / BERT
  Zero-shot learning
     ▶   Evade the “language-pair" requirement
     ▶   Can be interesting in all of NLP
