COMP90042 LECTURE 22
MT: PHRASE BASED &
NEURAL ENCODER-DECODER
Copyright 2018 The University of Melbourne
OVERVIEW
‣ Phrase based SMT
‣ Scoring formula
‣ Decoding algorithm
‣ Neural network ‘encoder-decoder’
WORD- AND PHRASE-BASED MT
‣ We have already seen word-based models of translation
‣ now used for alignment, but not for actual translation
‣ overly simplistic formulation
‣ Phrase-based MT
‣ treats n-grams as translation units, referred to as
‘phrases’ (not linguistic phrases though)
[Figure 5.1 from Koehn (2009): Phrase-based machine translation. The input is segmented into phrases (not necessarily linguistically motivated), translated one-to-one into phrases in English and possibly reordered.]
PHRASE VS WORD BASED MT
‣ Phrase-pairs memorise:
‣ common translation fragments (have access to local
context in choosing lexical translation)
‣ common reordering patterns (making up for naïve models
of reordering)
Example: did not slap the green witch
did not slap ↔ no dio una bofetada
the green witch ↔ la bruja verde
FINDING & SCORING PHRASE PAIRS
‣ “Extract” phrase pairs as contiguous chunks in word-aligned text; then
‣ compute counts over the whole corpus
‣ normalise counts to produce ‘probabilities’
‣ E.g.,
P(im haus bleibt | will stay in the house)
  = c(will stay in the house; im haus bleibt) / c(will stay in the house)
[Figure from Koehn (2009): Extracting a phrase from a word alignment between "michael geht davon aus , dass er im haus bleibt" and "michael assumes that he will stay in the house". The English phrase "assumes that" and the German phrase "geht davon aus , dass" are aligned, because their words are aligned to each other.]
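The count-and-normalise step can be sketched as follows. This is a minimal illustration rather than the extraction procedure itself: the list of extracted phrase pairs and its counts are made up, and the estimate conditions on the English phrase, i.e. P(f | e) = c(e; f) / c(e).

```python
from collections import Counter

# Hypothetical phrase pairs (English, German), one entry per time the pair
# was extracted from the word-aligned corpus. The data is invented.
extracted = [
    ("will stay in the house", "im haus bleibt"),
    ("will stay in the house", "im haus bleibt"),
    ("will stay in the house", "zu hause bleiben wird"),
    ("stays at home", "im haus bleibt"),
]

pair_counts = Counter(extracted)                   # c(e; f)
english_counts = Counter(e for e, _ in extracted)  # c(e)


def phrase_prob(f, e):
    """Relative-frequency estimate of P(f | e) = c(e; f) / c(e)."""
    return pair_counts[(e, f)] / english_counts[e]


p = phrase_prob("im haus bleibt", "will stay in the house")
print(p)
```

Because the counts are normalised per English phrase, the estimates for all German phrases paired with the same English phrase sum to one.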
THE PHRASE-TABLE
‣ The phrase-table consists of all phrase-pairs and their scores, which forms the search space for decoding
‣ E.g., for natuerlich it may contain the following translations

A phrase translation table of English translations for the German natuerlich:

Translation      Probability p(e|f)
of course        0.5
naturally        0.3
of course ,      0.15
, of course ,    0.05

‣ generally a massive list with many millions of phrase-pairs
DECODING
E*, A* = argmax_{E,A} score(E, A, F)
‣ A describes the segmentation of F into phrases;
and the re-ordering of their translations to produce E
‣ The score function is a product of the
‣ translation “probability”, P(F|E), split into phrase-pairs
‣ language model probability, P(E), over full sentence E
‣ distortion cost, d(start_i, end_{i-1}), measuring the amount of
reordering between adjacent phrase-pairs
‣ Search problem
‣ find translation E* with the best overall score
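A rough sketch of how the three components might combine for one candidate derivation, working in log space. The slide does not give the form of the distortion cost; the exponential penalty alpha^|start_i - end_{i-1} - 1| used here is an assumption, and the phrase probabilities and language model probability are invented numbers.

```python
import math


def distortion_cost(start_i, end_prev, alpha=0.5):
    """Assumed form: alpha^|start_i - end_{i-1} - 1| (1.0 for monotone order)."""
    return alpha ** abs(start_i - end_prev - 1)


def score(derivation, lm_prob, alpha=0.5):
    """Log score of a derivation.

    derivation: list of (src_start, src_end, phrase_prob) in target order,
    using 1-based inclusive source positions.
    lm_prob: language model probability P(E) of the full output sentence.
    """
    total = math.log(lm_prob)
    end_prev = 0
    for start, end, phrase_prob in derivation:
        total += math.log(phrase_prob)
        total += math.log(distortion_cost(start, end_prev, alpha))
        end_prev = end
    return total


# "er geht ja nicht nach hause" -> "he | does not | go | home"
# source spans: er (1,1), ja nicht (3,4), geht (2,2), nach hause (5,6)
derivation = [(1, 1, 0.5), (3, 4, 0.4), (2, 2, 0.2), (5, 6, 0.5)]
log_score = score(derivation, lm_prob=0.001)
print(log_score)
```

Decoding then amounts to searching for the segmentation, ordering and phrase translations that maximise this score.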
TRANSLATION PROCESS
‣ Score the translations based on translation
probabilities (step 2), reordering (step 3) and
language model scores (steps 2 & 3).
er geht ja nicht nach hause
1: segment er geht ja nicht nach hause
2: translate he go does not home
3: order he does not go home
Figure from Koehn, 2009
SEARCH PROBLEM
er geht ja nicht nach hause
[Figure: translation options from the phrase-table for each source word and multi-word span, e.g. er → he, it; geht → is, are, goes, go; ja → yes, is, of course; nicht → not, do not, does not, is not; nach → after, to, according to, in; hause → house, home, chamber, at home]
‣ Cover all source words exactly once; visited in any order; and
with any segmentation into “phrases”
‣ Choose a translation from phrase-table options
Leads to millions of possible translations…
Figure from Koehn, 2009
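To get a feel for the size of the search space, the sketch below counts derivations for a six-word input: every segmentation into contiguous phrases, every ordering of those phrases, and every choice of translation. The figure of four translation options per phrase is a hypothetical assumption, and in practice many spans have no phrase-table entry, so this is an illustration rather than an exact count.

```python
from math import comb, factorial


def count_derivations(n_words, options_per_phrase):
    """Count segmentations x phrase orderings x translation choices."""
    total = 0
    for k in range(1, n_words + 1):               # k = number of phrases
        segmentations = comb(n_words - 1, k - 1)  # place k-1 boundaries
        orderings = factorial(k)                  # visit phrases in any order
        choices = options_per_phrase ** k         # one translation per phrase
        total += segmentations * orderings * choices
    return total


print(count_derivations(6, 1))  # segmentations and orderings only: 1631
print(count_derivations(6, 4))  # with 4 options per phrase: 3628964
```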
DYNAMIC PROGRAMMING SOLUTION
‣ Akin to Viterbi algorithm
‣ factor out repeated computation
(like Viterbi for HMMs, “chart” used in parsing)
‣ efficiently solve the maximisation problem
‣ Aim is to translate every word of the input once
‣ searching over every segmentation into phrases;
‣ the translations of each phrase; and
‣ all possible orderings of the phrases
PHRASE-BASED DECODING
er geht ja nicht nach hause
Start with empty state
Figure from Koehn, 2009
PHRASE-BASED DECODING
er geht ja nicht nach hause
Expand by choosing input span and generating translation
[Figure: the empty state expanded into a first hypothesis, "are"]
Figure from Koehn, 2009
PHRASE-BASED DECODING
er geht ja nicht nach hause
Consider all possible options to start the translation
[Figure: initial hypotheses "he", "are" and "it"]
Figure from Koehn, 2009
PHRASE-BASED DECODING
er geht ja nicht nach hause
Continue to expand states, visiting uncovered words. Generating
outputs left to right.
[Figure: search graph of partial translations, including "he", "it", "are", "yes", "goes", "does not", "go", "home" and "to"]
Figure from Koehn, 2009
PHRASE-BASED DECODING
er geht ja nicht nach hause
Read off translation from best complete derivation by backtracking
[Figure: search graph of partial translations, including "he", "it", "are", "yes", "goes", "does not", "go", "home" and "to"]
Figure from Koehn, 2009
REPRESENTING TRANSLATION STATE
‣ Need to record
‣ translation of phrase
‣ which source words have been translated, as a bit-vector
‣ last n-1 words in E, so that the n-gram LM can compute the
probability of subsequent words
‣ end position of the last phrase translated in the source,
for scoring distortion in next step
‣ Together these allow the score computation to be
factorised
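One possible way to hold this state is sketched below. The class name, field names, the `extend` helper and the `<s>` start symbol are all hypothetical; the point is that two hypotheses agreeing on coverage, LM context and last source position can be scored identically from then on, so only the better one needs to be kept.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Hypothesis:
    coverage: int                        # bit i set if source word i is translated
    lm_context: Tuple[str, ...]          # last n-1 target words
    last_end: int                        # end of the last source phrase (exclusive)
    score: float                         # log score so far
    phrase: Tuple[str, ...] = ()         # translation of the most recent phrase
    back: Optional["Hypothesis"] = None  # backpointer for reading off the output

    def key(self):
        """Everything future scoring depends on."""
        return (self.coverage, self.lm_context, self.last_end)


def extend(hyp, start, end, target, logp, n=3):
    """Translate source words start..end-1 (0-based) as `target`."""
    span = ((1 << (end - start)) - 1) << start
    if hyp.coverage & span:
        raise ValueError("span overlaps already-translated words")
    combined = hyp.lm_context + tuple(target)
    context = combined[max(0, len(combined) - (n - 1)):]
    return Hypothesis(hyp.coverage | span, context, end,
                      hyp.score + logp, tuple(target), hyp)


empty = Hypothesis(coverage=0, lm_context=("<s>",), last_end=0, score=0.0)
a = extend(extend(empty, 0, 1, ["he"], -0.5), 2, 4, ["does", "not"], -1.0)
b = extend(extend(empty, 0, 1, ["it"], -0.9), 2, 4, ["does", "not"], -1.0)
print(a.key() == b.key())  # same state reached via different histories
```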
COMPLEXITY
‣ Full search is intractable
‣ word-based and phrase-based decoding are NP-complete;
this arises from arbitrary reordering
‣ A solution is to prune the search space
‣ Use beam search, a form of approximate search
‣ maintaining no more than k options (“hypotheses”)
‣ pruning over translations that cover a given number of
input words
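The pruning scheme can be sketched as a small stack decoder: one stack per number of covered input words, each cut down to the k best hypotheses before it is expanded. This is a simplified illustration under several assumptions: the phrase table and its probabilities are invented, the language model is left out for brevity, and the distortion penalty is assumed to be alpha raised to the jump distance.

```python
import math

# Hypothetical toy phrase table: source phrase -> [(translation, probability)]
PHRASE_TABLE = {
    ("er",): [("he", 0.6), ("it", 0.4)],
    ("geht",): [("goes", 0.5), ("go", 0.3)],
    ("ja",): [("yes", 0.6)],
    ("nicht",): [("not", 0.8)],
    ("ja", "nicht"): [("does not", 0.7)],
    ("geht", "ja", "nicht"): [("does not go", 0.6)],
    ("nach", "hause"): [("home", 0.8)],
}


def beam_decode(source, table, k=5, alpha=0.5, max_len=3):
    n = len(source)
    # stacks[c] holds hypotheses covering c source words, each a tuple of
    # (log score, coverage bit-vector, end of last source phrase, output words)
    stacks = [[] for _ in range(n + 1)]
    stacks[0].append((0.0, 0, 0, ()))
    for c in range(n):
        # prune: keep only the k best hypotheses covering c words
        stacks[c] = sorted(stacks[c], key=lambda h: h[0], reverse=True)[:k]
        for score, cov, last_end, out in stacks[c]:
            for start in range(n):
                for end in range(start + 1, min(n, start + max_len) + 1):
                    span = ((1 << (end - start)) - 1) << start
                    if cov & span:
                        continue  # some of these words are already covered
                    jump = abs(start - last_end)
                    for target, p in table.get(tuple(source[start:end]), []):
                        new_score = score + math.log(p) + jump * math.log(alpha)
                        stacks[c + end - start].append(
                            (new_score, cov | span, end, out + tuple(target.split())))
    if not stacks[n]:
        return None  # no complete derivation survived
    best = max(stacks[n], key=lambda h: h[0])
    return " ".join(best[3]), best[0]


result = beam_decode("er geht ja nicht nach hause".split(), PHRASE_TABLE)
print(result)
```

With this toy table the best complete hypothesis is "he does not go home". A smaller k makes the search faster but increases the chance that the best derivation is pruned early.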
PHRASE-BASED MT SUMMARY
‣ Start with sentence-aligned parallel text
1. learn word alignments
2. extract phrase-pairs from word alignments &
normalise counts
3. learn a language model
‣ Now decode test sentences using
beam-search (where 2 & 3 above form part of
scoring function)
NEURAL MACHINE TRANSLATION
‣ Phrase-based approach is rather complicated!
‣ The neural approach poses the question:
‣ Can we throw away all this complexity, instead learn a
single model to directly translate from source to target?
‣ Using deep learning of neural networks
‣ learn robust representations of words and sentences
‣ attempts to generate words in the target given “deep”
(vector/matrix) representation of the source
ENCODER-DECODER MODELS
‣ So-called “sequence2sequence” models combine:
‣ encoder which represents the source sentence as a
vector or matrix of real values
‣ akin to word2vec’s method for learning word vectors
‣ decoder which predicts the word sequence in the target
‣ framed as a language model, albeit conditioned on the encoder
representation
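A toy sketch of the two components, written in plain Python. Everything specific here is an assumption for illustration: a simple (Elman) RNN cell, a hidden size of 4, tiny vocabularies, and random untrained weights. The encoder folds the source into a vector c; the decoder is a language model whose hidden state starts at c.

```python
import math
import random

random.seed(0)
H = 4  # hidden size (hypothetical)
SRC = ["Aller", "Anfang", "ist", "schwer", "STOP"]
TGT = ["START", "Beginnings", "are", "difficult", "STOP"]


def matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def rnn_step(w_in, w_rec, emb, h):
    """One simple RNN step: h' = tanh(W_in emb + W_rec h)."""
    return [math.tanh(a + b) for a, b in zip(matvec(w_in, emb), matvec(w_rec, h))]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


# Untrained random parameters
src_emb = {w: [random.uniform(-0.5, 0.5) for _ in range(H)] for w in SRC}
tgt_emb = {w: [random.uniform(-0.5, 0.5) for _ in range(H)] for w in TGT}
enc_in, enc_rec = matrix(H, H), matrix(H, H)
dec_in, dec_rec = matrix(H, H), matrix(H, H)
out_proj = matrix(len(TGT), H)


def encode(source):
    """Run the encoder RNN over the source; the final state is the vector c."""
    h = [0.0] * H
    for w in source:
        h = rnn_step(enc_in, enc_rec, src_emb[w], h)
    return h


def log_prob(target, c):
    """log P(target | c): predict each target word given the previous one."""
    h, prev, total = c, "START", 0.0
    for w in target:
        h = rnn_step(dec_in, dec_rec, tgt_emb[prev], h)
        probs = softmax(matvec(out_proj, h))
        total += math.log(probs[TGT.index(w)])
        prev = w
    return total


c = encode(["Aller", "Anfang", "ist", "schwer", "STOP"])
lp = log_prob(["Beginnings", "are", "difficult", "STOP"], c)
print(lp)
```

Training would adjust the parameters so that reference translations receive high probability; that step is omitted here.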
RECURRENT NEURAL NETWORKS (RNNS)
What is a vector representation of a sequence?
[Figure: an RNN reads START x1 x2 x3 x4 and summarises the sequence as a vector c]
Slide credit: Duh, Dyer et al. 2015
RNN ENCODER-DECODERS
What is the probability of a sequence?
[Figure: the encoder reads "Aller Anfang ist schwer STOP" into a vector c]
Slide credit: Duh, Dyer et al. 2015
RNN ENCODER-DECODERS
What is the probability of a sequence?
[Figure: the encoder reads "Aller Anfang ist schwer STOP" into c; the decoder starts from START and generates "Beginnings are difficult STOP"]
Slide credit: Duh, Dyer et al. 2015
RNN ATTENTION MODEL
What is the probability of a sequence?
[Figure: source "Aller Anfang ist schwer STOP"]
Slide credit: Duh, Dyer et al. 2015
RNN ATTENTION MODEL
What is the probability of a sequence?
[Figure: source "Aller Anfang ist schwer STOP"; output so far: START Beginnings]
Slide credit: Duh, Dyer et al. 2015
RNN ATTENTION MODEL
What is the probability of a sequence?
[Figure: source "Aller Anfang ist schwer STOP"; output so far: START Beginnings are]
Slide credit: Duh, Dyer et al. 2015
RNN ATTENTION MODEL
What is the probability of a sequence?
[Figure: source "Aller Anfang ist schwer STOP"; output so far: START Beginnings are difficult]
Slide credit: Duh, Dyer et al. 2015
RNN ATTENTION MODEL
What is the probability of a sequence?
[Figure: source "Aller Anfang ist schwer STOP"; output: START Beginnings are difficult STOP]
Slide credit: Duh, Dyer et al. 2015
APPLICATIONS OF SEQ2SEQ
‣ Machine translation
‣ Summarisation (document as input)
‣ Speech recognition & speech synthesis
‣ Image captioning & image generation
‣ Word morphology (over characters)
‣ e.g., study → student; receive → recipient;
play → player; pay → payer/payee
‣ Generating source code from text & more….
EVALUATION: DID IT WORK?
‣ Given input in Persian
, هنر امپرسیونیسم, رقص باله, تلویزیون,ملبورن مهد و مرکز پیدایش صنعت فیملسازی و سیمنا
سبکهای مختلف رقص مثل نیو وگ و ملبورن شافل در استرالیا و مرکز مهم موزﯾﮏ کالسﯾﮏ و امروزی در
.این کشوراست
‣ Google translate outputs the English
Melbourne cradle and center of origin of the film industry and cinema, television,
ballet, art, impressionism, various dance styles such as New Vogue and the
Melbourne Shuffle in Australia and an important center of classical and
contemporary music in this country.
‣ Ask a bilingual speaker to judge? Ask them to rate two components
‣ fluency: follows the grammar of English, and is semantically coherent
‣ adequacy: contains the same information as the original source document
‣ or edit the sentence until it is adequate, and measure #changes, time spent etc.
REUSABLE EVALUATION
‣ What if we have one (or several) good translations,
e.g.
Referred to as Australia's “cultural capital” it
is the birthplace of Australian
impressionism, Australian rules football, the
Australian film and television industries, and
Australian contemporary dance such as the
Melbourne Shuffle. It is recognised as a
UNESCO City of Literature and a major
centre for street art, music and theatre.
‣ We can use this text to evaluate many different MT
system outputs for the same input
AUTOMATIC EVALUATION
‣ How many words are shared between the output:
Melbourne cradle and center of origin of the film industry and cinema,
television, ballet, art, impressionism, various dance styles such as New
Vogue and the Melbourne Shuffle in Australia and an important center of
classical and contemporary music in this country.
‣ And the reference:
Referred to as Australia’s “cultural capital” it is the birthplace of Australian
impressionism, Australian rules football, the Australian film and television
industries, and Australian contemporary dance such as the Melbourne
Shuffle. It is recognised as a UNESCO City of Literature and a major centre for
street art, music and theatre.
MT EVALUATION: BLEU
‣ BLEU measures closeness of translation to one or
more references
‣ defined as:
BLEU = bp ⨉ (prec_1-gram ⨉ prec_2-gram ⨉ prec_3-gram ⨉ prec_4-gram)^(1/4)
‣ geometric mean of 1, 2, 3 & 4-gram precisions
‣ prec_n-gram = num n-grams correct / num n-grams predicted in output
‣ numerator clipped to #occurrences of n-gram in the reference
‣ and a brevity penalty to hedge against short outputs
‣ bp = min ( 1, output length / reference length )
‣ Correlates with human judgements of fluency &
adequacy
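A minimal sentence-level sketch of the calculation, against a single reference and without smoothing. It assumes the four precisions are combined by an unweighted geometric mean, and it uses the simplified brevity penalty min(1, output length / reference length) given above; the original BLEU definition uses an exponential brevity penalty and aggregates counts over a whole test corpus, so scores from this sketch will not match standard tools. The example sentences are invented.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(output, reference, max_n=4):
    """Simplified BLEU of one tokenised output against one tokenised reference."""
    precisions = []
    for n in range(1, max_n + 1):
        out_counts = ngrams(output, n)
        ref_counts = ngrams(reference, n)
        # clip each n-gram's count to its number of occurrences in the reference
        correct = sum(min(c, ref_counts[g]) for g, c in out_counts.items())
        total = sum(out_counts.values())
        precisions.append(correct / total if total else 0.0)
    if min(precisions) == 0.0:
        return 0.0  # geometric mean is zero if any precision is zero
    bp = min(1.0, len(output) / len(reference))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)


output = "the cat sat on the mat".split()
reference = "the cat sat on the red mat".split()
print(bleu(output, reference))
```

Here the n-gram precisions are 6/6, 4/5, 3/4 and 2/3, and the brevity penalty is 6/7 because the output is one word shorter than the reference.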
SUMMARY
‣ Word vs phrase based MT
‣ Components of phrase-based approach
‣ Decoding algorithm
‣ Neural encoder-decoder
‣ Evaluation using BLEU
‣ Reading
‣ JM2 25.7 – 25.9
‣ Neural Machine Translation and Sequence-to-sequence
Models: A Tutorial, Neubig 2017, Sections 7 & 8
https://arxiv.org/abs/1703.01619