The HiFST System for the EuroParl Spanish-to-English Task∗

Gonzalo Iglesias⋆    Adrià de Gispert‡    Eduardo R. Banga⋆    William Byrne‡

⋆ University of Vigo. Dept. of Signal Processing and Communications. Vigo, Spain
  {giglesia,erbanga}@gts.tsc.uvigo.es
‡ University of Cambridge. Dept. of Engineering. CB2 1PZ Cambridge, U.K.
  {ad465,wjb31}@eng.cam.ac.uk

Abstract: In this paper we present results for the Europarl Spanish-to-English translation task. We use HiFST, a novel hierarchical phrase-based translation system implemented with finite-state technology that creates target lattices rather than k-best lists.

Keywords: Statistical Machine Translation, Hierarchical Phrase-Based Translation, Rule Patterns

1   Introduction

In this paper we report translation results on the EuroParl Spanish-to-English translation task using HiFST, a novel hierarchical phrase-based translation system that builds word translation lattices guided by a CYK parser based on a synchronous context-free grammar induced from automatic word alignments. HiFST has achieved state-of-the-art results for Arabic-to-English and Chinese-to-English translation tasks (Iglesias et al., 2009a). This decoder is easily implemented with Weighted Finite-State Transducers (WFSTs) by means of standard operations such as union, concatenation, epsilon removal, determinization, minimization and shortest-path, available for instance in the OpenFst libraries (Allauzen et al., 2007). In every CYK cell we build a single, minimal word lattice containing all possible translations of the source sentence span covered by that cell. When derivations contain non-terminals, arcs refer to lower-level lattices for memory efficiency, working effectively as pointers. These are only expanded to the actual translations if pruning is required during search; otherwise, expansion is only carried out at the upper-most cell, after the full CYK grid has been traversed.

∗ This work was supported in part by the GALE program of the Defense Advanced Research Projects Agency, Contract No. HR0011-06-C-0022. G. Iglesias was supported by Spanish Government research grant BES-2007-15956 (project TEC2006-13694-C03-03).

1.1   Related Work

Hierarchical translation systems (Chiang, 2007), or simply Hiero, share many common features with standard phrase-based systems (Koehn, Och, and Marcu, 2003), such as feature-based translation and strong target language models. However, they differ greatly in that they guide their reordering model by a powerful synchronous context-free grammar, which leads to flexible and highly lexicalized reorderings. They yield state-of-the-art performance for some of the most complex translation tasks, such as Chinese-to-English. Thus, Hiero decoding is one of the dominant trends in the field of Statistical Machine Translation.

We summarize some extensions to the basic approach to put our work in context.

Hiero Search Refinements: Huang and Chiang (2007) offer several refinements to cube pruning to improve translation speed. Venugopal et al. (2007) introduce a Hiero variant with relaxed constraints for hypothesis recombination during parsing; speed and results are comparable to those of cube pruning, as described by Chiang (2007). Li and Khudanpur (2008) report significant improvements in translation speed by taking unseen n-grams into account within cube pruning to minimize language model requests. Dyer et al. (2008) extend the translation of source sentences to the translation of input lattices, following Chappelier et al. (1999).
Extensions to Hiero: Several authors describe extensions to Hiero that incorporate additional syntactic information (Zollmann and Venugopal, 2006; Zhang and Gildea, 2006; Shen, Xu, and Weischedel, 2008; Marton and Resnik, 2008) or combine it with discriminative latent models (Blunsom, Cohn, and Osborne, 2008).

Analysis and Contrastive Experiments: Zollmann et al. (2008) compare phrase-based, hierarchical and syntax-augmented decoders for translation of Arabic, Chinese, and Urdu into English. Lopez (2008) explores whether lexical reordering or the phrase discontiguity inherent in hierarchical rules explains improvements over phrase-based systems. Hierarchical translation has also been used to great effect in combination with other translation architectures, e.g. (Sim et al., 2007; Rosti et al., 2007).

WFSTs for Translation: There is extensive work on using Weighted Finite-State Transducers for machine translation (Bangalore and Riccardi, 2001; Casacuberta, 2001; Kumar and Byrne, 2005; Mathias and Byrne, 2006; Graehl, Knight, and May, 2008).

This paper proceeds as follows. Section 2 describes the HiFST system; Section 3 provides details of the translation task and optimization. Section 4 discusses different strategies for reducing the size of the grammar. Finally, Section 5 shows rescoring results, after which we conclude.
2   HiFST System Description

In brief, HiFST is a hierarchical decoder that builds target word lattices guided by a synchronous context-free grammar consisting of a set R = {Rr} of rules Rr : N → ⟨γr, αr⟩ / pr. A priori, N represents any non-terminal; in this paper, N can be either S, X, or V. As usual, the special glue rules S → ⟨X,X⟩ and S → ⟨S X,S X⟩ are included. T denotes the terminals (words), and the grammar builds parses based on strings γ, α ∈ ({S, X, V} ∪ T)+, where we follow Chiang's (2007) general restrictions on the grammar.

Figure 1: The HiFST System

As shown in Figure 1, the system performs translation in three main steps. The first step is a variant of the classic Cocke-Younger-Kasami (CYK) algorithm, closely related to CYK+ (Chappelier and Rajman, 1998), in which hypothesis recombination without pruning is performed and back-pointers are maintained. Although the model is a synchronous grammar, in this stage only the source language sentence is parsed, using the corresponding context-free grammar with rules N → γ. Each cell in the CYK grid is specified by a non-terminal symbol and a position in the grid: (N, x, y), which spans s_x^{x+y−1} of the source sentence s_1 ... s_J.
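
The following is a minimal sketch (in Python, not the authors' implementation) of this first step: a toy source-side grammar is parsed bottom-up, every derivation is recorded with its back-pointers, and no pruning is applied. The grammar, sentence and cell layout are invented for illustration, and spans are 0-based here rather than 1-based.

```python
# Sketch of the CYK-style first step: fill cells (N, x, y) with all
# derivations and their back-pointers, with no pruning.
from collections import defaultdict

# Toy source-side rules N -> gamma; uppercase tokens are non-terminals.
RULES = [
    ("X", ("casa",)),
    ("X", ("la", "casa")),
    ("X", ("X", "blanca")),
    ("S", ("X",)),          # source side of glue rule S -> <X,X>
    ("S", ("S", "X")),      # source side of glue rule S -> <S X,S X>
]
NONTERMINALS = {"S", "X"}

def matches(gamma, sentence, x, y, grid):
    """Yield one back-pointer tuple per way gamma covers sentence[x:x+y]."""
    if not gamma:
        if y == 0:
            yield ()
        return
    head, rest = gamma[0], gamma[1:]
    if head in NONTERMINALS:
        # Try every sub-span length for the leading non-terminal.
        for span in range(1, y + 1):
            if (head, x, span) in grid:
                for bp in matches(rest, sentence, x + span, y - span, grid):
                    yield ((head, x, span),) + bp
    elif y >= 1 and sentence[x] == head:
        for bp in matches(rest, sentence, x + 1, y - 1, grid):
            yield bp

def cyk_parse(sentence):
    grid = defaultdict(list)  # (N, x, y) -> list of (rule, back-pointers)
    for y in range(1, len(sentence) + 1):        # span length, bottom-up
        for x in range(len(sentence) - y + 1):   # span start
            changed = True
            while changed:                       # iterate for unary chains
                changed = False
                for lhs, gamma in RULES:
                    for bp in matches(gamma, sentence, x, y, grid):
                        entry = ((lhs, gamma), bp)
                        if entry not in grid[(lhs, x, y)]:
                            grid[(lhs, x, y)].append(entry)
                            changed = True
    return grid

grid = cyk_parse(["la", "casa", "blanca"])
print(("S", 0, 3) in grid, len(grid[("S", 0, 3)]))   # top cell is reached
```
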

For the second step, we use a recursive algorithm to construct word lattices with all possible translations produced by the hierarchical rules. Construction proceeds by traversing the CYK grid along the back-pointers established in parsing. In each cell (N, x, y) of the CYK grid, we build a target-language word lattice L(N, x, y). This lattice contains every translation of s_x^{x+y−1} from every derivation headed by N. For each rule in this cell, a simple acceptor is built based on the target side, as the result of the standard concatenation of acceptors that encode either terminals or non-terminals. A word is encoded trivially with an acceptor of two states bound by a single transition arc. In turn, a non-terminal corresponds to a lattice retrieved by means of its back-pointer to a lower-level cell. If this lower-level lattice has not been required previously, it has to be built first. Once we have all the acceptors (one per rule) that apply to (N, x, y), we obtain the cell lattice L(N, x, y) by unioning all these acceptors. For a source sentence with J words, the lattice we are looking for is at the top cell (S, 1, J).
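
To make the construction concrete, the sketch below mimics the recursion over back-pointers. For readability each lattice is represented by its language, i.e. a set of target strings, so that union and concatenation become set operations; HiFST performs the same steps over WFST acceptors. The grid, rules and cells are toy stand-ins, and non-terminals are marked as the single uppercase token 'X'.

```python
# Sketch of the second step: build the cell lattice L(N, x, y) by
# concatenating per-rule acceptors and unioning them across rules.

def build_lattice(cell, grid, cache):
    """Return all translations of the span covered by `cell`, recursively."""
    if cell in cache:                      # lower-level lattice already built
        return cache[cell]
    language = set()
    for (lhs, gamma, alpha), backpointers in grid[cell]:
        # One 'acceptor' per rule: concatenate terminals and sub-lattices.
        partial = {()}
        nt_index = 0
        for symbol in alpha:               # target side of the rule
            if symbol.isupper():           # non-terminal -> follow back-pointer
                sub = build_lattice(backpointers[nt_index], grid, cache)
                nt_index += 1
                partial = {p + s for p in partial for s in sub}
            else:                          # terminal word: two-state acceptor
                partial = {p + (symbol,) for p in partial}
        language |= partial                # union of the per-rule acceptors
    cache[cell] = language
    return language

# Toy grid: cell -> [(rule, back-pointers)], rule = (N, gamma, alpha).
grid = {
    ("X", 1, 1): [(("X", ("casa",), ("house",)), ())],
    ("X", 1, 2): [(("X", ("X", "blanca"), ("white", "X")), (("X", 1, 1),))],
    ("S", 0, 3): [(("S", ("la", "X"), ("the", "X")), (("X", 1, 2),))],
}
print(build_lattice(("S", 0, 3), grid, {}))   # {('the', 'white', 'house')}
```
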
The final translation lattice L(S, 1, J) can grow very large after the pointer arcs are expanded. Therefore, in the third step we apply a word-based language model via WFST composition, and perform likelihood-based pruning (Allauzen et al., 2007) based on the combined translation and language model scores.

This method can be seen as a generalization of the k-best algorithm with cube pruning (Chiang, 2007), as the performance of cube pruning search is clearly limited by the size k of each k-best list. For small tasks where k is sufficiently large compared to the number of translations of each derivation, the search could be exhaustive. On the other hand, for reasonably large tasks, the inventory of hierarchical phrases is much bigger than standard phrase-pair tables, and using a large k is impossible without exponentially increasing memory requirements and decoding time. In practice, values of no more than k=1000 or k=10000 are used. This results in search errors, which have two negative consequences. Clearly, translation quality is undermined, as the obtained first-best hypothesis is suboptimal given the models. Additionally, the quality of the obtained k-best list is also suboptimal, which limits the margin of gain potentially achievable by subsequent re-ranking strategies (such as high-order language model rescoring or Minimum Bayes Risk).

2.1   Skeleton Lattices

As explained previously, the lattice-building step uses a recursive algorithm. If this were actually carried out over full word lattices, memory consumption would increase dramatically. Fortunately, we can avoid this by using arcs that encode a reference to lower-level lattices, thus working as pointers. In other words, for each cell we actually build a skeleton of the desired lattice. This skeleton lattice is a mixture of words and pointers to lower-level cell lattices, but WFST operations are still possible. In this way, skeleton lattices in each cell can be greatly reduced through operations such as epsilon removal, determinization and minimization. Expansion is ideally carried out over the final skeleton lattice in (S, 1, J) using a single recursive replacement operation. Nevertheless, redundancy still remains in the final expanded lattice, as the concatenation and union of sublattices with different spans typically lead to repeated hypotheses.

The use of the lattice pointer arc was inspired by the 'lazy evaluation' techniques developed by Mohri et al. (2000). Its implementation uses the infrastructure provided by the OpenFst libraries for delayed composition, etc.
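
A sketch of the pointer-arc mechanism, again over the set-of-strings stand-in used above: pointer tokens refer to lower cells, and a single recursive replacement expands them, playing the role of OpenFst's Replace operation. Cell contents here are invented examples.

```python
# Sketch of skeleton-lattice expansion: substitute pointer tokens by the
# lattices they refer to, recursively, starting from the top cell.

def expand(skeleton, cells):
    """Recursively substitute pointer tokens by their sub-lattices."""
    expanded = set()
    for path in skeleton:
        partial = {()}
        for token in path:
            if isinstance(token, tuple) and token[0] == "PTR":
                sub = expand(cells[token[1]], cells)   # follow the pointer arc
                partial = {p + s for p in partial for s in sub}
            else:
                partial = {p + (token,) for p in partial}
        expanded |= partial
    return expanded

cells = {
    ("X", 1, 2): {("white", "house")},               # lower-level lattice
}
skeleton_top = {("the", ("PTR", ("X", 1, 2)))}       # word + pointer arc
print(expand(skeleton_top, cells))                   # {('the', 'white', 'house')}
```
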
2.2   Pruning in Lattice Building

Ideally, pruning-in-search performed during lattice building should be avoided, but this is not always possible, depending on the complexity of the grammar and the size of the sentence. In HiFST this pruning-in-search is triggered selectively by three concurrent factors: the number of states of a given lattice in a cell, the size of the source-side span (in words), and the non-terminal category. This pruning-in-search consists of applying the language model via composition, likelihood-based pruning, and posterior removal of the language model scores.
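
A sketch of how such a trigger might be expressed; the thresholds and per-category limits below are illustrative placeholders, not values used by HiFST.

```python
# Sketch of the three-factor pruning-in-search trigger described above.
MAX_STATES = {"X": 5000, "V": 2000, "S": 10000}   # hypothetical per-category limits

def needs_search_pruning(num_states, span_len, nonterminal,
                         min_span=9, state_limit=MAX_STATES):
    """Prune only when the cell lattice is big, the span is long, and the
    non-terminal category allows it -- all three factors must concur."""
    return (nonterminal in state_limit
            and span_len >= min_span
            and num_states > state_limit[nonterminal])

print(needs_search_pruning(12000, 14, "X"))   # True: all three factors concur
print(needs_search_pruning(12000, 3, "X"))    # False: span too short
```
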
3   Development and Tuning

We present results on the Spanish-to-English translation shared task of the ACL 2008 Workshop on Statistical Machine Translation, WMT (Callison-Burch et al., 2008). The parallel corpus statistics are summarized in Table 1. Specifically, throughout all our experiments we use the Europarl dev2006 and test2008 sets for development and test, respectively.

        sentences    words    vocab
ES        1.30M      38.2M     140k
EN                   35.7M     106k

Table 1: Parallel corpora statistics.

The training was performed using lower-cased data. Word alignments were generated using GIZA++ (Och, 2003) over a stemmed¹ version of the parallel text. After unioning the Viterbi alignments, the stems were replaced with their original words, and phrase-based rules of up to five source words in length were extracted (Koehn, Och, and Marcu, 2003). Hierarchical rules with up to two non-contiguous non-terminals on the source side are then extracted, applying the restrictions described by Chiang (2007).

¹ We used the Snowball stemmer, available at http://snowball.tartarus.org.

The Europarl language model is a 4-gram back-off language model with Kneser-Ney smoothing (Kneser and Ney, 1995) and default cutoffs, estimated over the concatenation of the Europarl and News language model training data.

Minimum error training (Och, 2003) under BLEU (Papineni et al., 2001) is used to optimize the feature weights of the decoder with respect to the dev2006 development set. We obtain a k-best list from the translation lattice and extract each feature score with an aligner variant of a k-best cube-pruning decoder. This variant efficiently produces the most probable rule segmentation that generated each output hypothesis, along with the contribution of each feature. The following features are optimized (combined log-linearly, as sketched after this list):

• Target language model
• Number of usages of the glue rule
• Word and rule insertion penalties
• Word deletion scale factor
• Source-to-target and target-to-source translation models
• Source-to-target and target-to-source lexical models
• Three rule count features inspired by (Bender et al., 2007) that classify rules by occurrence count (once, twice, or more than twice)
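
As a minimal sketch (with purely illustrative feature names and values), the optimized weights enter a standard log-linear model: the score of a derivation is the dot product between its feature values and the MET-tuned weights.

```python
# Sketch of the log-linear combination of the optimized features.
FEATURES = ["lm", "glue", "word_penalty", "rule_penalty", "deletion",
            "s2t_trans", "t2s_trans", "s2t_lex", "t2s_lex",
            "count_1", "count_2", "count_many"]   # hypothetical names

def derivation_score(values, weights):
    """Log-linear model: sum of weighted feature log-scores."""
    return sum(weights[f] * values[f] for f in FEATURES)

weights = {f: 1.0 for f in FEATURES}          # MET would tune these on dev2006
values = {f: -0.5 for f in FEATURES}          # per-derivation feature log-scores
print(derivation_score(values, weights))      # -6.0
```
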
4   Grammar Design

In order to work with a reasonably small grammar that remains competitive in performance, we apply three filtering strategies successfully used for Chinese-to-English and Arabic-to-English translation tasks (Iglesias et al., 2009b):

• Pattern and mincount-per-class filtering
• Hiero Shallow model
• Filtering by number of translations

We also add deletion rules, i.e. rules that delete single words. In the following subsections we explain these strategies.
4.1   Filtering by Patterns and Mincounts

Even after applying the rule extraction constraints proposed by Chiang (2007), our initial grammar G for dev2006 exceeds 138M rules, of which only 1M are simple phrase-based rules. With the following procedure we reduce the size of the initial grammar using a more informed criterion than general mincount filtering.

A rule pattern is obtained simply by replacing every sequence of terminals with a single symbol 'w' (indicating a word, i.e. a terminal string, w ∈ T+). Every rule of the grammar has one unique corresponding rule pattern. For instance, the rule

⟨X2 en X1 ocasiones , on X1 occasions X2⟩

corresponds to the rule pattern

⟨X2 w X1 w , w X1 w X2⟩

We further classify patterns by number of non-terminals Nnt and number of elements Ne (non-terminals plus terminal substrings). There are 5 possible classes: Nnt.Ne = 1.2, 1.3, 2.3, 2.4, 2.5. We apply different mincount filterings to each class.
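
A sketch of pattern extraction and classification as just described; rules are written as token lists, and only the two non-terminal names used in the paper's example are handled.

```python
# Sketch of rule-pattern extraction: collapse every maximal run of terminals
# to 'w', then classify the source pattern by Nnt (non-terminals) and Ne
# (non-terminals plus terminal runs).

def pattern(side):
    """Collapse terminal runs in one side of a rule to 'w'."""
    out = []
    for token in side:
        if token in ("X1", "X2"):
            out.append(token)
        elif not out or out[-1] != "w":    # start of a new terminal run
            out.append("w")
    return out

def classify(src_pattern):
    nnt = sum(1 for t in src_pattern if t != "w")
    ne = len(src_pattern)
    return f"{nnt}.{ne}"                   # one of the Nnt.Ne classes 1.2 .. 2.5

src = ["X2", "en", "X1", "ocasiones"]
tgt = ["on", "X1", "occasions", "X2"]
print(pattern(src), pattern(tgt))   # ['X2','w','X1','w'] ['w','X1','w','X2']
print(classify(pattern(src)))       # 2.4
```
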
Our first working grammar was built by excluding the patterns reported in Table 2 and limiting the number of translations per source side to 20. In brief, we have filtered out identical patterns (corresponding to rules with the same source and target pattern) and some monotonic non-terminal patterns (rule patterns in which non-terminals do not reorder from source to target). Identical patterns encompass a large number of rules, and we have not been able to improve performance by using them in other translation tasks. Additionally, we have also applied mincount filtering to Nnt.Ne = 1.3 and Nnt.Ne = 2.4.

Excluded Rules                                       Types
⟨X1 w,X1 w⟩ , ⟨w X1,w X1⟩                          1530797
⟨X1 w X2,∗⟩                                         737024
⟨X1 w X2 w,X1 w X2 w⟩ , ⟨w X1 w X2,w X1 w X2⟩     41600246
⟨w X1 w X2 w,∗⟩                                   45162093
Nnt.Ne = 1.3, mincount=5                          39013887
Nnt.Ne = 2.4, mincount=10                          6836855

Table 2: Rules excluded from grammar G.
4.2   Deletion Rules

It has been found experimentally that statistical machine translation systems tend to benefit from allowing a small number of deletions. In other words, allowing some input words to be ignored (untranslated, or translated to NULL) can improve translation output. For this purpose, we add to our grammar one deletion rule for each source-language word, i.e. synchronous rules with the target side set to a special tag identifying the null word. In practice, this represents a huge increase in the search space, as any number of consecutive words could be left untranslated. To control this undesired situation, we limit the number of consecutive deleted words to one. This is achieved by means of standard composition with the simple finite-state transducer shown in Figure 2, which accepts all words but rejects a null word immediately after one has been accepted.

Figure 2: Consecutive Null filtering.
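
A sketch of the constraint encoded by Figure 2, written as a plain two-state automaton over an output token sequence rather than as a WFST; the NULL tag name is invented.

```python
# Sketch of the two-state machine limiting consecutive deletions to one:
# after accepting one NULL, another NULL is rejected until a real word is seen.
NULL = "<null>"   # hypothetical deletion tag

def at_most_one_consecutive_null(tokens):
    """Accept iff the sequence never contains two NULLs in a row."""
    state = 0                       # 0: last token was a word, 1: was NULL
    for token in tokens:
        if token == NULL:
            if state == 1:          # second consecutive deletion: no path
                return False
            state = 1
        else:
            state = 0               # any real word resets the automaton
    return True

print(at_most_one_consecutive_null(["we", NULL, "agree"]))        # True
print(at_most_one_consecutive_null(["we", NULL, NULL, "agree"]))  # False
```
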

4.3   Hiero Shallow Model

A traditional Hiero grammar (X → ⟨γ,α⟩, γ, α ∈ ({X} ∪ T)+, in addition to the glue rules) allows rule nesting limited only by the maximum word span. In contrast, the Hiero Shallow model allows hierarchical rules to be applied only once, on top of phrase-based rules for the same given span. Stated more formally, for s, t ∈ T+ and γs, αs ∈ ({V} ∪ T)+, the Hiero Shallow grammar consists of three kinds of rules:

X → ⟨γs,αs⟩
X → ⟨V,V⟩
V → ⟨s,t⟩

Shifting from one grammar to the other is as simple as rewriting X non-terminals in γ, α to V (see the sketch below). It is easy to imagine the impact this has in terms of speed, as it drastically reduces the size of the search space. In this sense, using a Hiero Shallow grammar can be seen as a filtering technique.
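
A sketch of this rewriting on toy rules: hierarchical rules keep head X but their nested non-terminals become V, phrase pairs are re-headed as V rules, and the X → ⟨V,V⟩ rule is added.

```python
# Sketch of converting a full Hiero grammar to the shallow grammar above.

def to_shallow(rules):
    shallow = []
    for lhs, gamma, alpha in rules:
        if all(tok not in ("X1", "X2") for tok in gamma):
            shallow.append(("V", gamma, alpha))          # phrase pair: V -> <s,t>
        else:
            relabel = lambda side: tuple("V" + t[1] if t in ("X1", "X2") else t
                                         for t in side)
            shallow.append(("X", relabel(gamma), relabel(alpha)))
    shallow.append(("X", ("V1",), ("V1",)))              # add X -> <V,V>
    return shallow

rules = [
    ("X", ("casa", "blanca"), ("white", "house")),       # toy rules
    ("X", ("X1", "blanca"), ("white", "X1")),
]
for r in to_shallow(rules):
    print(r)
# ('V', ('casa', 'blanca'), ('white', 'house'))
# ('X', ('V1', 'blanca'), ('white', 'V1'))
# ('X', ('V1',), ('V1',))
```
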
Whether it has a negative impact on performance depends on each translation task: for instance, it was not useful for Chinese-to-English, as this task takes advantage of nested rules to find better reorderings encompassing a large number of words. On the other hand, a Spanish-to-English translation task is not expected to require big reorderings; thus, a priori it is a good candidate for this kind of grammar. In effect, Table 3 shows that a hierarchical shallow grammar yields the same performance as full hierarchical translation.

Hiero Model    dev2006       test2008
Shallow        33.65/7.852   33.65/7.877
Full           33.63/7.849   33.66/7.880

Table 3: Performance of Hiero Full versus Hiero Shallow grammars.
4.4   Filtering by Number of Translations

Filtering rules by a fixed number of translations per source side (NT) allows faster decoding with the same performance. As stated before, the previous experiments for this task used a convenient baseline filtering of 20 translations. In our experience, this has been a good threshold for the NIST 2008 Arabic-to-English and Chinese-to-English translation tasks (Iglesias et al., 2009b; Iglesias et al., 2009a). In Table 4 we compare the performance of our shallow grammar under different filterings, i.e. by 30 and 40 translations respectively. Interestingly, the grammar with 30 translations yields a slight improvement, but widening to 40 translations does not further improve translation performance.

NT    dev2006       test2008
20    33.65/7.852   33.65/7.877
30    33.61/7.849   33.75/7.896
40    33.63/7.853   33.73/7.883

Table 4: Performance of G1 when varying the filter by number of translations, NT.

4.5   Revisiting Patterns and Class Mincounts

In order to review the grammar design decisions taken in Section 4, and to assess their impact on translation quality, we consider three competing grammars: G1, G2 and G3. G1 is the shallow grammar with NT = 20 already used (baseline). G2 is a subset of G1 obtained by applying a mincount filter of 5 to Nnt.Ne = 2.3 and Nnt.Ne = 2.5; with this smaller grammar (3.25M rules, versus 3.65M for G1) we would like to evaluate whether we can obtain the same performance. G3 (4.42M rules) is a superset of G1 to which the identical pattern ⟨X1 w,X1 w⟩ has been added. Table 5 shows translation performance with each of them. The decrease in performance for G2 is not surprising: the rules filtered out of G2 belong to reordered non-terminal rule patterns (Nnt.Ne = 2.3 and Nnt.Ne = 2.5) and to some highly lexicalized monotonic non-terminal patterns from Nnt.Ne = 2.5 with three terminal substrings. More interesting is the comparison between G1 and G3, where we see that this extra identical rule pattern produces a degradation in performance.

      dev2006       test2008
G1    33.65/7.852   33.65/7.877
G2    33.47/7.838   33.65/7.877
G3    33.09/7.787   33.14/7.808

Table 5: Contrastive performance with three slightly different grammars.

5   Rescoring and Final Results

After translation with optimized feature weights, we apply the following two rescoring steps to the output lattice (an MBR sketch follows the list):

• Large-LM rescoring. We build sentence-specific zero-cutoff stupid-backoff (Brants et al., 2007) 5-gram language models, estimated using ∼4.7B words of English newswire text, and apply them to rescore the word lattices generated by HiFST.

• Minimum Bayes Risk (MBR). We rescore the first 1000-best hypotheses with MBR, taking the negative sentence-level BLEU score as the loss function (Kumar and Byrne, 2004).
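
A minimal sketch of n-best MBR rescoring: hypotheses are re-ranked by expected gain under the posterior induced by the model scores. The gain function below is a simple unigram-overlap stand-in for the sentence-level BLEU used in the paper, and the n-best list is invented.

```python
# Sketch of Minimum Bayes Risk rescoring over an n-best list.
import math

def mbr_rescore(nbest):
    """nbest: list of (hypothesis_tokens, log_score); return the MBR choice."""
    z = max(score for _, score in nbest)                 # for stability
    weights = [math.exp(score - z) for _, score in nbest]
    total = sum(weights)
    posterior = [w / total for w in weights]

    def gain(hyp, ref):                                  # stand-in for BLEU
        return len(set(hyp) & set(ref)) / max(len(set(ref)), 1)

    def expected_gain(hyp):                              # loss = -gain
        return sum(p * gain(hyp, ref) for (ref, _), p in zip(nbest, posterior))

    return max((hyp for hyp, _ in nbest), key=expected_gain)

nbest = [("the white house".split(), -1.0),
         ("the white houses".split(), -1.2),
         ("a white house".split(), -1.3)]
print(mbr_rescore(nbest))   # ['the', 'white', 'house']
```
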
Table 6 shows results for our best Hiero model so far (using G1 with NT = 30) and the subsequent rescoring steps. Gains from large language models are more modest than those from MBR, possibly due to the domain discrepancy between the EuroParl and the additional newswire data. Table 7 contains examples extracted from dev2006. Scores are state-of-the-art, comparable to the top submissions in the WMT08 shared-task results (Callison-Burch et al., 2008).

          dev2006       test2008
HiFST     33.61/7.849   33.75/7.896
+5gram    33.66/7.902   33.90/7.954
+MBR      33.87/7.901   34.24/7.962

Table 6: EuroParl Spanish-to-English translation results (lower-cased IBM BLEU / NIST) after MET and subsequent rescoring steps.

6   Conclusions

HiFST is a hierarchical phrase-based translation system that builds target word lattices. Based on well-known standard WFST operations (such as union, concatenation and composition, among others), it is easy to implement, e.g. with the OpenFst library. The compact representation of multiple translation hypotheses in lattice form requires less pruning and allows better hypothesis recombination (via standard determinization) than cube-pruning decoders, yielding fewer search errors and lower overall memory use than k-best lists. For the EuroParl Spanish-to-English translation task, we have shown that full Hiero grammars perform similarly to Hiero Shallow grammars, which allow only one level of hierarchical rules and thus yield much faster decoding times. We also present results with language-model rescoring and MBR. The performance of our system is state-of-the-art for Spanish-to-English (Callison-Burch et al., 2008).
Spanish: Estoy de acuerdo con él en cuanto al papel central que debe conservar en el futuro la comisión como garante del interés general comunitario.
English: I agree with him about the central role that must be retained in the future the commission as guardian of the general interest of the community.

Spanish: Por ello, creo que es muy importante que el presidente del eurogrupo -que nosotros hemos querido crear- conserve toda su función en esta materia.
English: I therefore believe that it is very important that the president of the eurogroup - which we have wanted to create - retains all its role in this area.

Spanish: Creo por este motivo que el método del convenio es bueno y que en el futuro deberá utilizarse mucho más.
English: I therefore believe that the method of the convention is good and that in the future must be used much more.

Table 7: Examples from the EuroParl Spanish-to-English dev2006 set.

References

Allauzen, Cyril, Michael Riley, Johan Schalkwyk, Wojciech Skut, and Mehryar Mohri. 2007. OpenFst: A general and efficient weighted finite-state transducer library. In Proceedings of CIAA, pages 11–23.
Bangalore, Srinivas and Giuseppe Riccardi. 2001. A finite-state approach to machine translation. In Proceedings of NAACL.
Bender, Oliver, Evgeny Matusov, Stefan Hahn, Sasa Hasan, Shahram Khadivi, and Hermann Ney. 2007. The RWTH Arabic-to-English spoken language translation system. In Proceedings of ASRU, pages 396–401.
Blunsom, Phil, Trevor Cohn, and Miles Osborne. 2008. A discriminative latent variable model for statistical machine translation. In Proceedings of ACL-HLT, pages 200–208.
Brants, Thorsten, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. 2007. Large language models in machine translation. In Proceedings of EMNLP-ACL, pages 858–867.
Callison-Burch, Chris, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. 2008. Further meta-evaluation of machine translation. In Proceedings of the ACL Workshop on Statistical Machine Translation, pages 70–106.
Casacuberta, Francisco. 2001. Finite-state transducers for speech-input translation. In Proceedings of ASRU.
Chappelier, Jean-Cédric and Martin Rajman. 1998. A generalized CYK algorithm for parsing stochastic CFG. In Proceedings of TAPD, pages 133–137.
Chappelier, Jean-Cédric, Martin Rajman, Ramón Aragüés, and Antoine Rozenknop. 1999. Lattice parsing for speech recognition. In Proceedings of TALN, pages 95–104.
Chiang, David. 2007. Hierarchical phrase-based translation. Computational Linguistics, 33(2):201–228.
Dyer, Christopher, Smaranda Muresan, and Philip Resnik. 2008. Generalizing word lattice translation. In Proceedings of ACL-HLT, pages 1012–1020.
Graehl, Jonathan, Kevin Knight, and Jonathan May. 2008. Training tree transducers. Computational Linguistics, 34(3):391–427.
Huang, Liang and David Chiang. 2007. Forest rescoring: Faster decoding with integrated language models. In Proceedings of ACL, pages 144–151.
Iglesias, Gonzalo, Adrià de Gispert, Eduardo R. Banga, and William Byrne. 2009a. Hierarchical phrase-based translation with weighted finite state transducers. In Proceedings of NAACL, pages 433–441.
Iglesias, Gonzalo, Adrià de Gispert, Eduardo R. Banga, and William Byrne. 2009b. Rule filtering by pattern for efficient hierarchical translation. In Proceedings of EACL, pages 380–388.
Kneser, R. and H. Ney. 1995. Improved backing-off for m-gram language modeling. In Proceedings of ICASSP, volume 1, pages 181–184.
Koehn, Philipp, Franz J. Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of NAACL, pages 48–54.
Kumar, Shankar and William Byrne. 2004. Minimum Bayes-risk decoding for statistical machine translation. In Proceedings of HLT-NAACL, pages 169–176.
Kumar, Shankar and William Byrne. 2005. Local phrase reordering models for statistical machine translation. In Proceedings of HLT-EMNLP, pages 161–168.
Li, Zhifei and Sanjeev Khudanpur. 2008. A scalable decoder for parsing-based machine translation with equivalent language model state maintenance. In Proceedings of the ACL-HLT Second Workshop on Syntax and Structure in Statistical Translation, pages 10–18.
Lopez, Adam. 2008. Tera-scale translation models via pattern matching. In Proceedings of COLING, pages 505–512.
Marton, Yuval and Philip Resnik. 2008. Soft syntactic constraints for hierarchical phrased-based translation. In Proceedings of ACL-HLT, pages 1003–1011.
Mathias, Lambert and William Byrne. 2006. Statistical phrase-based speech translation. In Proceedings of ICASSP.
Mohri, Mehryar, Fernando Pereira, and Michael Riley. 2000. The design principles of a weighted finite-state transducer library. Theoretical Computer Science, 231:17–32.
Och, Franz J. 2003. Minimum error rate training in statistical machine translation. In Proceedings of ACL, pages 160–167.
Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2001. BLEU: a method for automatic evaluation of machine translation. Technical Report RC22176 (W0109-022), IBM T.J. Watson Research Center.
Rosti, Antti-Veikko, Necip Fazil Ayan, Bing Xiang, Spyros Matsoukas, Richard Schwartz, and Bonnie Dorr. 2007. Combining outputs from multiple machine translation systems. In Proceedings of HLT-NAACL, pages 228–235.
Shen, Libin, Jinxi Xu, and Ralph Weischedel. 2008. A new string-to-dependency machine translation algorithm with a target dependency language model. In Proceedings of ACL-HLT, pages 577–585.
Sim, Khe Chai, William Byrne, Mark Gales, Hichem Sahbi, and Phil Woodland. 2007. Consensus network decoding for statistical machine translation system combination. In Proceedings of ICASSP, volume 4, pages 105–108.
Venugopal, Ashish, Andreas Zollmann, and Stephan Vogel. 2007. An efficient two-pass approach to synchronous-CFG driven statistical MT. In Proceedings of HLT-NAACL, pages 500–507.
Zhang, Hao and Daniel Gildea. 2006. Synchronous binarization for machine translation. In Proceedings of HLT-NAACL, pages 256–263.
Zollmann, Andreas and Ashish Venugopal. 2006. Syntax augmented machine translation via chart parsing. In Proceedings of NAACL Workshop on Statistical Machine Translation, pages 138–141.
Zollmann, Andreas, Ashish Venugopal, Franz Och, and Jay Ponte. 2008. A systematic comparison of phrase-based, hierarchical and syntax-augmented statistical MT. In Proceedings of COLING, pages 1145–1152.