Identifying main obstacles for statistical machine translation of morphologically rich South Slavic languages

                Maja Popović                                                Mihael Arčan
       DFKI – Language Technology Lab                               Insight Centre for Data Analytics
               Berlin, Germany                                    National University of Galway, Ireland
         maja.popovic@dfki.de                                   mihael.arcan@insight-centre.org

Abstract

The best way to improve a statistical machine translation system is to identify concrete problems causing translation errors and address them. Many of these problems are related to the characteristics of the involved languages and differences between them. This work explores the main obstacles for statistical machine translation systems involving two morphologically rich and under-resourced languages, namely Serbian and Slovenian. Systems are trained for translations from and into English and German using parallel texts from different domains, including both written and spoken language. It is shown that for all translation directions structural properties concerning multi-noun collocations and exact phrase boundaries are the most difficult for the systems, followed by negation, preposition and local word order differences. For translation into English and German, articles and pronouns are the most problematic, as well as disambiguation of certain frequent functional words. For translation into Serbian and Slovenian, cases and verb inflections are most difficult. In addition, local word order involving verbs is often incorrect and verb parts are often missing, especially when translating from German.

© 2015 The authors. This article is licensed under a Creative Commons 3.0 licence, no derivative works, attribution, CC-BY-ND.

1   Introduction

The statistical approach to machine translation (SMT), in particular phrase-based SMT, has become widely used in recent years: open source tools such as Moses (Koehn et al., 2007) have made it possible to build translation systems for any language pair, domain or text type within days. Although high quality SMT systems have been developed for certain language pairs, e.g. English-French, a large number of languages and language pairs have not been (systematically) investigated. The largest study about European languages in the Digital Age, the META-NET Language White Paper Series¹ from 2012, showed that only English has good machine translation support, followed by moderately supported French and Spanish. Further languages are only fragmentarily supported (such as German), while the majority of languages are weakly supported. Many of those languages are also morphologically rich, which makes the SMT task more complex, especially if they are the target language. A large part of the weakly supported languages consists of Slavic languages, to which both Serbian and Slovenian belong. Both languages belong to the South Slavic language branch, Slovenian² being the third official South Slavic language in the EU and Serbian³ the official language of a candidate member state. For all these reasons, a systematic investigation of SMT systems involving these two languages, defining the most important errors and problems, can be very beneficial for further studies.

¹ http://www.meta-net.eu/whitepapers/key-results-and-cross-language-comparison
² together with Croatian and Bulgarian
³ together with Bosnian and Montenegrin

In the last decade, several SMT systems have been built for various South Slavic languages and English, and for some systems certain morpho-syntactic transformations have been applied more
or less successfully. However, all the experiments are rather scattered and no systematic or focused research has been carried out in order to define actual translation errors as well as their causes.

This paper reports results of an extensive error analysis for four language pairs: Serbian and Slovenian with English as well as with German, which is also a challenging language: complex (both as a source and as a target language) and still not widely supported. SMT systems have been built for all translation directions using publicly available parallel texts originating from several different domains including both written and spoken language.

2   Related work

The first publications dealing with SMT systems for Serbian-English (Popović et al., 2005) and Slovenian-English (Sepesy Maučec et al., 2006) report first results using small bilingual corpora. Improvements for translation into English are also reported by reducing morpho-syntactic information in the morphologically rich source language. Using morpho-syntactic knowledge for the Slovenian-English language pair was shown to be useful for both translation directions in (Žganec Gros and Gruden, 2007). However, no analysis of results has been carried out in terms of which actual problems were caused by the rich morphology and which of those were solved by the morphological preprocessing.

Through the transLectures project,⁴ the Slovenian-English language pair became available in the 2013 evaluation campaign of IWSLT.⁵ The campaign report gives the BLEU scores of TED talks translated by several systems, however a deeper error analysis is missing (Cettolo et al., 2013).

⁴ https://www.translectures.eu/
⁵ International Workshop on Spoken Language Translation, http://workshop2013.iwslt.org/

Recent work in SMT also deals with the Croatian language, which is very closely related to Serbian. First results for Croatian-English are reported in (Ljubešić et al., 2010) on a small weather forecast corpus, and an SMT system for the tourist domain is presented in (Toral et al., 2014). Furthermore, SMT systems for both Serbian and Croatian are described in (Popović and Ljubešić, 2014), however only translation errors caused by language mixing are analysed, not the problems related to the languages themselves.

Different SMT systems for subtitles were developed in the framework of the SUMAT project,⁶ including Serbian and Slovenian (Etchegoyhen et al., 2014). However, only the translations between them have been carried out, as an example of closely related and highly inflected languages.

⁶ http://www.sumat-project.eu

3   Language characteristics

Serbian (referred to as “sr”) and Slovenian (“sl”), as Slavic languages, have quite free word order and are highly inflected. The inflectional morphology is very rich for all word classes. There are six distinct cases affecting not only common nouns, but also proper nouns as well as pronouns, adjectives and some numbers. Some nouns and adjectives have two distinct plural forms depending on the number (less than five or not). There are also three genders for the nouns, pronouns, adjectives and some numbers, leading to differences between the cases and also between the verb participles for past tense and passive voice.

As for verbs, person and many tenses are expressed by the suffix, and, similarly to Spanish and Italian, the subject pronoun (e.g. I, we, it) is often omitted. In addition, negation of three quite important verbs, “biti (sr/sl)” (to be), “imati (sr) / imeti (sl)” (to have) and “hteti (sr) / hoteti (sl)” (to want), is formed by adding the negative particle to the verb as a prefix. Furthermore, there are two verb aspects: many verbs have perfective and imperfective form(s) depending on the duration of the described action. These forms are either different although very similar, or are distinguished only by a prefix.

As for syntax, both languages have a quite free word order, and there are no articles, neither indefinite nor definite.

Although the two languages share a large degree of morpho-syntactic properties and mutual intelligibility, a speaker of one language might have difficulties with the other. The language differences are both lexical (including a number of false friends) and grammatical (such as local word order, verb mood and/or tense formation, question structure, dual in Slovenian, usage of some cases).

4   SMT systems

In order to systematically explore SMT issues related to the targeted languages, five different domains were used in total. However, not all domains were used for all language pairs due to unavailability. It should be noted that according to the META-NET White Papers, both languages have minimal support, with only fragmentary text and speech resources. For the Slovenian-English and Slovenian-German language pairs, four domains were investigated: DGT translation memories provided by the JRC (Steinberger et al., 2012), Europarl (Koehn, 2005), the European Medicines Agency corpus (EMEA) in the pharmaceutical domain, as well as the OpenSubtitles⁷ corpus. All the corpora were downloaded from the OPUS web site⁸ (Tiedemann, 2012). For the Serbian language, only two domains were available: the enhanced version of the SEtimes corpus⁹ (Tyers and Alperen, 2010) containing “news and views from South-East Europe” for Serbian-English, and OpenSubtitles for the Serbian-English and Serbian-German language pairs. It should be noted that all the corpora contain written texts except OpenSubtitles, which contains transcriptions and translations of spoken language and is thus slightly peculiar for machine translation. On the other hand, this is the only corpus containing all language pairs of interest.

⁷ http://www.opensubtitles.org/
⁸ http://opus.lingfil.uu.se/
⁹ http://nlp.ffzg.hr/resources/corpora/setimes/

Table 1 shows the amount of parallel sentences for each language pair and domain (a) as well as the average sentence length for each language and domain (b). For each domain, a separate system has been trained and tuned on an unseen portion of in-domain data. Since the sentences in OpenSubtitles are significantly shorter than in other texts, the tuning and test sets for this domain contain 3000 sentences whereas all other sets contain 1000 sentences. Another remark regarding the OpenSubtitles corpus is that we trained our systems only on those sentence pairs which were available in English as well as in German, in order to have exactly the same conditions for all systems.

(a) number of sentences
# of Sentences      sl-en   sl-de   sr-en   sr-de
DGT                 3.2M    3M      /       /
Europarl            600k    500k    /       /
EMEA                1M      1M      /       /
OpenSubtitles       1.8M    1.8M    1.8M    1.8M
SEtimes             /       /       200k    /

(b) average sentence length
Avg. Sent. Length   sl      sr      en      de
DGT                 16.0    /       17.3    16.6
Europarl            23.4    /       27.0    25.4
EMEA                12.7    /       12.3    11.8
OpenSubtitles       7.7     7.6     9.2     8.9
SEtimes             /       22.4    23.8    /

Table 1: Corpora characteristics.

All systems have been trained using phrase-based Moses (Koehn et al., 2007), where the word alignments were built with GIZA++ (Och and Ney, 2003). The 5-gram language model was built with the SRILM toolkit (Stolcke, 2002).

5   Evaluation and error analysis

The evaluation has been carried out in three steps: first, the BLEU scores were calculated for each of the systems. Then, automatic error classification was applied in order to estimate actual translation errors. After that, manual inspection of language related phenomena leading to particular errors was carried out in order to define the most important issues which should be addressed for building better systems and/or developing better models.

5.1   BLEU scores

As a first evaluation step, the BLEU scores (Papineni et al., 2002) have been calculated for each of the translation outputs in order to get a rough idea about the performance for different domains and translation directions.

The scores are presented in Table 2:

  • the highest scores are obtained for translations into English;

  • the scores for translations into German are similar to those for translations into Slovenian and Serbian;

  • the scores for Serbian and Slovenian are better when translated from English than when translated from German;

  • the best scores are obtained for DGT (which contains a large number of repetitions), followed by EMEA (which is a very specific domain); the worst scores are obtained for spoken language OpenSubtitles texts.

Domain/Lang. pair   sl-en   sr-en   sl-de   sr-de   en-sl   de-sl   en-sr   de-sr
DGT                 77.3    /       59.3    /       72.1    58.6    /       /
Europarl            58.9    /       33.8    /       56.0    36.5    /       /
EMEA                69.7    /       53.8    /       66.0    56.2    /       /
OpenSubtitles       38.4    33.2    21.5    22.4    26.2    19.6    22.8    18.4
SEtimes             /       43.8    /       /       /       /       35.8    /

Table 2: BLEU scores for all translation outputs.
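
Some of the observations drawn from Table 2 can be checked mechanically. The following is a minimal Python sketch, not part of the original experiments: the numbers are copied from Table 2, while the helper names (`scores`, `into_english_is_best`, `english_source_beats_german`) are hypothetical and introduced only for illustration.

```python
# BLEU scores copied from Table 2; None marks a missing (domain, direction) cell.
DIRECTIONS = ["sl-en", "sr-en", "sl-de", "sr-de", "en-sl", "de-sl", "en-sr", "de-sr"]
BLEU = {
    "DGT":           [77.3, None, 59.3, None, 72.1, 58.6, None, None],
    "Europarl":      [58.9, None, 33.8, None, 56.0, 36.5, None, None],
    "EMEA":          [69.7, None, 53.8, None, 66.0, 56.2, None, None],
    "OpenSubtitles": [38.4, 33.2, 21.5, 22.4, 26.2, 19.6, 22.8, 18.4],
    "SEtimes":       [None, 43.8, None, None, None, None, 35.8, None],
}


def scores(domain):
    """Return {direction: BLEU} for the cells that exist in one domain."""
    return {d: s for d, s in zip(DIRECTIONS, BLEU[domain]) if s is not None}


def into_english_is_best(domain):
    """True if every into-English score in a domain beats all other scores there."""
    row = scores(domain)
    into_en = [s for d, s in row.items() if d.endswith("-en")]
    others = [s for d, s in row.items() if not d.endswith("-en")]
    return min(into_en) > max(others)


def english_source_beats_german(domain, target):
    """True if translating into `target` scores higher from English than from German."""
    row = scores(domain)
    return row["en-" + target] > row["de-" + target]


if __name__ == "__main__":
    for domain in BLEU:
        print(domain, into_english_is_best(domain))
    print(english_source_beats_german("OpenSubtitles", "sr"))
```

The comparison is made within each domain, since scores from different test sets are not directly comparable.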

In addition, all the BLEU scores are compared with those of Google Translate¹⁰ outputs of all tests. All systems built in this work outperform the Google translation system by absolute differences ranging from 1 to 10%, confirming that the languages are weakly supported for machine translation.

¹⁰ http://translate.google.com, performed on 27th February 2015

5.2   Automatic error classification

Automatic error classification has been performed using Hjerson (Popović, 2011) and the error distributions are presented in Figure 1. For the sake of brevity and clarity, as well as for avoiding redundancies, the error distributions are not presented for all translation outputs, but have been averaged in the following way: since no important differences were observed either between domains (except that OpenSubtitles translations exhibit more inflectional errors than others) or between Serbian and Slovenian (either as source or as target language), the errors are averaged over domains and over the two languages, jointly called “x”. Thus, four error distributions are shown: translation from and into English, and translation from and into German.

The following can be observed:

  • translations into English are “the easiest”, mostly due to the small number of morphological errors; however, the English translation outputs contain more word order errors than Serbian and Slovenian ones;

  • all error rates are higher in German translations than in English ones, but the mistranslation rate is especially high;

  • German translation outputs have fewer morphological errors than Serbian and Slovenian translations from German; on the other hand, more reordering errors can be observed in German outputs;

  • all error rates are higher in translations from German than from English, except inflections.

The results of the automatic error analysis are valuable and already indicate some promising directions for improving the systems, such as word order treatment and handling morphological generation. Nevertheless, more improvement could be obtained if more precise guidance about problems and obstacles related to the language properties and differences were available (apart from the general ones already partly investigated in related work).

5.3   Identifying linguistic related issues

Automatic error analysis has already shown that different language combinations show different error distributions. This often relates to linguistic characteristics of the involved languages as well as to divergences between them. In order to explore those relations, manual inspection of about 200 sentences from each domain and language pair annotated by Hjerson, together with their corresponding source sentences, has been carried out.

As the result of this analysis, the following has been discovered:

  • there is a number of frequent error patterns, i.e. obstacles (issues) for SMT systems

  • nature and frequency of many issues depend on language combination and translation direction

  • some of the translation errors depend on the domain and text type, mostly differing for written and spoken language

  • issues concerning Slovenian and Serbian both as source and as target languages are almost

[Bar chart: error rates [%] (vertical axis, 5 to 20) for the error classes form, order, omission, addition and mistrans, shown for the four translation directions x-en, en-x, x-de and de-x.]

Figure 1: Error rates for five error classes: word form, word order, omission, addition and mistranslation; each error rate represents the percentage of particular (word-level) error type normalised over the total number of words.
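
The normalisation described in the caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it is not Hjerson itself, the counts in the usage note are invented, the function names are hypothetical, and the unweighted mean in `average_rates` is an assumption, since the paper does not state how the averaging over domains and languages was weighted.

```python
ERROR_CLASSES = ("form", "order", "omission", "addition", "mistrans")


def error_rates(error_counts, total_words):
    """Percentage of words in each error class, normalised over the total word count.

    error_counts maps an error class to the number of words tagged with it;
    classes that are absent count as zero.
    """
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return {cls: 100.0 * error_counts.get(cls, 0) / total_words
            for cls in ERROR_CLASSES}


def average_rates(rate_dicts):
    """Unweighted mean of several rate dictionaries (assumption: equal weights)."""
    if not rate_dicts:
        raise ValueError("nothing to average")
    return {cls: sum(r[cls] for r in rate_dicts) / len(rate_dicts)
            for cls in ERROR_CLASSES}
```

For instance, with the invented counts `{"form": 2, "order": 1, "mistrans": 1}` over 20 words, `error_rates` gives 10% form errors and 5% each for order and mistranslation errors.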

the same: there are only few language specific phenomena.

The most frequent general issues,¹¹ i.e. those relevant for all translation directions, are:

¹¹ Non-English parts of examples are literally translated into English and marked with ∗.

  • noun collocations (written language)
    Multi-word expressions consisting of a head noun and additional nouns and adjectives in English pose large problems for all translation directions, especially from and into English. Their formation is different in other languages and often the boundaries are not well detected, so that they are translated to unintelligible phrases.

      source       12th “Tirana Fall” art and music festival
      output∗      12th “Tirana collection fall of art and a music festival

      reference    the ratings agency’s first Emerging Europe Sensitivity Index ( EESI )
      output       the first Index sensitivity Europe in the development of ( EESI ) this agency

  • negation
    Due to the distinct negation forming, mainly concerning multiple negations in Serbian and Slovenian, negation structures are translated incorrectly.

      reference    the prosecution has done nothing so far
      source∗      the prosecution has not done nothing so far
      output       the prosecution is not so far had done nothing

      source       Being a rector does not give someone the freedom
      reference∗   Being a rector does not give nobody the freedom
      output∗      Being a rector does not give some freedom

  • phrase boundaries and order
    Phrase boundaries are not detected properly so that the group(s) of words are misplaced, often accompanied by morphological errors and mistranslations.

      reference    of which about a fifth is used for wheat production
      output       of which used to produce about one fifth of wheat

      reference    But why have I brought this thing here?
      output       This thing, but why am I here?

      reference    The US ambassador to Sofia said on Wednesday .
      output       Said on Wednesday , US ambassador to Sofia .

  • prepositions
    Prepositions are mostly mistranslated, sometimes also omitted or added.

  • word order
    Local word permutations mostly affecting verbs and surrounding auxiliary verbs, pronouns, nouns and adverbs.

Some of the frequent issues are strongly dependent on translation direction. For translation into English and German the issues of interest are:

  • articles
    Due to the absence of articles in Slavic languages, a number of articles in English and German translation outputs are missing, and a certain number is mistranslated or added.

  • pronouns
    The source languages are pro-drop, therefore a number of pronouns is missing in the English and German translation outputs.

  • possessive pronoun “svoj”
    This possessive pronoun can be used for all persons (“my”, “your”, “her”, “his”, “our”, “their”) and it is difficult to disambiguate.

  • verb tense
    Due to different usage of verb tenses in some cases, the wrong verb tense from the source language is preserved, or sometimes mistranslated. The problem is more prominent for translation into English.

  • agreement (German target)
    A number of case and gender agreements in German output is incorrect.

  • missing verbs (German target)
    Verbs or verb parts are missing in German output, especially when they are supposed to appear at the end of the clause.

  • conjunction “i” (and) (Serbian source)
    The main meaning of this conjunction is “and”, but another meaning, “also, too, as well”, is often used too; however, it is usually translated as “and”.

  • adverb “lahko” (Slovenian source)
    This word is used for Slovenian conditional phrases which correspond to English constructions with the modal verbs “can”, “might”, “shall”, or adverbs such as “possibly”; the entire clause is often not translated correctly due to incorrect disambiguation.

For translation into Serbian and Slovenian, the most important obstacles are:

  • case
    Incorrect case is used, often in combination with incorrect singular/plural and/or gender agreement.

  • verb inflection
    Verb inflection does not correspond to the person and/or the tense; a number of past participles also has incorrect singular/plural and/or gender agreement.

  • missing verbs
    Verbs or verb parts are missing, especially for constructions using auxiliary and/or modal verbs. The problem is more frequent when translating from German.

  • question structure (spoken language)
    Question structure is incorrect; the problem is more frequent in Serbian, where additional particles (“li” and “da li”) should be used but are often omitted.

  • conjunction “a” (Serbian target)
    This conjunction does not exist in other languages; it can be translated as “and” or “but”, and its exact meaning is something in between. Therefore it is difficult to disambiguate the corresponding source conjunction.

  • “-ing” forms (English source)
    English present continuous tense does not exist in other languages, and in addition, it is often difficult to determine if the word with the suffix “-ing” is a verb or a noun. Therefore words with the “-ing” suffix are difficult for machine translation.

  • noun-verb disambiguation (English source)
    Apart from the words ending with the suffix “-ing”, there is a number of other English words which can be used both as a noun and as a verb, such as “offer”, “search”, etc.

  • modal verb “sollen” (German source)
    This German modal verb can have different meanings, such as “should”, “might” and “will”, which are often difficult to disambiguate.

It has to be noted that some of the linguistic phenomena known to be difficult are not listed: the reason is that their overall number of occurrences in the analysed texts is low, and therefore the number of related errors is low too. For example, German compounds are well known for posing problems to natural language processing tasks, among which is machine translation – however, in the

language related patterns in distributions of error types. The main part of the evaluation consisted of (manual) analysis of errors taking into account linguistic properties of both target and source language. This analysis has defined a set of general issues which are causing errors for all translation directions, as well as sets of language dependent issues. Although many of these issues are already known to be difficult, they can be addressed only with the precise identification of concrete examples.

The main general issues are shown to be structural properties concerning multi-noun collocations and exact phrase boundaries, followed by negation formation, wrong, missing or added prepositions as well as local word order differences. For translation into English and German, article and pronoun omissions are the most problematic, as well as disambiguation of certain frequent functional words. For translation into Serbian and Slovenian, cases and verb inflections are most difficult to handle. In addition, other problems concerning verbs are frequent as well, such as local word order involving verbs and missing verb parts (which is especially difficult when translating from German).

In future work we plan to address some of the presented issues practically and analyse the effects.
                                                        An important thing concerning system improve-
given texts only a few errors related to compounds
                                                        ment is that although most of the described issues
were observed, as well as a low total number of
                                                        are common for various domains, the exact nature
compounds. Another similar case is the verb as-
                                                        of the texts desired for the task at hand should
pect in Serbian and Slovenian – some related errors
                                                        always be kept in mind. Analysis of issues for
were detected, but their count as well as the overall
                                                        domains and text types not covered by this paper
count of such verbs in the data is very small.
                                                        should be part of future work too.
   Therefore the structure and nature of the texts
for a concrete task should always be taken into ac-     Acknowledgments
count. For example, for improvements of spoken
language translation more effort should be put in       This publication has emanated from research sup-
question treatment than in noun collocation, and in     ported by the QT21 project – European Union’s
technical texts the compound problem would prob-        Horizon 2020 research and innovation programme
ably be significant.                                    under grant number 645452 as well as a research
                                                        grant from Science Foundation Ireland (SFI) under
6    Conclusions and future work                        grant number SFI/12/RC/2289. We are grateful to
                                                        the anonymous reviewers for their valuable feed-
In this work, we have examined several SMT
                                                        back.
systems involving two morphologically rich and
under-resourced languages in order to identify
the most important language related issues which
should be dealt with in order to build better sys-
tems and models. The BLEU scores are reported
as a first evaluation step, followed by automatic
error classification which has captured interesting

References

Cettolo, Mauro, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, and Marcello
  Federico. 2013. Report on the 10th IWSLT evaluation campaign. In Proceedings
  of the International Workshop on Spoken Language Translation (IWSLT),
  Heidelberg, Germany, December.

Etchegoyhen, Thierry, Lindsay Bywood, Mark Fishel, Panayota Georgakopoulou,
  Jie Jiang, Gerard Van Loenhout, Arantza Del Pozo, Mirjam Sepesy Maučec,
  Anja Turner, and Martin Volk. 2014. Machine Translation for Subtitling: A
  Large-Scale Evaluation. In Proceedings of the Ninth International
  Conference on Language Resources and Evaluation (LREC14), Reykjavik,
  Iceland, May.

Koehn, Philipp, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello
  Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran,
  Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan
  Herbst. 2007. Moses: Open source toolkit for statistical machine
  translation. In Proceedings of the 45th Annual Meeting of the ACL on
  Interactive Poster and Demonstration Sessions, Stroudsburg, PA, USA.

Koehn, Philipp. 2005. Europarl: A Parallel Corpus for Statistical Machine
  Translation. In Proceedings of the 10th Machine Translation Summit, pages
  79–86, Phuket, Thailand, September.

Ljubešić, Nikola, Petra Bago, and Damir Boras. 2010. Statistical machine
  translation of Croatian weather forecast: How much data do we need? In
  Lužar-Stiffler, Vesna, Iva Jarec, and Zoran Bekić, editors, Proceedings of
  the ITI 2010 32nd International Conference on Information Technology
  Interfaces, pages 91–96, Zagreb. SRCE University Computing Centre.

Och, Franz Josef and Hermann Ney. 2003. A Systematic Comparison of Various
  Statistical Alignment Models. Computational Linguistics, 29(1):19–51,
  March.

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a
  method for automatic evaluation of machine translation. pages 311–318,
  Philadelphia, PA, July.

Popović, Maja and Nikola Ljubešić. 2014. Exploring cross-language statistical
  machine translation for closely related South Slavic languages. In
  Proceedings of the EMNLP14 Workshop on Language Technology for Closely
  Related Languages and Language Variants, pages 76–84, Doha, Qatar, October.

Popović, Maja, David Vilar, Hermann Ney, Slobodan Jovičić, and Zoran Šarić.
  2005. Augmenting a Small Parallel Text with Morpho-syntactic Language
  Resources for Serbian–English Statistical Machine Translation. pages 41–48,
  Ann Arbor, MI, June.

Popović, Maja. 2011. Hjerson: An Open Source Tool for Automatic Error
  Classification of Machine Translation Output. The Prague Bulletin of
  Mathematical Linguistics, (96):59–68, October.

Sepesy Maučec, Mirjam, Janez Brest, and Zdravko Kačič. 2006. Slovenian to
  English Machine Translation using Corpora of Different Sizes and
  Morpho-syntactic Information. In Proceedings of the 5th Language
  Technologies Conference, pages 222–225, Ljubljana, Slovenia, October.

Steinberger, Ralf, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick
  Schlüter. 2012. DGT-TM: A freely available Translation Memory in 22
  languages. In Proceedings of the 8th International Conference on Language
  Resources and Evaluation (LREC12), pages 454–459, Istanbul, Turkey, May.

Stolcke, Andreas. 2002. SRILM – an extensible language modeling toolkit.
  volume 2, pages 901–904, Denver, CO, September.

Tiedemann, Jörg. 2012. Parallel data, tools and interfaces in OPUS. In
  Proceedings of the 8th International Conference on Language Resources and
  Evaluation (LREC12), pages 2214–2218, May.

Toral, Antonio, Raphael Rubino, Miquel Esplà-Gomis, Tommi Pirinen, Andy Way,
  and Gema Ramirez-Sanchez. 2014. Extrinsic Evaluation of Web-Crawlers in
  Machine Translation: a Case Study on Croatian–English for the Tourism
  Domain. In Proceedings of the 17th Conference of the European Association
  for Machine Translation (EAMT), pages 221–224, Dubrovnik, Croatia, June.

Tyers, Francis M. and Murat Alperen. 2010. South-East European Times: A
  parallel corpus of the Balkan languages. In Proceedings of the LREC
  Workshop on Exploitation of Multilingual Resources and Tools for Central
  and (South-) Eastern European Languages, pages 49–53, Valletta, Malta, May.

Žganec Gros, Jerneja and Stanislav Gruden. 2007. The voiceTRAN machine
  translation system. In Proceedings of the 8th Annual Conference of the
  International Speech Communication Association (INTERSPEECH 07), pages
  1521–1524, Antwerp, Belgium, August. ISCA.
