Slator 2018 Neural Machine Translation Report


                                      Slator Reports

                                      The mention of any public or private entity in this report does not constitute an official
                                      endorsement by Slator.

                                      The information used in this report is either compiled from publicly accessible
                                      resources (e.g. pricing web pages for technology providers), acquired directly from
                                      company representatives, or both. The insights and opinions from Slator’s expert
                                      respondents are theirs alone, and Slator does not necessarily share the same position
                                      where opinions are attributed to respondents.

                                      Slator would like to thank the following for their time, guidance, and insight, without which
                                      this report could not have come to fruition:

                                      Andrew Rufener, CEO, Omniscien Technologies

                                      Diego Bartolome, Machine Translation Manager, TransPerfect

                                      Jean Senellart, CTO, Systran

                                      John Tinsley, CEO, Iconic Translation Machines

                                      Joss Moorkens, post-doctoral researcher, ADAPT Center and Lecturer, Dublin City University

                                      Kirti Vashee, Language Technology Evangelist, SDL

                                      Kyunghyun Cho, Assistant Professor and pioneer researcher, New York University

                                      Maja Popovic, Researcher, DFKI – Language Technology Lab, Berlin, Germany

                                      Mihai Vlad, VP Machine Learning Solutions, SDL

                                      Pavel Levin, Researcher,

                                      Rico Sennrich, pioneer and post-doctoral researcher, MT Group, University of Edinburgh

                                      Samad Echihabi, VP of Research and Product Development, SDL

                                      Samuel Laubli, PhD candidate and CTO, Textshuttle

                                      Silvio Picinini, MT Expert, eBay

                                      Spence Green, CEO, Lilt

                                      Tony O’Dowd, CEO, KantanMT

                                      Yannis Evangelou, Founder, LexiQA


Table of Contents
 Executive Summary

 Neural is the New Black

       The Current NMT Landscape

       So, What Now?

 By the End of 2017, Neural MT was Mainstream

       Neural Network-Based Language Technology Providers

       Current Customized NMT Deployments
              EPO’s Patent Translate
              A Trial NMT Deployment

       NMT Performance: What NMT Can and Cannot Do
              Exceptional Capabilities of NMT
              Current Limitations of NMT

 What’s Next in NMT

       How Do You Quantify Quality?
              Replacing BLEU
              Human Evaluation Remains the Ultimate Standard

       Creating a New Quality Standard
              New and Existing NMT QA Processes
              Machines Testing Machines

       Training Data Becomes Big(ger) Business
              Publicly Accessible Corpora
              Building Your Own Corpora
              Buying Corpora from Others
              Quality is Always a Caveat

       Directions of NMT Research
              “So Many” Exciting Research Directions
              “Convolutional Neural Networks are Doomed”
              Pivot Languages and Zero Shot

 Buy Vs Build

       Quality Versus Cost

       Productivity and Production Boost?

       To Build or Not To Build

       Shifting Paradigms: Changing Models of Working
              From the Experts


                     Executive Summary

                     Neural has become the new standard in machine translation.

                     There are now over 20 providers of neural network-based language technology, four times as many
                     as little over a year earlier.

                     Both the public and private sectors have shown effective use cases for the emerging technology,
                     for example: reducing an estimated 16,000 man-years of work to instant, fluent patent translation,
                     and cutting multilingual content delivery costs by speeding up translation and shortening the
                     pipeline to digital publishing.

                     Today, even as ongoing research continues to leverage the exceptional capabilities of neural
                     machine translation and come up with workarounds to its limitations, important questions that
                     span the entire language industry are arising.

                     How does research move forward in terms of automatically defining output quality and
                     incremental gains? How does the industry efficiently infuse automated processes with much-needed
                     human evaluation? And how are human translators going to interact with much improved machine
                     output?

                     Machine translation, while still niche compared to its human translation counterpart, is a
                     growth market. There are also emerging sub-markets driven by the need for high quality training
                     data. This raises the question, however, of whether commoditized data can actually provide the
                     in-domain, high quality data required by neural engines.

                     Is commoditized training data worth the millions of dollars that Sogou Inc. invested in UTH
                     International, for instance? Is quality training data so essential that Baidu can afford to sell
                     human translation at rock bottom rates to accelerate its acquisition?

                     Effective translation engines can be built with as few as a few hundred thousand and as many as
                     a billion high quality sentence pairs, depending on the use case, domain, and technology. Attempts
                     are also being made at zero-shot models where no reference data is available. Several open-source
                     tools are available for a DIY approach, but expert consensus recommends otherwise.

                     Indeed, many would “fail” at building their own industry-proof neural machine translation
                     models, according to experts. They need a combination of high quality training data, the right
                     technology, and an expert team in place.


The prices of graphics processing units (GPUs) used to
crunch NMT models as well as costs associated with
training will steadily decline, enabling the machines
to feed on ever bigger data sets. As neural machine
translation ripples through the language services
supply chain, increasing adoption will disrupt the way
the industry operates.

The role of the linguist is evolving, shifting further into
the language technology pipeline both in terms of
pre-translation and post-translation responsibilities.
Human and machine interaction is expected to
change both to adapt to new neural network-based
technology and to improve overall usability of existing
tools. Finally, industry-wide changes in pricing
models and opportunities for technological thought
leadership are afoot.

Finding your place in all this technologically-enabled
change requires an understanding of the catalyst for it:
neural machine translation.


     Neural is the New Black
                            Google, Microsoft, IBM, SAP, Systran, Facebook, Amazon, Salesforce, Baidu, Sogou, SDL, DeepL—and this is the
                            short version of a much longer list that includes KantanMT, Omniscien, Lilt, Globalese, and TransPerfect (via
                            Tauyou) and many other startup and mid-sized players.

                            These companies have all become involved in neural machine translation (NMT). Some of them offer NMT
                            solutions. Some use proprietary systems for unique problems. Others are researching ways to incorporate NMT
                            into their existing service and product lines.

                            By now, the generic praise heaped upon the new technology is becoming repetitive: it outperforms statistical machine
                            translation (SMT), it is a genuine breakthrough in AI tech, and it is fast-paced in terms of research and deployment.

                            The industry is well past discussing the emergence of NMT. Clearly, neural is the new black. Now the main concern
                            is to see if you look good in black.

                            The Current NMT Landscape

                            It can be argued that the language services market is at an early stage of tech adoption with regard to NMT,
                            since the current NMT landscape is the result of about four years of research and deployment. The fact is,
                            however, that barely four years of NMT research has already eclipsed around two decades of SMT research.

                                      “It’s fair to say that the industry has condensed 15 years of statistical research
                                      into 3 years of NMT research and produced systems that will outperform the best
                                      SMT can offer.”

                                      —Tony O’Dowd, CEO, KantanMT

                            Speaking at SlatorCon New York in October 2017, Asst. Professor Kyunghyun Cho of New York University said
                            NMT adoption is incredibly fast-paced. He cited two examples of rapid deployment: Google Translate and
                            Facebook—each taking around a year from zero to full deployment.

                                      “It only took about a year and a half for Google to replace the entire system they
                                      built over 12 years into this new system. It’s getting shorter and shorter everyday.
                                      Facebook noticed that Google is doing it, they have to do it as well. If you look
                                      at the date [of announcement]—it’s one year since Google. Google deploys it
                                      in production without telling anybody in detail what they have done, but other
                                      companies have been able to catch up within a year. It’s an extremely fast-moving
                                      field in that every time there is some change, we see the improvement.”

                                      —Asst. Professor Kyunghyun Cho speaking at SlatorCon New York, October 2017

                            In 2015, when Slator first started publishing, NMT was only just starting to build momentum. In February 2017,
                            only Google Translate, Microsoft Translator, and Systran’s Pure Neural provided NMT. Now, Slator has found
                            there are at least 20 language technology providers of various technologies based on or related to NMT.


So, What Now?

This report is developed with a specific audience in mind—an audience that wants to learn about the current state of
NMT. They want to find out what it can and cannot do, delve into use cases to better understand its applications, and
discover ongoing and future research directions, as well as the broader trends that NMT has jumpstarted.

This report first delves into NMT developments in 2017, the year the technology irrevocably went mainstream, and
will then go briefly through some NMT technology providers and users to illustrate the state of NMT last year.

We will look at the exceptional qualities of NMT to clarify what it can really do and what is likely to be hyperbole.
Then, we will touch on the limitations of the technology, with input from subject matter experts from both
academia and the corporate world.

As the industry adjusts to NMT as a new standard technology alongside existing ones like SMT and translation
memory, it also needs to address the increased need for high quality training data and the sub-category of
businesses that are starting to cash in on the demand. This report will discuss this demand for more training data
and a few efforts towards open sourcing both data and technology to encourage growth in NMT. We will also take
a look at quality assurance and post-editing in an NMT-powered industry.

Finally, through a few use cases and insight from subject matter experts, we will address two big questions: should
you buy or build an NMT system? And how does this new normal change your existing ways of working?

By the End of 2017, Neural MT was Mainstream
    Key Takeaways
                         ▶                  2017 was marked by a number of major NMT release announcements
                                            from big players. The number of technology providers spiked from fewer
                                            than five a year earlier to over twenty, generally categorized into enterprise
                                            providers led by Big Tech and independent, boutique providers with
                                            more specialized fields.

                         ▶                  Current production-level NMT deployments include the European Patent
                                            Office and, use cases that highlight the production-level
                                            capabilities of NMT specifically for translation pipelines with repetition
                                            and stringent guidelines (patents), as well as the massive scale of real-
                                            time content (’s descriptions).

                         ▶                  While NMT continues to evolve, there are clear exceptional capabilities
                                            of the technology that will continue to give it an edge in the future,
                                            even as its current limitations are being tackled through research.


                            Slator coverage of NMT in 2017 saw Amazon go from buying a machine translation startup to building up manpower
                            to actually launching its offering, and saw Facebook declare SMT at end of life and finally complete the switch to NMT.

                            Then there were initiatives to make NMT easier to adopt and develop as a technology, such as
                            Harvard and Systran’s OpenNMT project, Google’s public tutorial on the use of its neural network library called
                            TensorFlow, Facebook’s open sourcing of its NMT tech, and Amazon’s open sourcing of its machine learning
                            platform Sockeye.

                            There were also buzz-making launches and announcements, such as DeepL’s unveiling and Sogou’s million dollar
                            investment in UTH International’s training data.

                            Neural Network-Based Language Technology Providers

                            The most highly publicized NMT launches in 2017 (and earlier) were from cloud-based enterprise platforms
                            like Google, Microsoft, Baidu and Amazon. These same “Big Tech” companies often offer free, browser-based,
                            generic engines anyone can use.

                            Free, Browser-Based, Generic NMT Engines

                            •     Baidu

                            •     DeepL

                            •     Google

                            •     GTCom

                            •     Microsoft
                                  and also used on

                            •     PROMT

                            •     Systran

                            •     Tilde

                            •     Yandex

                            This is not an exhaustive list.

                            These free, generic engines are usually provided for demonstration purposes, though the more well-known ones
                            such as Google, Bing Translate and DeepL are used by millions of people for straightforward translation tasks.

                            These Big Tech players all share a common rationale behind their move towards NMT: pitch it to their user base
                            of cloud platform clients as a value-add to their existing service areas. This is also Salesforce’s plan, once they
                            complete their R&D—NMT is intended to be part of a suite of AI-powered features for international clients with
                            multilingual needs.

                            Interestingly, Amazon pitched its NMT offering to LSPs directly, though again with the caveat that those who would
                            benefit most are existing users of Amazon Web Services. As of this writing, Amazon does not offer a browser-based,
                            free, generic NMT engine.

                            Pricing structures for Big Tech NMT providers are mostly fixed, whereas boutique or specialized providers
                            reflect the flexibility and customizability of their services in equally flexible and bespoke pricing.


Big Tech providers usually employ either a straightforward pay-as-you-go pricing strategy,
or augment that with tiers:

•   Amazon Translate - USD 15 per million characters. For the first 12 months from the date of the first
    translation request, users can translate two million characters a month free of charge.

•   Google Translate - USD 20 per million characters. There are custom pricing considerations for the
    Translation API on embedded devices (e.g. TVs, cars, speakers, etc) as well as for translation volume
    exceeding a billion characters monthly.

•   IBM Watson Language Translator - Lite, Standard, and Advanced “plans:”

    •    At Lite, users can translate one million characters a month free of charge using default translation models.

    •    At Standard pricing, the first 250,000 characters translated are free; every thousand characters beyond
         costs USD 0.02. The Standard plan includes news, conversational, and patent translation models.

    •    The Advanced plan is also priced at USD 0.02 per thousand characters translated using standard
         models, but translation using custom models costs USD 0.10 per thousand characters. The Advanced
         plan also includes custom model maintenance at USD 15 monthly per model, pro-rated daily.

•   Microsoft Translator - Free use for the first two million characters; pay-as-you-go rate of USD 10 monthly
    for every million characters after the first two million. Users with regular, massive translation volumes can
    leverage discounts from tiered monthly commitments:

    •    USD 2,055.01 monthly for 250 million characters with overage rate of USD 8.22 for every million
         characters above the limit.

    •    USD 6,000 monthly for one billion characters with overage rate of USD 6 for every million characters
         above the limit.

    •    USD 45,000 monthly for ten billion characters with overage rate of USD 4.50 for every million characters
         above the limit.

•   SAP Translation Hub - EUR 39 (USD 48) for every “bucket” of 100,000 characters per year. Requires SAP
    Cloud Platform license, though the service can be requested “A-la-carte.”

•   Yandex Translate - Starts at USD 15 for every million characters with reduced prices for larger volumes:

    •    USD 15 per million characters monthly for less than 50 million characters a month.

    •    USD 12 per million characters monthly for over 50 million and under 100 million characters a month.

    •    USD 10 per million characters monthly for translation requests between 100 and 200 million characters a month.

    •    USD 8 per million characters monthly for translation requests between 200 and 500 million characters a month.

    •    USD 6 per million characters monthly for translation requests between 500 million and 1 billion
         characters a month.

    •    Custom pricing for over a billion characters translated every month.

•   DeepL - No pricing available but still a heavily trafficked free online engine. (DeepL Pro, an API for developers,
    was launched on March 20th, as this report went to print.)
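
The list prices above lend themselves to a quick cost comparison. The following is a minimal Python sketch, not any provider's actual billing logic. The function names are hypothetical and the rates are copied from the figures quoted in this section. It assumes that free allowances are deducted from each month's volume, and that a Microsoft commitment tier is billed as a flat monthly fee plus overage on characters above the included volume.

```python
# Rough monthly cost estimates from the list prices quoted in this section.
# All function names are illustrative; billing rules are simplified assumptions.

def pay_as_you_go(chars, usd_per_million, free_chars=0):
    """Cost of a monthly volume at a flat per-million-character rate."""
    billable = max(0, chars - free_chars)
    return billable / 1_000_000 * usd_per_million


# (monthly fee in USD, included characters, overage in USD per million characters)
MICROSOFT_COMMITMENTS = [
    (2055.01, 250_000_000, 8.22),
    (6000.00, 1_000_000_000, 6.00),
    (45000.00, 10_000_000_000, 4.50),
]


def commitment_cost(chars, monthly_fee, included_chars, overage_per_million):
    """Flat fee plus overage for characters beyond the included volume."""
    overage_chars = max(0, chars - included_chars)
    return monthly_fee + overage_chars / 1_000_000 * overage_per_million


def cheapest_microsoft(chars):
    """Return (option name, cost) for the cheapest quoted Microsoft option."""
    options = {
        # Assumes the two-million-character free allowance applies every month.
        "pay-as-you-go": pay_as_you_go(chars, 10.0, free_chars=2_000_000),
    }
    for fee, included, overage in MICROSOFT_COMMITMENTS:
        name = f"{included:,}-character commitment"
        options[name] = commitment_cost(chars, fee, included, overage)
    best = min(options, key=options.get)
    return best, round(options[best], 2)


if __name__ == "__main__":
    volume = 100_000_000  # characters per month
    print("Amazon:", pay_as_you_go(volume, 15.0, free_chars=2_000_000))
    print("Google:", pay_as_you_go(volume, 20.0))
    print("Microsoft:", cheapest_microsoft(volume))
```

Under these assumptions, 100 million characters a month would come to USD 1,470 on Amazon Translate (while the free allowance lasts), USD 2,000 on Google Translate, and USD 980 on Microsoft Translator's pay-as-you-go rate. A commitment tier only becomes the cheaper Microsoft option at higher monthly volumes.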


                            There are also boutique and specialized language technology providers such as Systran, Lilt, and KantanMT, who
                            offer their neural network-based language technology without the requisite cloud platform. Some of these include:

                            •    Globalese MT
                            •    GTCom (Chinese provider of AI-based technology)
                            •    Iconic Translation Machines

                            •    KantanMT

                            •    Lilt (adaptive MT)

                            •    National Institute of Information and Communications Technology (NICT) in Japan

                            •    Omniscien Technologies

                            •    PROMT

                            •    Systran

                            •    Tilde NMT

                            These companies are typically language technology, machine learning, or ICT companies in general. Some of them
                            focus on a very specific neural network-based technology, such as Lilt’s predictive, machine learning, adaptive MT.
                            Some companies like Systran are very early adopters of NMT, and continue to contribute to open source initiatives.
                            Others still are active in the research scene. KantanMT and Tilde, for instance, are involved in current consortiums
                            within the European Council’s many language technology initiatives.

                            Finally, some LSPs have also developed NMT offerings in-house or through acquisitions. They typically leverage
                            existing parallel corpora and linguistic expertise for the development of their NMT services. Some LSPs that offer
                            NMT technology today include:

                            •    Pangeanic (PangeaMT)

                            •    SDL

                            •    TransPerfect

                            •    United Language Group (powered by Lucy Software)

                            Current Customized NMT Deployments

                            The NMT provider ecosystem is already taking shape and includes large enterprise players as well as boutique
                            providers. In a consumer context, the technology has been deployed extensively by Google, DeepL, Microsoft and
                            other free online services. As many language service providers are now integrating NMT into their supply chain, we
                            wanted to highlight two high profile (and high volume) customized deployments of NMT:

                            EPO’s Patent Translate

                            Patent Translate is a free translation service the European Patent Office (EPO) launched in 2013 in association
                            with Google. Slator reported in June 2017 that the EPO has moved this service to NMT. At the time the article was
                            published, the EPO used NMT to translate eight (Chinese, French, German, Japanese, Korean, Portuguese, Spanish,
                            and Turkish) of the 28 supported languages.


The EPO blog post announcing the move noted: “The NMT solution is producing significant improvements in the
quality of translated texts.”

EPO President Benoît Battistelli told Slator in that article that the EPO received approximately 15,000 translation requests
on average per day, mostly from India, Japan, Russia and the US, in addition to requests from EPO member states.

The EPO’s use case appears to paint a very positive picture for NMT. For instance, according to Battistelli, it would
take 16,000 man-years to translate the Chinese patent documentation available at the time into English. Meanwhile,
through NMT, Patent Translate provides all that documentation in EPO’s three official languages instantly.

Of course, Patent Translate’s NMT engines are trained specifically in this domain. Our article quoted Battistelli as saying
that the EPO sets a threshold of several tens of thousands of human translations in a language “corpus” before it considers
offering the language in Patent Translate.

If nothing else, the EPO’s case study shows that processing large volumes of automated patent translation with
high quality output is feasible. But there’s a caveat: the training data. Patent Translate is trained on millions of
pieces of highly specialized, in-domain data. In fact, the EPO’s Espacenet, a free online service for searching patents and
patent applications, includes more than 100 million source patent texts from over 90 patent granting authorities,
written in many languages.

According to Jean Senellart, CTO of Systran: “To create a high quality NMT engine, one would need about 20 million
sentences, but there is no upper limit—the more data we put, the better it becomes.”

Diego Bartolome, Machine Translation Manager of TransPerfect, put the figure as high as 1 billion sentences. He did note,
however, that they “produced a fluent NMT engine for Korean with only two million words in the training materials.
So it’s not necessarily that a neural MT engine needs more data. It depends on the goal and its scope.”

“To build a general purpose MT engine the likes of Google Translate, massive volumes of data are needed—tens of
millions of sentence pairs,” said John Tinsley, CEO of Iconic Translation Machines. He provided supporting context
on the not-so-simple relationship between training data and NMT engines: “For domain-specific engines, it totally
depends on the use case, the variety of source content to be translated, vocabulary size, etc. We’ve built production
engines with as little as one million sentence pairs and as much as 48 million sentence pairs.”

A Trial NMT Deployment

A travel fare aggregator and lodging reservations website found that the convergence of three major
technology trends gave them a golden opportunity to try out a production-level NMT system. There was demand for
local language content, access to cheap computing power in the cloud, and now, open-source NMT frameworks.

They built an NMT system from OpenNMT and tried it out. They published a research paper on arXiv detailing their findings:

1.   NMT consistently outperforms SMT;

2.   Performance degraded for longer sentences, but even then NMT still outperformed SMT;

3.   In the case of German, in-house NMT performed better than online general purpose engines;

4.   Fluency of NMT is close to human translation level.

The company reportedly handles 1.55 million room night reservations a day and offers content in 40
different languages. The research paper indicated that property descriptions (hotels, apartments, B&Bs, hostels, etc.)
were a main use case for NMT.

                            Additionally, the paper goes on to note that in-house MT systems can increase translation efficiency “by increasing
                            its speed and reducing the time it takes for a translated property description to appear online, as well as significantly
                            cutting associated translation costs.”

                            Maxim Khalilov, Commercial Owner at the company, said they identified ten use cases for MT within the company and
                            they will focus on these according to a list of priorities.

                            The company’s trial run of an NMT system was motivated by corporate priorities and enabled by access to the required
                            technology. Is it a good example of a corporate user that can build its own system instead of relying on service providers?

                            Pavel Levin, Senior Data Scientist at the company and lead author of the research paper, provided more insight into
                            the matter.

                            “It is true that there are many deep learning frameworks out there which make it relatively easy to build your own
                            NMT system, however your system will only be as good as the data you have,” he said. “In our case we have millions of
                            relevant parallel sentences, which makes it a perfect setting for rolling out our own system. However, if you are a small
                            startup with no data, or getting into a new business area, it might be easier to just buy services from one of several
                            existing commercial general-purpose engines, assuming their quality and usage constraints (legal, technical) suit you.”

                            Language Technology Evangelist Kirti Vashee counseled against attempting to build your own system simply
                            because of the availability of technology such as OpenNMT. “It would be unwise to build NMT or DIY without deep
                            expertise in machine learning, data analysis and preparation, and overall process management skills,” he said. “My
                            recommendation is to find experts and work closely with them.”

                                      “This (building a system from scratch) is not an actual possibility for most of the
                                      translation industry players. I would expect most would fail with such a strategy.”

                                      —Kirti Vashee, Language Technology Evangelist, SDL

                            Asked if it is a question of maturity—whether you should buy until you can build, Vashee said “possibly, but for now
                            buy is a wiser strategy.”

                            NMT Performance: What NMT Can and Cannot Do

                            Most of the hype surrounding NMT is due to its perceived superiority. It is consistently better than predecessor
                            technology in most areas, particularly in terms of translation output fluency.

                            One of the most widely discussed NMT-related news items of 2017 was the launch of technology provider DeepL,
                            developed by the founders of an online dictionary launched in 2009. In Slator’s own, wholly
                            unscientific test, we pitted DeepL against Google Translate by having both translate three short paragraphs taken
                            from a Bloomberg article from English to German.

                            Our anecdotal experiment on DeepL’s general purpose engine showed that it was indeed somewhat more
                            fluent for shorter sentences. One translation stood out as quite accurate and indeed much more fluent for that
                            particular target sentence compared to Google Translate. Meanwhile, no Google-translated sentence seemed
                            unambiguously better than its DeepL counterpart.

We also noticed that in longer sentences, DeepL and Google Translate both broke down.

DeepL appears to have impressed mainstream media more than the obvious and unavoidable baseline
comparison: Google Translate. So is it the industry player to beat?

Not necessarily. The only conclusion we can draw is that DeepL’s generic NMT engine is usually better than
Google’s generic NMT engine.

So how much and what sort of data is required to create a fluent engine?

“This is the question no one has or can give a definite answer to,” said Kyunghyun Cho, Assistant Professor at New
York University, one of the pioneers in NMT research. “It depends not only on the problem (target language pairs)
or the quality of data (which is after all not even well-defined rigorously), but also on what kind of learning
algorithm is used (some learning algorithms are more robust to noise in data, while some others are more
sensitive) and what kind of model is used (some neural net architectures are more robust while some others
are more sensitive).”

From a corporate perspective, Omniscien CEO Andrew Rufener essentially agreed and said the fluency of any NMT
engine “depends very much on the domain and the language pair. The more complex the domain or the broader
and more complex the language, the more data is needed.”

It turns out “how much” data is the wrong question. It is more about “how good” the data is.

        “The general finding is that NMT systems benefit from cleaner and more diverse
        training corpora rather than a massive unfiltered corpus, which was typically best
        for phrase-based systems.”

        —Spence Green, CEO, Lilt

Tony O’Dowd, CEO of KantanMT, went further and said “there is no direct correlation between the amount of
training data and the quality of a NMT engine.” What was important, he said, was to use “highly cleansed and
aligned training data for the purposes of building successful NMT engines.”

“We find we can build NMT engines using less data than our equivalent SMT engines, with the caveat that it’s of
much higher quality,” O’Dowd added.

So training data remains important, but the quality is key.

Furthermore, adding more training data to an engine will not yield equivalent leaps in quality.

        “The learning curve of neural machine translation systems is roughly logarithmic.
        Every doubling of the training data gives a small bump in quality.”

        —Rico Sennrich, Post-doctoral researcher, University of Edinburgh
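
Sennrich's remark can be written as a rough formula. As a sketch only, assume quality q depends on the number of training sentence pairs N through a logarithmic curve, where a and b are hypothetical constants that would differ per engine and are not values given by any respondent:

```latex
q(N) \approx a + b \log_2 N
\quad\Longrightarrow\quad
q(2N) - q(N) \approx b \log_2(2N) - b \log_2 N = b
```

Under that assumption every doubling adds the same fixed increment b, so going from 1 million to 2 million sentence pairs would help about as much as going from 24 million to 48 million.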

Gauging fluency improvements on NMT is not as simple as asking how much data is needed, but there is consensus
on NMT’s edge over SMT, among other things.

                            Exceptional Capabilities of NMT

                            At all three SlatorCons in 2017, we had an NMT expert present on various aspects of this popular
                            technological trend. TextShuttle CTO Samuel Läubli’s presentation at SlatorCon Zürich in December 2017,
                            in particular, centered on the benefits of NMT compared to predecessor technology and in general.

                            Indeed, between the very first research paper on NMT on arXiv and today, a few advantages of NMT
                            over SMT have been touted.

                            [Photo: Kyunghyun Cho, Asst. Professor, NYU, at SlatorCon New York 2017]

                            1.   NMT is more fluent.

                                 In his SlatorCon presentation, Läubli explained how NMT engines consider entire sentences while SMT
                                 considers only a few words at a time, so the result is that NMT’s output is often more fluent.

                                 SMT systems would evaluate the fluency of a sentence in the target language a few words at a time using an
                                 N-gram language model, Läubli said. “If we have an N-gram model of order 3, when it generates a translation,
                                 it will always assess the fluency by looking at n-1 previous words,” he said in his presentation. “So in this case
                                 it would be two previous words.”

                                 This means that given a sentence of any length, an SMT system with a 3-gram language model will make sure
                                 every three words would be fluent together. “The context is always very local. As new words are added, we
                                 can always look back, but only to a very limited amount, basically,” Läubli said.

                                 On the other hand, NMT models use recurrent neural networks. According to Läubli, “the good thing here is
                                 that we can condition the probability of words that are generated at each position on all the previous words
                                 in that output sentence.”

                                 In essence, where SMT is limited to how many words its N-gram model dictates, NMT evaluates fluency for the
                                 entire sentence.
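
The locality Läubli describes can be illustrated with a toy trigram model. This is a minimal sketch with a made-up two-sentence corpus and hypothetical function names, not how any production SMT system is implemented. It shows that a model of order 3 gives the same probability to the next word whenever the last two words match, whatever came earlier in the sentence.

```python
from collections import Counter

def train_trigram_counts(sentences):
    """Count trigrams and their two-word contexts in tokenized sentences."""
    trigrams, contexts = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>", "<s>"] + tokens + ["</s>"]
        for i in range(2, len(padded)):
            context = (padded[i - 2], padded[i - 1])
            trigrams[context + (padded[i],)] += 1
            contexts[context] += 1
    return trigrams, contexts

def trigram_prob(word, history, trigrams, contexts):
    """P(word | history), using only the last two words of the history."""
    padded = ["<s>", "<s>"] + list(history)
    context = tuple(padded[-2:])
    if contexts[context] == 0:
        return 0.0
    return trigrams[context + (word,)] / contexts[context]

# Toy corpus, invented for illustration.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]
trigrams, contexts = train_trigram_counts(corpus)

# Both histories end in "on the", so the model cannot tell them apart.
p_after_cat = trigram_prob("mat", "the cat sat on the".split(), trigrams, contexts)
p_after_dog = trigram_prob("mat", "the dog sat on the".split(), trigrams, contexts)
print(p_after_cat, p_after_dog)
```

A recurrent NMT decoder, by contrast, conditions each word on a state that summarizes all previously generated words, so "cat" and "dog" earlier in the sentence can still influence the choice.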

                            2.   NMT makes better translation choices.

                                 Both SMT and NMT, in the simplest sense, function using numerical substitution, i.e. they replace words with
                                 numbers and then perform mathematical operations on those numbers to translate.

                                 In an extremely simplified view, SMT more or less uses arbitrary numbers, in that two related
                                 words would have numbers that are unrelated. Läubli gave two example sentences in his presentation where
                                 only one word differs but is used the same way: one sentence used “but” and the other used “except.”

                                 SMT systems, he said, would for example assign ID numbers like 9 and 2 to the two words respectively, and
                                 therefore not relate them in any way. On the other hand, NMT systems would assign values like 3.16 and 3.21,
                                 essentially placing them close together if the training data shows their use to be fairly similar.

                                 “NMT systems capture the similarity of words and can then benefit from that,” Läubli said.
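
The difference between arbitrary IDs and learned values can be made concrete with a small sketch. The IDs 9 and 2 and the values 3.16 and 3.21 come from Läubli's example; every other number below, the second dimension, and the word "coffee" are hypothetical and chosen only for illustration. Real systems learn vectors with hundreds of dimensions.

```python
import math

# IDs as in the example: arbitrary labels with no notion of closeness.
word_ids = {"but": 9, "except": 2, "coffee": 5}

# Toy two-dimensional vectors (hypothetical apart from 3.16 and 3.21).
embeddings = {
    "but": [3.16, 0.82],
    "except": [3.21, 0.79],
    "coffee": [-1.40, 2.65],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: near 1.0 means similar direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_close = cosine_similarity(embeddings["but"], embeddings["except"])
sim_far = cosine_similarity(embeddings["but"], embeddings["coffee"])
print(sim_close, sim_far)
```

With these toy values "but" and "except" come out almost identical while "but" and "coffee" do not, which is the kind of similarity an ID lookup cannot express.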

3.   NMT can choose translations that would rarely occur in training corpora.

     In Kyunghyun Cho’s presentation at SlatorCon New York, he said NMT was very robust when it came to
     spelling mistakes, and additionally, “it can actually handle many of those very rare compound words.” He
     explained the NMT system they trained can translate into compound words that rarely appear in a training
     corpus the size of 100 million words.

     “It can also handle the morphology really well,” Cho said. “It can even handle the, let’s say, ironical compound
     word.” During the presentation, he pointed to an example in his slide: “This means ‘worsening improvement’
     in German, that’s actually ironic—how is the improvement worsening? This character to character level [of
     NMT] is able to handle that perfectly.”

     Joss Moorkens, Assistant Professor at Dublin City University and post-doctoral researcher at ADAPT, said “the
     use of byte pair encoding (breaking words into sub-word chunks) helps with translation of unseen words (that
     don’t occur in the training data), but can also result in NMT neologisms – non-dictionary compound words
     that the system has created.”
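
The byte pair encoding Moorkens mentions can be sketched in a few lines. This is a simplified illustration of the general merge-learning idea on an invented toy vocabulary; the function names and word counts are assumptions, and production tools add many details omitted here.

```python
from collections import Counter

def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def learn_bpe(word_freqs, num_merges):
    """Learn merge rules by repeatedly joining the most frequent symbol pair."""
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = {tuple(apply_merge(list(s), best)): f for s, f in vocab.items()}
    return merges

def segment(word, merges):
    """Split a word, seen or unseen, into sub-word units using the learned merges."""
    symbols = list(word) + ["</w>"]
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols

# Toy word counts, invented for illustration; "lowest" is not among them.
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(toy_counts, num_merges=10)
pieces = segment("lowest", merges)
print(pieces)
```

The unseen word "lowest" is split into chunks learned from other words, which is how a system can translate words absent from training and also how it can assemble the non-dictionary compounds Moorkens describes.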

4.   NMT can automatically code-switch.

     According to Cho in his SlatorCon New York presentation, NMT “automatically learns how to handle code-
     switching inside a sentence.”

     He provided an example: “We decided to train a NMT system to translate from German, Czech, Finnish, and
     Russian to English. We’re going to give it a sentence in any of these four languages and ask the system to
     translate it into the corresponding English sentence. We didn’t give any kind of language identifier.”

     The system they trained did not need to identify the source languages. “Now, since our model is actually of
     the same size as before, we are saving around four times the parameters. Still, we get the same level or better
     performance,” Cho said. “In a human evaluation especially in terms of fluency, this model beats any single
     paired model you can think of.”

     Cho continued to elaborate on NMT’s code-switching: “Once we trained the model we decided to make up
     these kinds of sentences: it starts with German and then Czech, back to German and into Russian and ends
     with German. The system did the translation as it is, without any kind of external indication which part of the
     sentence was written in which language.”

5.   NMT reduces post-editing effort.

     In his SlatorCon Zürich presentation, Läubli said NMT reduces post-editing effort by about 25%.

     Rufener, CEO of Omniscien, and Tinsley, CEO of Iconic, both agreed that there would be significant gains in
     productivity given high quality MT engines, though the former cautioned that it again depends on the use-
     case and the latter “will not hazard to suggest a number.”

     O’Dowd, CEO of KantanMT, turned to client use-cases: “Our clients are finding that high fluency clearly leads
     to business advantages. Since the vast majority of MT output is not post-edited, high fluency will lead to high
     adequacy which in-turn leads to high usability. This means that the translations are closer to 'fit-for-purpose'
     scenarios than ever previously experienced using SMT approaches. Instant-chat, Technical and Customer
     Support portals, Knowledge bases, these can all now be translated at levels of quality previously unheard of.”

                            Current Limitations of NMT

                            So NMT has exceptional qualities compared to SMT. In his SlatorCon London presentation, however, John Tinsley, CEO of
                            Iconic, wanted to set expectations straight, pointing out that it is best for the industry at large to not focus only on the hype.

                            What are NMT’s limitations and how exactly do they impact possible business applications?

                            1.   A major limitation is handling figures of speech, metaphors, hyperbole, irony,
                                 and other ambiguous semantic elements of language.

                                 When Slator talked to inbound marketing company HubSpot for a feature article and asked them about MT,
                                 Localization Manager Chris Englund said they do not use MT, and it was, in fact, a taboo topic. The reason was
                                 simple: marketing required much more creative translation—more transcreation than translation in many places.
                                 So MT at its current level of handling ambiguous semantic elements of language was not an option for them.

                                 In January 2018, Slator published an article on a research paper by Antonio Toral, Assistant Professor at the
                                 University of Groningen, and Andy Way, Professor in Computing and Deputy Director of the EU’s ADAPT Centre
                                 for Digital Content Technology. Their paper is titled “What Level of Quality can Neural Machine Translation
                                 Attain on Literary Text?”

                                 They found that the NMT engine they trained on 100 million words of literary text was able to produce
                                 translations that were equal to professional human translation about a fifth to a third of the time.

                                 NMT, like SMT before it, depends on statistical estimation from a parallel corpus of training
                                 data. So when NMT encounters an idiom, for instance, with words or phrases used in a way that contradicts how
                                 similar words or phrases are more commonly used, it will have a difficult time translating it properly.

                                 “It is a core property of all data-driven approaches to machine translation that they learn and reproduce
                                 patterns that they see in the translations that are used for training,” Sennrich said.

                                           “Creative language will remain a challenge because it by definition breaks with
                                           the most common patterns in language.”

                                           —Rico Sennrich, Post-doctoral researcher, University of Edinburgh.

                                           “NMT systems only learn from what’s in the training data, and while the
                                           addition of context for each word (based on training data and words produced
                                           so far) is a valuable addition, the process is hardly transcreation.”

                                           —Joss Moorkens, post-doctoral researcher, ADAPT Center and Lecturer at Dublin City University.

                                 Moorkens pointed out, however, that it does not necessarily mean NMT output cannot be useful for translating
                                 creative texts, referring to Toral and Way’s research.

                                 Cho was optimistic that it was at least not impossible.

                                 “As long as such phenomena occur in a corpus, a better learning algorithm with a better model would be able to
                                 capture them. It is just a matter of how we advance our science and find those better algorithms and models,” he said.

2.   Despite the fact that NMT is more fluent than SMT, more complex,
     longer sentences still suffer poorer output.

     Any MT system, regardless of technology or model, stumbles on longer sentences. Indeed, several of the experts
     we spoke to explained that translating longer sentences is just more difficult (even for humans), period.

             “Consider translating the sentence ‘Peter saw Mary’ to French. You can
             probably enumerate all possible French translations of that sentence. Now
             write down a 40-word sentence and try to enumerate all of the possible
             translations. Longer inputs present harder search and modeling problems.”

             —Spence Green, CEO, Lilt

     Samuel Läubli pointed out that NMT quality only degenerates noticeably for very long sentences of over 60
     words. He explained that NMT systems have no “intuition” of sentence length; “translations of long input
     sentences are often too short,” he said. This is the same point Tinsley brought up in his SlatorCon London
     presentation. Sometimes NMT engines would translate a 30-word Chinese sentence into just six English
     words, he said.

     “Current research focuses on modelling coverage (Tu et al., 2016), i.e., making sure that all words of the input
     are covered in the output,” Läubli said.
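
The notion of coverage can be illustrated with a simple diagnostic. This is not the model proposed by Tu et al., only a sketch of the underlying idea: add up the attention each source word receives across all output steps and flag words that received very little. The attention matrix, the threshold and the function names are hypothetical.

```python
def coverage(attention):
    """Sum attention over output steps to get one coverage value per source word.

    attention[t][i] is the weight the decoder put on source word i while
    producing output word t; each row is assumed to sum to 1.
    """
    source_len = len(attention[0])
    return [sum(step[i] for step in attention) for i in range(source_len)]

def uncovered_words(source_tokens, attention, threshold=0.3):
    """Return source words whose total attention falls below `threshold`."""
    totals = coverage(attention)
    return [tok for tok, c in zip(source_tokens, totals) if c < threshold]

# Hypothetical case: only 2 output words were produced for 4 source words.
source = ["das", "alte", "haus", "brennt"]
attention = [
    [0.80, 0.05, 0.10, 0.05],
    [0.10, 0.05, 0.80, 0.05],
]
flagged = uncovered_words(source, attention)
print(flagged)
```

In this toy case two source words are barely attended to, which corresponds to the "too short" translations described here.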

     Jean Senellart compared the problem of very long sentences to the problem of translating a single sentence
     without the context of the entire document. “We don’t have yet the ability to translate a sentence given the
     context of the document. A very long sentence is generally multiple sentences put together,” he said.

     Diego Bartolome said they have achieved improved performance on longer sentences by applying “segmentation
     techniques” to the source sentence, which is one of the ways SMT output for longer sentences was improved
     in the past.

     There is ongoing research focusing on this specific limitation, according to Cho and Sennrich. Sennrich, however,
     noted that “currently, standard test sets don’t strongly incentivize improving performance on long sentences, and
     most neural systems do not even use long sentences in training for efficiency reasons.”

     With the way NMT research has accelerated, however, only time will tell whether there will be more research
     specifically in the translation fluency of longer sentences.

     [Photo: Samuel Läubli, CTO, TextShuttle, at SlatorCon Zurich 2017]

     Using additional layers of technology on NMT engines can help alleviate the issue. “Attention mechanisms”
     and other hidden layers applied to the model can improve quality, but not completely resolve the problem.
     Also, they can come at the cost of computing power.

     Omniscien CEO Rufener pointed out that hybridization of NMT and SMT can also circumvent the problem.

                            3.   Terminology accuracy may take a hit.

                                 Research into terminology accuracy and input from our respondents confirm that NMT can indeed
                                 sometimes be less accurate when it comes to terminology, and part of it is due to its nature. It is less
                                 consistent and predictable than SMT (which is also a factor that allows it to potentially come up with better
                                 translations and choose rare compound words). Additionally when training SMT models, you can explicitly
                                 force them to learn terminology, which is a trickier concept for NMT.

                                 “General disambiguation depending on a context is problematic,” according to Maja Popovic, Researcher at
                                 the German Research Centre for Artificial Intelligence (DFKI) Language Technology Lab, Berlin, Germany. She
                                 said in the example “I would give my right arm for this,” the word “right” can be translated as “correct” instead
                                 of “opposite of left.”

                                          “Naturally, NMT engines are very good at learning the structure of the language
                                          and are missing the simple ability to memorize a list of words while SMT is (only)
                                          good at that.”

                                          —Jean Senellart, CTO, Systran

                                 However, there are workarounds to this limitation. One way is applying a “focused attention” mechanism to
                                 the NMT decoder to constrain how the engine translates specific words, and indicate which words those are
                                 through user dictionaries with specific terminology. This is called constrained decoding.

                                 Läubli provides an example: “If «Coffee» is translated as «Kaffee» in our termbase, we force the NMT system to
                                 include «Kaffee» in the German output for any English input containing «Coffee».” He noted, however, that the
                                 approach can be slow, and gets slower the more output constraints are added.

                                 “Accuracy on the terminology is also a matter of the specificity of an MT engine,” Diego Bartolome added.
                                 “We have trained a client-specific engine with 20 million words with NMT, and there are no issues with
                                 terminology; it is indeed accurate.”

                                 Domain adaptation will indeed improve terminology accuracy with relevant training data, according to Sennrich.
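
The termbase example from Läubli can be approximated with a much simpler stand-in than constrained decoding: checking a list of candidate translations against the termbase and choosing the first one that complies. To be clear, this n-best filtering is a swapped-in illustration and not constrained decoding itself, which enforces the terms while the output is being generated. The function names, the candidate list and the naive substring matching are all assumptions.

```python
def satisfies_termbase(source, hypothesis, termbase):
    """True if every termbase entry triggered by the source appears in the hypothesis."""
    src, hyp = source.lower(), hypothesis.lower()
    return all(
        target.lower() in hyp
        for term, target in termbase.items()
        if term.lower() in src
    )

def pick_translation(source, nbest, termbase):
    """Return the best-ranked hypothesis that honors the termbase, else the top one."""
    for hypothesis in nbest:
        if satisfies_termbase(source, hypothesis, termbase):
            return hypothesis
    return nbest[0]

termbase = {"coffee": "Kaffee"}
source = "I would like a coffee."
# Hypothetical ranked candidates from an MT engine, best first.
nbest = ["Ich hätte gern einen Mokka.", "Ich hätte gern einen Kaffee."]
chosen = pick_translation(source, nbest, termbase)
print(chosen)
```

Filtering can only choose among candidates the engine already produced, which is one reason approaches that build the constraint into decoding exist despite the speed cost Läubli notes.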

                            4.   NMT is a “black box”.

                                 The “black box” problem is not quite clear-cut. The supposed problem is that because NMT components
                                 are trained together, when there is a problem in translation output, it is harder to tell where the problem
                                 originated. This means harder debugging and customization.

                                 “This is correct to an extent,” said Tinsley. “In NMT, if we have an issue, there aren’t many places to look, aside
                                 from the data on which the model was trained, and maybe some of the parameters. Aside from that, it’s a
                                 black box. With SMT we can see exactly how and why a particular translation was generated, which might
                                 point to a component in the process that we could modify.”

                                 Andrew Rufener was inclined to disagree with the problem statement: “No, this is not correct. The neural
                                 model is definitely more complex and does not allow the same level of control in every single step as
                                 the statistical model does. At the same time, neural is not a black box, the same way statistical isn't. The
                                 mechanisms for control however, are different and require different approaches and it is definitely correct to
                                 state that there is less control than with statistical machine translation.”

“There is no doubt that the customization of internal components (such as Named entities, placeholders,
tags) is more challenging for NMT systems when compared to SMT systems,” said Tony O’Dowd. He added,
however, that SMT faced similar challenges that were resolved as the technology matured.

The consensus among the experts we talked to seems to be that yes, NMT is more difficult to debug and
customize because individual components cannot be controlled to the same degree, but this is not a big problem.

It appears to be more a matter of changing how the industry debugs and customizes, compared with what it is
used to in SMT.

Senellart pointed out that “in opposition to previous technology, neural networks absorb individual feedback
very easily and can learn from it quite reliably. To fix a neural network, we just need to teach it how to correct
some mis-translation.”

Sennrich concurred: “Changes to a neural system typically involve retraining of the end-to-end model.”

Pavel Levin said “at this point we need to combine NMT with other NLP techniques precisely
to be able to control the errors on particularly sensitive parts of text (distances, times, amounts, etc.).”

As for any impact this “black box” problem presents to production-level NMT, Läubli saw none too significant:
“From what I’ve seen in the localization industry, MT engines were mostly ‘customized’ through preprocessing
of the input or post processing of the output. The same is totally possible with NMT systems.”
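
The pre- and post-processing approach mentioned here, applied to the sensitive items Levin lists, can be sketched as masking numbers before translation and restoring them afterwards. This is a minimal illustration under stated assumptions: the placeholder format, the regular expression and the function names are hypothetical, and `translate_stub` is a hard-coded stand-in for a real MT call. Locale-specific number formatting is deliberately left out.

```python
import re

# Matches integers and simple decimals or times such as 14:00 or 2.5.
NUMBER = re.compile(r"\d+(?:[.,:]\d+)*")

def mask_numbers(text):
    """Replace each number with a placeholder and remember the originals."""
    values = []

    def repl(match):
        values.append(match.group(0))
        return f"__NUM{len(values) - 1}__"

    return NUMBER.sub(repl, text), values

def restore_numbers(text, values):
    """Put the original numbers back into the translated text."""
    return re.sub(r"__NUM(\d+)__", lambda m: values[int(m.group(1))], text)

def translate_stub(text):
    # Stand-in for a real MT call: a hard-coded toy English-to-German lookup.
    toy = {
        "Check-in from __NUM0__, __NUM1__ km from the airport.":
            "Check-in ab __NUM0__ Uhr, __NUM1__ km vom Flughafen entfernt.",
    }
    return toy.get(text, text)

masked, values = mask_numbers("Check-in from 14:00, 2.5 km from the airport.")
output = restore_numbers(translate_stub(masked), values)
print(masked)
print(output)
```

Because the engine only ever sees placeholders, it cannot alter the time or the distance, whatever else it does to the sentence.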

Senellart shared the same sentiment. “The limitation is really virtual—some people are afraid of the fact that
this means the output is not predictable so we cannot rely on it,” he said. “In terms of customization, I believe
neural networks are actually the easiest to customize.”

Sennrich also thought neural networks are easier to customize due to the high sensitivity to training
data provided. He ventured that for any limitation encountered due to opacity of trained NMT models,
customization can then be approached through the training data, pre- and post-processing, and
hybridization of MT.

O’Dowd shared how early on in their NMT efforts at KantanMT, these limitations gave them trouble working
with marked-up content. 18 months down the road, however, “we now have these challenges under control
and to a great extent resolved,” he said.

Moorkens described how a similar issue was addressed in the Translation for Massive Open Online
Courses (TraMOOC) project: “Our University of Edinburgh partners in the TraMOOC project have had some
success in adapting NMT systems to the MOOC/educational domain, using general data (which will help with
new input) and a smaller amount of domain-specific data, and using transfer learning for domain adaptation.”


What’s Next in NMT

                            Key Takeaways

                            ▶    As NMT’s fluency outmatches predecessor technology, there is a need for new, comprehensive
                                 methods for defining quality and quality gains in research, as well as a means to efficiently
                                 combine human evaluation with automated processes into new quality standards.

                            ▶    Meanwhile, as NMT highlights the need for high quality training data, not only is there a
                                 growing niche market of companies commoditizing parallel corpora, there is also the
                                 simultaneous issue of guaranteeing quality and relevance in any training data set.

                            ▶    Finally, training data is just one of many “exciting” research directions in the bustling
                                 academic research scene, which also include using pivot languages and zero-shot translation
                                 to address low-resource language NMT.

                            As NMT increasingly becomes a standard technology across all areas of the language industry, it brings with it
                            a few sweeping changes that are already emerging.

                            How Do You Quantify Quality?

                            Since NMT is dependent on the quality of the training data more so than its quantity, the question of what
                            constitutes “quality” has become a focus.

                            For instance, most research on NMT uses BLEU (bilingual evaluation understudy) for scoring the quality of
                            translation output. The problem is that BLEU, an automated algorithm-based metric, does not necessarily
                            reflect actual translation fluency. The limited applications of BLEU are further strained as NMT output becomes
                            increasingly more fluent than SMT output.

                            “BLEU reached its limit for any translation, not only NMT,” said Maja Popovic. All of the experts we talked to agree:
                            BLEU remains useful as a yardstick for measuring the rapid advance of MT in terms of quality, but in terms of
                            actually gauging fluency, it leaves much to be desired. Läubli pointed out, however, that BLEU was intended to
                            take multiple reference translations, and if it were used that way, it would not be as problematic.
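Läubli's point about multiple references can be illustrated with the clipped n-gram precision at the core of BLEU. The sketch below is a simplified illustration, not a full BLEU implementation: it omits the brevity penalty and the geometric mean over n-gram orders, it tokenizes on whitespace, and the function names and example sentences are our own assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of length n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hypothesis, references, n=1):
    """Clipped n-gram precision of a hypothesis against one or more references.

    Each hypothesis n-gram is credited at most as many times as it occurs
    in the single reference where it is most frequent.
    """
    hyp_counts = ngrams(hypothesis.split(), n)
    if not hyp_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref.split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
    return matched / sum(hyp_counts.values())


# Hypothetical example sentences
hypothesis = "the cat sat on the rug"
ref_a = "the cat sat on the mat"
ref_b = "a cat was sitting on the rug"

one_ref = clipped_precision(hypothesis, [ref_a])
two_refs = clipped_precision(hypothesis, [ref_a, ref_b])
```

With a single reference, the valid word "rug" is penalized; adding a second reference that contains it raises the score, which is why a single-reference setup tends to understate acceptable variation.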

                            BLEU is a staple tool for academic research because it shows, in a fast and predictable manner, how far the latest
                            findings have progressed from previous ones, but it is definitely not “industry-proof.”

What’s Next in NMT    

Replacing BLEU
So while BLEU has limited uses, various researchers and industry peers have been trying out different methods of
gauging fluency.

Läubli said metrics like METEOR would be able to reward synonym use in translation output. Popovic expressed
confidence in character-based scores “such as BEER, chrF and characTER… for their potential for MT evaluation.”
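To give a sense of how a character-based score works, here is a minimal sketch in the spirit of chrF: an F-score over character n-grams. It is a simplified illustration rather than the reference implementation. The handling of whitespace, the maximum n-gram order of 6, and the recall weight beta of 2 are assumptions made for this example.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams, ignoring spaces (a simplifying assumption)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf_sketch(hypothesis, reference, max_n=6, beta=2.0):
    """Character n-gram F-score averaged over n-gram orders 1..max_n."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # text too short for this n-gram order
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta: beta > 1 weights recall more heavily than precision
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


# Hypothetical example: a near match still earns partial credit
partial = chrf_sketch("the cat sits", "the cat sat")
```

Because matching happens below the word level, an inflected or slightly misspelled word still shares most of its character n-grams with the reference, which a word-based metric would count as a complete miss.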

O’Dowd said they use a character-based Perplexity scoring mechanism used in conjunction with F-Measure and
TER that provides “a decent (albeit, not always 100% accurate at this point in time) barometer” of quality. He
added though that since these are machine generated scores, they can only be used during the developmental
phase of an NMT engine.

Bartolome said since they use MT for post-editing, they use Levenshtein distance or edit-distance: “we could
enrich it by measuring the ‘complexity’ associated with changes, but it’s not an easy task.”
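As an illustration of the edit-distance approach Bartolome describes, the sketch below computes the Levenshtein distance between raw MT output and its post-edited version at the word level, then normalizes it by the length of the post-edited text. The normalization and the function names are assumptions for this example, not a description of any company's actual tooling.

```python
def levenshtein(source, target):
    """Minimum number of insertions, deletions and substitutions
    needed to turn `source` into `target` (strings or token lists)."""
    previous = list(range(len(target) + 1))
    for i, s_item in enumerate(source, start=1):
        current = [i]
        for j, t_item in enumerate(target, start=1):
            cost = 0 if s_item == t_item else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def post_edit_rate(mt_output, post_edited):
    """Word-level edit distance divided by the post-edited length."""
    mt_tokens, pe_tokens = mt_output.split(), post_edited.split()
    if not pe_tokens:
        return 0.0 if not mt_tokens else 1.0
    return levenshtein(mt_tokens, pe_tokens) / len(pe_tokens)


# Hypothetical segment: the post-editor inserted one word
rate = post_edit_rate("the house is blue", "the house is very blue")
```

A lower rate means the post-editor changed less. As Bartolome notes, this treats every edit as equally costly and does not capture how complex each change was.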

Silvio Picinini, Machine Translation Language Specialist at eBay, also said edit-distance is a decent metric. He added
that time spent by post-editors and keystrokes ratio may be interesting. “Interactive MT (which predicts the next words)
seem to have been looking at Word Prediction Accuracy (WPA) or nextword accuracy (Koehn et al., 2014),” he said.

Of course, there is a catch in that all these methods require post-editing work, so there’s still a human element.

Human Evaluation Remains the Ultimate Standard

        “A well-designed blind human evaluation
        remains the most trusted quality
        assessment approach.”

        —Mihai Vlad, VP of Machine Learning Solutions, SDL

All the experts we asked agree that human evaluation is
the definitive metric for fluency. There is no way around it.

BLEU combined with human assessments is an option,
according to Kirti Vashee. Spence Green and Jean
Senellart agreed that BLEU should be complemented
with human evaluation.

Picinini added that crowdsourced human evaluation may be a feasible approach as it is “a cheap, fast and accurate
way to evaluate quality.” He added that it also would accommodate quality levels for different purposes of MT
in terms of content and audiences, factors that affect quality expectations.

Photo: Spence Green, CEO, Lilt - SlatorCon London 2017

“Human assessment remains the best way of evaluating machine translation that is at our disposal,” Sennrich
said, noting that the research community regularly performs shared translation tasks with human evaluation.

                            Creating a New Quality Standard

                            So if BLEU is not industry-proof, and human evaluation, while the ultimate standard, does not scale well,
                            then from a holistic point of view, how does one go about assessing the quality of NMT output?

                            Yannis Evangelou, Founder and CEO of linguistic QA company LexiQA, illustrated a process for NMT split into three
                            stages: pre-translation, machine translation and post-editing.

                            Pre-translation includes preparing the training data, which needs to be of the highest quality, as well as cleaning up
                            legacy translation memories used as a reference corpus. The machine translation process encompasses the engine’s
                            decoding/encoding process. The post-editing stage includes revision, quality assurance, and quality assessment.

                            “The latter two could take place twice,” Evangelou said, referring to quality assurance and assessment. “During
                            the quality assurance stage, each segment would be checked for various error classes, especially with a view to
                            addressing locale-specific conversions; at this point, the quality assurance engine could provide suggestions for
                            the user (e.g., alternative date notations in the target locale).”

                            Evangelou continued: “Following that, an initial quality assessment could take place (the post-editor would then
                            add annotations, spot false negatives and reject false positives; this way the post-editing effort would also be
                            calculated). As soon as the revision is complete, a second quality assurance check should automatically run to
                            make sure that no new errors have been introduced by the post-editor.”

                            By the end of the process, the overall quality assessment can include the initial NMT output, post-editing effort,
                            and final output, which is the combined MT and human revision. Each part would be assigned a different weight, and
                            the weighted average would form the total score.
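The weighted scoring described above can be sketched as follows. The component names, the weights, and the assumption that every component is normalized to a 0 to 1 scale where higher is better (so post-editing effort would need to be inverted) are all hypothetical; the report does not specify them.

```python
def overall_quality(scores, weights):
    """Weighted average of component quality scores.

    `scores` and `weights` are dicts keyed by component name.
    Scores are assumed to lie in [0, 1], higher meaning better.
    """
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same components")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Hypothetical values for one translation job
scores = {"nmt_output": 0.6, "post_editing_effort": 0.8, "final_output": 0.9}
weights = {"nmt_output": 0.2, "post_editing_effort": 0.3, "final_output": 0.5}

total = overall_quality(scores, weights)  # approximately 0.81
```

Dividing by the sum of the weights means they do not have to add up to 1, so a team could express them as percentages or relative priorities.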

                            New and Existing NMT QA Processes

                            Aside from computer generated scores during the developmental phase of an NMT engine, O’Dowd said they also
                            employ a proprietary new platform in KantanMT so professional translators can rate the quality of MT systems.


For Mihai Vlad of SDL, automatic metrics should indeed be validated by human assessment. “Equally, quality has
to be defined in the context of the scenario being used,” he said, offering some examples:
•    Quality for post-editing is measured by the translator being more productive.

•    Quality for multilingual eDiscovery is measured by the accuracy of identifying the right documents.

•    Quality for multilingual text analytics is measured by the effectiveness of the analyst in identifying the
     relevant information.

•    Quality for multilingual chat is measured by the feedback rating of the end customer.

Senellart shared the sentiment, agreeing that since MT is always connected to some use case, “the final evaluation
always needs to be connected to this use case.”

Kyunghyun Cho, speaking from the academic side, also highlighted the need for use-case specifics in quality
evaluation: “For instance, the quality of MT for dialogues would need to be measured by how well it facilitates
dialogue between participants of different native languages.”

         “I believe the quality would need to be defined with respect to a downstream task in
         which translation affects its performance.”

         —Kyunghyun Cho, Asst. Professor, New York University

Levin believes that in the near future, the standardization of NMT quality assurance might be as fragmented as
demand: “We will be seeing practitioners rolling out their own metrics which are more relevant to their problems
(e.g. metrics related to handling of particular named entities, scores from custom QA systems, potentially machine
learning based, etc.) and use several of them in combinations.”

He added that he expects industry players to streamline inevitable human evaluation loops either through in-house
resources, as they do in, or external services such as lexiQA or the TAUS DQF framework.

Levin also explained that in’s case, they use a business sensitivity framework (BSF). In their research
paper, Levin and his co-authors write: “One important shortcoming of the BLEU score is that it says nothing about
the so-called ‘business sensitive’ errors. For example, the cost of mistranslating ‘Parking is available’ to mean
‘There is free parking’ is much greater than a minor grammatical error in the output.”

BSF is a two-stage system. It first identifies sentences that may contain business sensitive aspects and evaluates
whether they are translated properly with respect to business sensitive information. It then flags problematic NMT output.
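The two-stage idea can be illustrated with a toy, rule-based sketch. This is not the authors' system, whose internals the report does not describe. The English-to-German rule table, the function names, and the substring matching below are all hypothetical and serve only to show the flow: detect sensitive sentences, then check whether the sensitive claim survived translation.

```python
import re

# Hypothetical rules: source trigger word -> (source qualifier,
# target-language fragments that assert the same qualifier)
SENSITIVE_RULES = {
    "parking": ("free", ("kostenlos", "gratis")),
    "breakfast": ("included", ("inklusive", "inbegriffen")),
}


def tokens(text):
    """Lowercased set of words in a sentence."""
    return set(re.findall(r"\w+", text.lower()))


def sensitive_triggers(source):
    """Stage 1: find business-sensitive topics mentioned in the source."""
    source_tokens = tokens(source)
    return [trigger for trigger in SENSITIVE_RULES if trigger in source_tokens]


def flag_translation(source, target):
    """Stage 2: flag topics whose qualifier was added or dropped in translation."""
    source_tokens = tokens(source)
    target_text = target.lower()
    problems = []
    for trigger in sensitive_triggers(source):
        qualifier, target_markers = SENSITIVE_RULES[trigger]
        source_has = qualifier in source_tokens
        target_has = any(marker in target_text for marker in target_markers)
        if source_has != target_has:
            problems.append(trigger)
    return problems


# The report's example: "available" mistranslated as "free"
flagged = flag_translation("Parking is available", "Es gibt kostenlose Parkplätze")
clean = flag_translation("Parking is available", "Parkplätze sind vorhanden")
```

A production system would presumably need far more robust detection than keyword rules, but the sketch shows why such errors escape BLEU: the flagged output differs from a correct one by a single word.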

Ultimately, the language services market will most probably not take a single route when it comes to performance
benchmarking, quality assessment, and applying metrics to NMT.

“We do not foresee this industrywide,” Rufener said. “It may however happen for particular verticals.”

Diego Bartolome from Transperfect offered some examples: “In e-commerce, a metric is more related to conversion
rates. For SEO, number of views. For regulatory, no critical errors. That should probably be the way to go.”

User satisfaction should also be taken into account, according to Picinini. “This could be as simple as a ‘Was this translation
helpful to you?’ with a Yes/No answer.” Google and Facebook actively employ this method in their NMT systems today.
