On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation

Wei Zhao†, Goran Glavaš‡, Maxime PeyrardΦ, Yang Gao§, Robert WestΦ, Steffen Eger†
† Technische Universität Darmstadt, Germany  ‡ University of Mannheim, Germany
Φ EPFL, Switzerland  § Royal Holloway University of London, UK
{zhao,eger}@aiphes.tu-darmstadt.de
goran@informatik.uni-mannheim.de, yang.gao@rhul.ac.uk
{maxime.peyrard,robert.west}@epfl.ch
Abstract

Evaluation of cross-lingual encoders is usually performed either via zero-shot cross-lingual transfer in supervised downstream tasks or via unsupervised cross-lingual textual similarity. In this paper, we concern ourselves with reference-free machine translation (MT) evaluation, where we directly compare source texts to (sometimes low-quality) system translations, which represents a natural adversarial setup for multilingual encoders. Reference-free evaluation holds the promise of web-scale comparison of MT systems. We systematically investigate a range of metrics based on state-of-the-art cross-lingual semantic representations obtained with pretrained M-BERT and LASER models. We find that they perform poorly as semantic encoders for reference-free MT evaluation and identify their two key limitations, namely, (a) a semantic mismatch between representations of mutual translations and, more prominently, (b) the inability to punish "translationese", i.e., low-quality literal translations. We propose two partial remedies: (1) post-hoc re-alignment of the vector spaces and (2) coupling of semantic-similarity based metrics with target-side language modeling. In segment-level MT evaluation, our best combined metric surpasses the reference-based BLEU by 5.7 correlation points. We make our MT evaluation code available.1

1 https://github.com/AIPHES/ACL20-Reference-Free-MT-Evaluation

1 Introduction

A standard evaluation setup for supervised machine learning (ML) tasks assumes an evaluation metric which compares a gold label to a classifier prediction. This setup assumes that the task has clearly defined and unambiguous labels and, in most cases, that an instance can be assigned few labels. These assumptions, however, do not hold for natural language generation (NLG) tasks like machine translation (MT) (Bahdanau et al., 2015; Johnson et al., 2017) and text summarization (Rush et al., 2015; Tan et al., 2017), where we do not predict a single discrete label but generate natural language text. Thus, the set of labels for NLG is neither clearly defined nor finite. Yet, the standard evaluation protocols for NLG still predominantly follow the described default paradigm: (1) evaluation datasets come with human-created reference texts and (2) evaluation metrics, e.g., BLEU (Papineni et al., 2002) or METEOR (Lavie and Agarwal, 2007) for MT and ROUGE (Lin and Hovy, 2003) for summarization, count the exact "label" (i.e., n-gram) matches between reference and system-generated text. In other words, established NLG evaluation compares semantically ambiguous labels from an unbounded set (i.e., natural language texts) via hard symbolic matching (i.e., string overlap).

The first remedy is to replace the hard symbolic comparison of natural language "labels" with a soft comparison of texts' meaning, using semantic vector space representations. Recently, a number of MT evaluation methods appeared that focus on semantic comparison of reference and system translations (Shimanaka et al., 2018; Clark et al., 2019; Zhao et al., 2019). While these correlate better than n-gram overlap metrics with human assessments, they do not address the inherent limitations stemming from the need for reference translations, namely: (1) references are expensive to obtain; (2) they assume a single correct solution and bias the evaluation, both automatic and human (Dreyer and Marcu, 2012; Fomicheva and Specia, 2016); and (3) they restrict MT evaluation to language pairs with available parallel data.

Reliable reference-free evaluation metrics, directly measuring the (semantic) correspondence between the source language text and system translation, would remove the need for human references and allow for unlimited MT evaluations: any
monolingual corpus could be used for evaluating MT systems. However, proposals of reference-free MT evaluation metrics have been few and far apart and have required either non-negligible supervision (i.e., human translation quality labels) (Specia et al., 2010) or language-specific preprocessing like semantic parsing (Lo et al., 2014; Lo, 2019), both hindering the wide applicability of the proposed metrics. Moreover, they have also typically exhibited performance levels well below those of standard reference-based metrics (Ma et al., 2019).

In this work, we comparatively evaluate a number of reference-free MT evaluation metrics that build on the most recent developments in multilingual representation learning, namely cross-lingual contextualized embeddings (Devlin et al., 2019) and cross-lingual sentence encoders (Artetxe and Schwenk, 2019). We investigate two types of cross-lingual reference-free metrics: (1) soft token-level alignment metrics find the optimal soft alignment between source sentence and system translation using Word Mover's Distance (WMD) (Kusner et al., 2015). Zhao et al. (2019) recently demonstrated that WMD operating on BERT representations (Devlin et al., 2019) substantially outperforms baseline MT evaluation metrics in the reference-based setting. In this work, we investigate whether WMD can yield comparable success in the reference-free (i.e., cross-lingual) setup. (2) Sentence-level similarity metrics measure the similarity between sentence representations of the source sentence and system translation using cosine similarity.

Our analysis yields several interesting findings. (i) We show that, unlike in the monolingual reference-based setup, metrics that operate on contextualized representations generally do not outperform symbolic matching metrics like BLEU, which operate in the reference-based environment. (ii) We identify two reasons for this failure: (a) firstly, a cross-lingual semantic mismatch, especially for multilingual BERT (M-BERT), which construes a shared multilingual space in an unsupervised fashion, without any direct bilingual signal; (b) secondly, the inability of the state-of-the-art cross-lingual metrics based on multilingual encoders to adequately capture and punish "translationese", i.e., literal word-by-word translations of the source sentence—as translationese is an especially persistent property of MT systems, this problem is particularly troubling in our context of reference-free MT evaluation. (iii) We show that by executing an additional weakly-supervised cross-lingual re-mapping step, we can to some extent alleviate both previous issues. (iv) Finally, we show that the combination of cross-lingual reference-free metrics and language modeling on the target side (which is able to detect "translationese") surpasses the performance of reference-based baselines.

Beyond designating a viable prospect of web-scale, domain-agnostic MT evaluation, our findings indicate that the challenging task of reference-free MT evaluation is able to expose an important limitation of current state-of-the-art multilingual encoders, i.e., the failure to properly represent corrupt input, which may go unnoticed in simpler evaluation setups such as zero-shot cross-lingual text classification or cross-lingual text similarity measurement that does not involve such "adversarial" conditions. We believe this is a promising direction for improving cross-lingual representations, together with the recent benchmark that focuses on zero-shot transfer scenarios (Hu et al., 2020).

2 Related Work

Manual human evaluations of MT systems undoubtedly yield the most reliable results, but they are expensive, tedious, and generally do not scale to a multitude of domains. A significant body of research is thus dedicated to the study of automatic evaluation metrics for machine translation. Here, we provide an overview of both reference-based MT evaluation metrics and recent research efforts towards reference-free MT evaluation, which leverage cross-lingual semantic representations and unsupervised MT techniques.

Reference-based MT evaluation. Most of the commonly used evaluation metrics in MT compare system and reference translations. They are often based on surface forms such as n-gram overlap, like BLEU (Papineni et al., 2002), SentBLEU, NIST (Doddington, 2002), chrF++ (Popović, 2017) or METEOR++ (Guo and Hu, 2019). They have been extensively tested and compared in recent WMT metrics shared tasks (Bojar et al., 2017a; Ma et al., 2018a, 2019).

These metrics, however, operate at the surface level and by design fail to recognize semantic equivalence lacking lexical overlap. To overcome these limitations, some research efforts exploited static word embeddings (Mikolov et al., 2013b) and trained embedding-based supervised metrics on sufficiently large datasets with available human judgments of translation quality (Shimanaka
et al., 2018). With the development of contextual word embeddings (Peters et al., 2018; Devlin et al., 2019), we have witnessed proposals of semantic metrics that account for word order. For example, Clark et al. (2019) introduce a semantic metric relying on sentence mover's similarity and contextualized ELMo embeddings (Peters et al., 2018). Similarly, Zhang et al. (2019) describe a reference-based semantic similarity metric based on contextualized BERT representations (Devlin et al., 2019). Zhao et al. (2019) generalize this line of work with their MoverScore metric, which computes the mover's distance, i.e., the optimal soft alignment between tokens of the two sentences, based on the similarities between their contextualized embeddings. Mathur et al. (2019) train a supervised BERT-based regressor for reference-based MT evaluation.

Reference-free MT evaluation. Recently, there has been a growing interest in reference-free MT evaluation (Ma et al., 2019), also referred to as "quality estimation" (QE) in the MT community. In this setup, evaluation metrics semantically compare system translations directly to the source sentences. The attractiveness of automatic reference-free MT evaluation is obvious: it does not require any human effort or parallel data. To approach this task, Popović et al. (2011) exploit a bag-of-words translation model to estimate translation quality, which sums over the likelihoods of aligned word pairs between source and translation texts. Specia et al. (2013) estimate translation quality using language-agnostic linguistic features extracted from source language texts and system translations. Lo et al. (2014) introduce XMEANT as a cross-lingual reference-free variant of MEANT, a metric based on semantic frames. Lo (2019) extended this idea by leveraging M-BERT embeddings. The resulting metric, YiSi-2, evaluates system translations by summing similarity scores over word pairs that are best-aligned mutual translations. YiSi-2-SRL optionally adds a similarity score based on the alignment over semantic structures (e.g., semantic roles and frames). Both metrics are reference-free, but YiSi-2-SRL is not resource-lean as it requires a semantic parser for both languages.

Recent progress in cross-lingual semantic similarity (Agirre et al., 2016; Cer et al., 2017) and unsupervised MT (Artetxe and Schwenk, 2019) has also led to novel reference-free metrics. For instance, Yankovskaya et al. (2019) propose to train a metric combining multilingual embeddings extracted from M-BERT and LASER (Artetxe and Schwenk, 2019) together with log-probability scores from neural machine translation. Our work differs from that of Yankovskaya et al. (2019) in one crucial aspect: the cross-lingual reference-free metrics that we investigate and benchmark do not require any human supervision.

Cross-lingual Representations. Cross-lingual text representations offer a prospect of modeling meaning across languages and support cross-lingual transfer for downstream tasks (Klementiev et al., 2012; Rücklé et al., 2018; Glavaš et al., 2019; Josifoski et al., 2019; Conneau et al., 2020). Most recently, (massively) multilingual encoders such as multilingual BERT (M-BERT) (Devlin et al., 2019), XLM-on-RoBERTa (Conneau et al., 2020), and the (sentence-based) LASER have established themselves as state-of-the-art solutions for (massively) multilingual semantic encoding of text. While LASER has been jointly trained on parallel data of 93 languages, M-BERT has been trained on the concatenation of monolingual data in more than 100 languages, without any cross-lingual mapping signal. There has recently been a vivid discussion on the cross-lingual abilities of M-BERT (Pires et al., 2019; K et al., 2020; Cao et al., 2020). In particular, Cao et al. (2020) show that M-BERT often yields disparate vector space representations for mutual translations and propose a multilingual re-mapping based on parallel corpora to remedy this issue. In this work, we introduce re-mapping solutions that are resource-leaner and require easy-to-obtain, limited-size word translation dictionaries rather than large parallel corpora.

3 Reference-Free MT Evaluation Metrics

In the following, we use x to denote a source sentence (i.e., a sequence of tokens in the source language), y to denote a system translation of x in the target language, and y* to denote the human reference translation for x.

3.1 Soft Token-Level Alignment

We start from MoverScore (Zhao et al., 2019), a recently proposed reference-based MT evaluation metric designed to measure the semantic similarity between system outputs (y) and human references (y*). It finds an optimal soft semantic alignment between tokens from y and y* by minimizing the
Word Mover's Distance (Kusner et al., 2015). In this work, we extend the MoverScore metric to operate in the cross-lingual setup, i.e., to measure the semantic similarity between n-grams (unigrams or bigrams) of the source text x and the system translation y, represented with embeddings originating from a cross-lingual semantic space.

First, we decompose the source text x into a sequence of n-grams, denoted by x^n = (x^n_1, ..., x^n_m), and then do the same for the system translation y, denoting the resulting sequence of n-grams by y^n. Given x^n and y^n, we can then define a distance matrix C such that C_ij = ||E(x^n_i) − E(y^n_j)||_2 is the distance between the i-th n-gram of x and the j-th n-gram of y, where E is a cross-lingual embedding function that maps text in different languages to a shared embedding space. With respect to the function E, we experimented with cross-lingual representations induced (a) from static word embeddings with RCSLS (Joulin et al., 2018) and (b) with M-BERT (Devlin et al., 2019) as the multilingual encoder, with a focus on the latter.
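To make the embedding function E concrete, the following is a minimal sketch of how last-layer M-BERT token representations can be obtained with the HuggingFace transformers library; the helper function and the pairwise-distance step are illustrative choices consistent with the description above, not the authors' released code.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Minimal sketch of E: map a sentence to one contextualized vector per word piece,
# taken from the last transformer layer of multilingual BERT.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")
model.eval()

def embed_tokens(sentence: str) -> torch.Tensor:
    """Return last-layer M-BERT representations, one row per word piece."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Drop the [CLS] and [SEP] positions; shape: (num_wordpieces, hidden_dim).
    return outputs.last_hidden_state[0, 1:-1]

# Source (German) and system translation (English) land in the same space,
# so pairwise Euclidean distances yield the distance matrix C from above.
src_vecs = embed_tokens("Putin teilte aus und beschuldigte Ankara.")
hyp_vecs = embed_tokens("Putin lashed out and accused Ankara.")
C = torch.cdist(src_vecs, hyp_vecs, p=2)
```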
WMD between the two sequences of n-grams x^n and y^n with associated n-gram weights2 f_{x^n} ∈ R^{|x^n|} and f_{y^n} ∈ R^{|y^n|} is defined as:

    m(x, y) := WMD(x^n, y^n) = min_F Σ_ij C_ij · F_ij,
    s.t.  F 1 = f_{x^n},  F^T 1 = f_{y^n},

where F ∈ R^{|x^n| × |y^n|} is a transportation matrix with F_ij denoting the amount of flow traveling from x^n_i to y^n_j.

2 We follow Zhao et al. (2019) in obtaining n-gram embeddings and their associated weights based on IDF.
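The minimization above is a small linear program; a self-contained sketch using SciPy is given below. The cost matrix C and the IDF-based weight vectors are assumed to be precomputed as described, and both weight vectors are assumed to sum to the same total mass; a dedicated optimal-transport solver (e.g., the POT package) would be preferable for speed, the generic LP here only spells out the constraints of the formula.

```python
import numpy as np
from scipy.optimize import linprog

def wmd(C: np.ndarray, f_x: np.ndarray, f_y: np.ndarray) -> float:
    """Solve min_F sum_ij C_ij * F_ij  s.t.  F @ 1 = f_x,  F.T @ 1 = f_y,  F >= 0."""
    n, m = C.shape
    # Row-sum constraints: one equality per source n-gram (variables are F flattened row-major).
    A_rows = np.zeros((n, n * m))
    for i in range(n):
        A_rows[i, i * m:(i + 1) * m] = 1.0
    # Column-sum constraints: one equality per target n-gram.
    A_cols = np.zeros((m, n * m))
    for j in range(m):
        A_cols[j, j::m] = 1.0
    res = linprog(C.ravel(),
                  A_eq=np.vstack([A_rows, A_cols]),
                  b_eq=np.concatenate([f_x, f_y]),
                  bounds=(0, None))
    return float(res.fun)
```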
3.2 Sentence-Level Semantic Similarity

In addition to measuring the semantic distance between x and y at the word level, one can also encode them into sentence representations with multilingual sentence encoders like LASER (Artetxe and Schwenk, 2019), and then measure their cosine distance:

    m(x, y) = 1 − (E(x)^T E(y)) / (||E(x)|| · ||E(y)||).
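With precomputed sentence embeddings (from LASER or any other multilingual sentence encoder), this metric is a one-liner; the small sketch below is only meant to fix the sign and normalization conventions.

```python
import numpy as np

def sentence_distance(e_x: np.ndarray, e_y: np.ndarray) -> float:
    """m(x, y) = 1 - cos(E(x), E(y)) for two precomputed sentence embeddings."""
    return 1.0 - float(e_x @ e_y) / (np.linalg.norm(e_x) * np.linalg.norm(e_y))
```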
3.3 Improving Cross-Lingual Alignments

Initial analysis indicated that, despite the multilingual pretraining of M-BERT (Devlin et al., 2019) and LASER (Artetxe and Schwenk, 2019), the monolingual subspaces of the multilingual spaces they induce are far from being semantically well-aligned, i.e., we obtain fairly distant vectors for mutual word or sentence translations.3 To this end, we apply two simple, weakly-supervised linear projection methods for post-hoc improvement of the cross-lingual alignments in these multilingual representation spaces.

3 LASER is jointly trained on parallel corpora of different languages, but for resource-lean language pairs the induced embeddings of mutual translations may be far apart.

Notation. Let D = {(w_ℓ^1, w_k^1), ..., (w_ℓ^n, w_k^n)} be a set of matched word or sentence pairs from two different languages ℓ and k. We define a re-mapping function f such that any f(E(w_ℓ)) and E(w_k) are better aligned in the resulting shared vector space. We investigate two resource-lean choices for the re-mapping function f.

Linear Cross-lingual Projection (CLP). Following related work (Schuster et al., 2019), we re-map contextualized embedding spaces using linear projection. Given ℓ and k, we stack all vectors of the source language words and target language words for the pairs in D, respectively, to form matrices X_ℓ and X_k ∈ R^{n×d}, with d as the embedding dimension and n as the number of word or sentence alignments. The word pairs we use to calibrate M-BERT are extracted from EuroParl (Koehn, 2005) using FastAlign (Dyer et al., 2013), and the sentence pairs to calibrate LASER are sampled directly from EuroParl.4 For M-BERT, we take the representations of the last transformer layer as the cross-lingual token representations. Mikolov et al. (2013a) propose to learn a projection matrix W ∈ R^{d×d} by minimizing the Euclidean distance between the projected source language vectors and their corresponding target language vectors:

    min_W ||W X_ℓ − X_k||_2.

Xing et al. (2015) achieve further improvement on the task of bilingual lexicon induction (BLI) by constraining W to be an orthogonal matrix, i.e., such that W^T W = I. This turns the optimization into the well-known Procrustes problem (Schönemann, 1966) with the following closed-form solution:

    Ŵ = U V^T,   U Σ V^T = SVD(X_ℓ X_k^T).

4 While LASER requires large parallel corpora in pretraining, we believe that fine-tuning/calibrating the embeddings post-hoc requires fewer data points.

We note that the above CLP re-mapping is known to have deficits, i.e., it requires the embedding spaces of the involved languages to be approximately isomorphic (Søgaard et al., 2018; Vulić et al., 2019).
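A minimal NumPy sketch of the orthogonal (Procrustes) variant of CLP follows; here the aligned embeddings are stacked as rows of X_src and X_tgt, so the closed-form solution is written in the transposed convention of the formula above, and the mapped space is obtained by multiplying the source vectors with W. The function is an illustrative sketch under these assumptions, not the authors' exact implementation.

```python
import numpy as np

def clp_remapping(X_src: np.ndarray, X_tgt: np.ndarray) -> np.ndarray:
    """Orthogonal Procrustes: W with W.T @ W = I minimizing ||X_src @ W - X_tgt||_F.
    X_src, X_tgt: (n, d) matrices of embeddings for the n aligned word/sentence pairs."""
    U, _, Vt = np.linalg.svd(X_src.T @ X_tgt)   # SVD of the d x d cross-covariance
    return U @ Vt

# Usage: learn W on the small EuroParl-derived dictionary, then re-map every
# source-language vector before computing the metrics: e_src_mapped = e_src @ W.
```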
Recently, some re-mapping methods that reportedly remedy this issue have been suggested (Cao et al., 2020; Glavaš and Vulić, 2020; Mohiuddin and Joty, 2020). We leave the investigation of these novel techniques for future work.

Universal Language Mismatch-Direction (UMD). Our second post-hoc linear alignment method is inspired by recent work on removing biases in distributional word vectors (Dev and Phillips, 2019; Lauscher et al., 2019). We adopt the same approach in order to quantify and remedy the "language bias", i.e., representation mismatches between mutual translations in the initial multilingual space. Formally, given ℓ and k, we create individual misalignment vectors E(w_ℓ^i) − E(w_k^i) for each bilingual pair in D. Then we stack these individual vectors to form a matrix Q ∈ R^{n×d}. We then obtain the global misalignment vector v_B as the top left singular vector of Q. The global misalignment vector presumably captures the direction of the representational misalignment between the languages better than the individual (noisy) misalignment vectors E(w_ℓ^i) − E(w_k^i). Finally, we modify all vectors E(w_ℓ) and E(w_k) by subtracting their projections onto the global misalignment direction vector v_B:

    f(E(w_ℓ)) = E(w_ℓ) − cos(E(w_ℓ), v_B) v_B.
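A sketch of UMD with NumPy: since the difference vectors are stacked here as rows of Q, the dominant misalignment direction is taken as the top right singular vector in this convention (the top left singular vector when the differences are stacked as columns, as in the text), and the second function implements the formula above verbatim.

```python
import numpy as np

def umd_direction(X_src: np.ndarray, X_tgt: np.ndarray) -> np.ndarray:
    """Dominant direction of the misalignment vectors Q = X_src - X_tgt (unit norm)."""
    Q = X_src - X_tgt                              # (n, d), one row per aligned pair
    _, _, Vt = np.linalg.svd(Q, full_matrices=False)
    return Vt[0]

def umd_remap(E: np.ndarray, v_b: np.ndarray) -> np.ndarray:
    """f(E(w)) = E(w) - cos(E(w), v_b) * v_b, applied to each row of E."""
    cos = (E @ v_b) / (np.linalg.norm(E, axis=1) * np.linalg.norm(v_b) + 1e-12)
    return E - np.outer(cos, v_b)
```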
Language Model. BLEU scores often fail to reflect the fluency level of translated texts (Edunov et al., 2019). Hence, we use a language model (LM) of the target language to regularize the cross-lingual semantic similarity metrics, by coupling our cross-lingual similarity scores with a GPT language model of the target language (Radford et al., 2018). We expect the language model to penalize translationese, i.e., unnatural word-by-word translations, and thus to boost the performance of our metrics.5

5 We linearly combine the cross-lingual metrics with the LM scores using a coefficient of 0.1 for all setups. We choose this value based on initial experiments on one language pair.
4 Experiments

In this section, we evaluate the quality of our MT reference-free metrics by comparing them with human judgments of translation quality.

Word-level metrics. We denote our word-level alignment metrics based on WMD as MoverScore-ngram + Align(Embedding), where Align is one of our two post-hoc cross-lingual alignment methods (CLP or UMD). For example, Mover-2 + UMD(M-BERT) denotes the metric combining MoverScore based on bigram alignments with M-BERT embeddings and UMD as the post-hoc alignment method.

Sentence-level metric. We denote our sentence-level metrics as Cosine + Align(Embedding). For example, Cosine + CLP(LASER) measures the cosine distance between the sentence embeddings obtained with LASER, post-hoc aligned with CLP.

4.1 Datasets

We collect the source language sentences and their system and reference translations from the WMT17-19 news translation shared tasks (Bojar et al., 2017b; Ma et al., 2018b, 2019), which contain predictions of 166 translation systems across 16 language pairs in WMT17, 149 translation systems across 14 language pairs in WMT18 and 233 translation systems across 18 language pairs in WMT19. We evaluate for X-en language pairs, selecting X from a set of 12 diverse languages: German (de), Chinese (zh), Czech (cs), Latvian (lv), Finnish (fi), Russian (ru), Turkish (tr), Gujarati (gu), Kazakh (kk), Lithuanian (lt) and Estonian (et). Each language pair in WMT17-19 has approximately 3,000 source sentences, each associated with one reference translation and with the automatic translations generated by participating systems.

4.2 Baselines

We compare with a range of reference-free metrics: ibm1-morpheme and ibm1-pos4gram (Popović, 2012), LASIM (Yankovskaya et al., 2019), LP (Yankovskaya et al., 2019), YiSi-2 and YiSi-2-srl (Lo, 2019), as well as with the reference-based baselines BLEU (Papineni et al., 2002), SentBLEU (Koehn et al., 2007) and ChrF++ (Popović, 2017) (see §2).6 The main results are reported on WMT17. We report the results obtained on WMT18 and WMT19 in the Appendix.

6 The code of these unsupervised metrics is not released, thus we compare to their official results on WMT19 only.
Setting    Metrics                                         cs-en  de-en  fi-en  lv-en  ru-en  tr-en  zh-en  Average
m(y*, y)   BLEU                                             43.5   43.2   57.1   39.3   48.4   53.8   51.2   48.1
           ChrF++                                           52.3   53.4   67.8   52.0   58.8   61.4   59.3   57.9
m(x, y)    Baseline with Original Embeddings
           Mover-1 + M-BERT                                 22.7   37.1   34.8   26.0   26.7   42.5   48.2   34.0
           Cosine + LASER                                   32.6   40.2   41.4   48.3   36.3   42.3   46.7   41.1
           Cross-lingual Alignment for Sentence Embedding
           Cosine + CLP(LASER)                              33.4   40.5   42.0   48.6   36.0   44.7   42.2   41.1
           Cosine + UMD(LASER)                              36.6   28.1   45.5   48.5   31.3   46.2   49.4   40.8
           Cross-lingual Alignment for Word Embedding
           Mover-1 + RCSLS                                  18.9   26.4   31.9   33.1   25.7   31.1   34.3   28.8
           Mover-1 + CLP(M-BERT)                            33.4   38.6   50.8   48.0   33.9   51.6   53.2   44.2
           Mover-2 + CLP(M-BERT)                            33.7   38.8   52.2   50.3   35.4   51.0   53.3   45.0
           Mover-1 + UMD(M-BERT)                            22.3   38.1   34.5   30.5   31.2   43.5   48.6   35.5
           Mover-2 + UMD(M-BERT)                            23.1   38.9   37.1   34.7   33.0   44.8   48.9   37.2
           Combining Language Model
           Cosine + CLP(LASER) ⊕ LM                         48.8   46.7   63.2   66.2   51.0   54.6   48.6   54.2
           Cosine + UMD(LASER) ⊕ LM                         49.4   46.2   64.7   66.4   51.1   56.0   52.8   55.2
           Mover-2 + CLP(M-BERT) ⊕ LM                       46.5   46.4   63.3   63.8   47.6   55.5   53.5   53.8
           Mover-2 + UMD(M-BERT) ⊕ LM                       41.8   46.8   60.4   59.8   46.1   53.8   52.4   51.6

Table 1: Pearson correlations with segment-level human judgments on the WMT17 dataset.

4.3 Results

Figure 1 shows that our metric Mover-2 + CLP(M-BERT) ⊕ LM, operating on modified M-BERT with the post-hoc re-mapping and combined with a target-side LM, outperforms BLEU by 5.7 points in segment-level evaluation and achieves comparable performance in the system-level evaluation. Figure 2 shows that the same metric obtains gains of 15 points (73.1 vs. 58.1), averaged over 7 languages, on WMT19 (system-level) compared to the state-of-the-art reference-free metric YiSi-2. Except for one language pair, gu-en, our metric performs on a par with the reference-based BLEU (see Table 8 in the Appendix).

[Figure 1: bar chart of averaged Pearson correlations on WMT17; segment-level: 48.1 (BLEU) vs. 53.8 (this work); system-level: 93.3 vs. 90.1.]
Figure 1: The averaged results of our best-performing metric, together with reference-based BLEU, on WMT17.

[Figure 2: averaged system-level Pearson correlations on WMT19; reference-based BLEU: 91.2; this work: 73.1; YiSi-2: 58.1; LASIM: 56.2; ibm1-morpheme: 52.4; LP: 48.1; ibm1-pos4gram: 33.9.]
Figure 2: The averaged results of our best-performing metric, together with the official results of reference-free metrics and reference-based BLEU, on system-level WMT19.

In Table 1, we exhaustively compare results for several of our metric variants, based either on M-BERT or LASER. We note that re-mapping has a considerable effect for M-BERT (up to 10 points of improvement), but much less so for LASER. We believe that this is because the underlying embedding space of LASER is less 'misaligned', since it has been (pre-)trained on parallel data.7 While the re-mapping is thus effective for metrics based on M-BERT, we still require the target-side LM to outperform BLEU. We assume the LM can address challenges that the re-mapping apparently is not able to handle properly; see our discussion in §5.1.

7 However, in the appendix, we find that re-mapping LASER using 2k parallel sentences achieves considerable improvements on low-resource languages, e.g., kk-en (from -61.1 to 49.8) and lt-en (from 68.3 to 75.9); see Table 8.

Overall, we remark that none of our metric combinations performs consistently best. The reason may be that LASER and M-BERT are pretrained on hundreds of languages with substantial differences in corpora sizes, in addition to the different effects of the re-mapping. However, we observe that Mover-2 + CLP(M-BERT) performs best
on average over all language pairs when the LM is not added. When the LM is added, Mover-2 + CLP(M-BERT) ⊕ LM and Cosine + UMD(LASER) ⊕ LM perform comparably. This indicates that there may be a saturation effect when it comes to the LM, or that the LM coefficient should be tuned individually for each semantic similarity metric based on cross-lingual representations.

5 Analysis

We first analyze the preferences of our metrics based on M-BERT and LASER (§5.1) and then examine how much parallel data we need for re-mapping our vector spaces (§5.2). Finally, we discuss whether it is legitimate to correlate our metric scores, which evaluate the similarity of system predictions and source texts, to human judgments based on system predictions and references (§5.3).

5.1 Metric preferences

To analyze why our metrics based on M-BERT and LASER perform so badly for the task of reference-free MT evaluation, we query them for their preferences. In particular, for a fixed source sentence x, we consider two target sentences ỹ and ŷ and evaluate the following score difference:

    d(ỹ, ŷ; x) := m(x, ỹ) − m(x, ŷ).    (1)

When d > 0, the metric m prefers ỹ over ŷ, given x; when d < 0, this relationship is reversed. In the following, we compare preferences of our metrics for specifically modified target sentences ỹ over the human references y*. We choose ỹ to be (1) a random reordering of y*, to ensure that our metrics do not have the BOW (bag-of-words) property, and (2) a word-order preserving translation of x (i.e., a human-written or automatic word-by-word translation coupled with human reordering of the English y*). The latter tests for preferences for literal translations, a common MT-system property.

Expert word-by-word translations and reordering. We had an expert (one of the co-authors) translate 50 German sentences word-by-word into English. Table 2 illustrates this scenario. We note how bad the word-by-word translations sometimes are, even for a closely related language pair such as German-English. For example, the word-by-word translations in English retain the original German verb-final positions, leading to quite ungrammatical English translations.

x             Dieser von Langsamkeit geprägte Lebensstil scheint aber ein Patentrezept für ein hohes Alter zu sein.
y*            However, this slow pace of life seems to be the key to a long life.
y*-random     To pace slow seems be the this life. life to a key however, of long
y*-reordered  This slow pace of life seems however the key to a long life to be.
x'-GT         This from slowness embossed lifestyle seems but on nostrum for on high older to his.
x'-expert     This of slow pace characterized life style seems however a patent recipe for a high age to be.

x             Putin teilte aus und beschuldigte Ankara, Russland in den Rücken gefallen zu sein.
y*            Mr Putin lashed out accusing Ankara of stabbing Moscow in the back.
y*-random     Moscow accusing lashed Putin the in Ankara out, Mr of back. stabbing
y*-reordered  Mr Putin lashed out, accusing Ankara of Moscow in the back stabbing.
x'-GT         Putin divided out and accused Ankara Russia in the move like to his.
x'-expert     Putin lashed out and accused Ankara, Russia in the back fallen to be.

Table 2: Original German input sentence x, together with the human reference y* in English, a randomly (y*-random) and expertly reordered (y*-reordered) English sentence, as well as word-by-word translations (x') of the German source sentence, obtained either from a human expert or from Google Translate (GT).

Figure 3 shows histograms of the d statistic for the 50 selected sentences. This illustrates that both Mover + M-BERT and Cosine + LASER prefer the original human references over random reorderings, indicating that they are not BOW models, a reassuring finding. They are largely indifferent between correct English word order and the situation where the word order of the human reference is the same as the German. Finally, they strongly prefer the expert word-by-word translations over the human references.

[Figure 3: three histogram panels of d scores for LASER- and M-BERT-based metrics: m(x, y*-random) − m(x, y*), m(x, y*-reordered) − m(x, y*), and m(x, x'-expert) − m(x, y*).]
Figure 3: Histograms of d scores defined in Eq. (1). Left: metrics based on LASER and M-BERT favor gold over randomly shuffled human references. Middle: metrics are roughly indifferent between gold and reordered human references. Right: metrics favor expert word-by-word translations over gold human references.

Taken together, this yields the following conclusions: (i) it appears that M-BERT and LASER are mostly indifferent between correct target language word order and the situation where source and target language have identical word order; (ii) this appears to make them prefer expert word-by-word translations the most: for a given source text, these have higher lexical overlap than human references and, in addition, they have a favorable target language syntax, viz., one where the source and target language word order are equal (confirmed by (i)). This explains why our metrics do not perform well, by themselves and without a language model, as reference-free MT evaluation metrics. More worryingly, it indicates that cross-lingual M-BERT and LASER are not robust to the 'adversarial inputs' given by MT systems.

Automatic word-by-word translations. For a large-scale analysis across different language pairs, we resort to automatic word-by-word translations obtained from Google Translate (GT). To do so, we look up each word independently of context in GT to compile dictionaries. When a word has several translations, we keep the first one offered by GT. Due to this context-independence, the GT word-by-word translations are of much lower quality than the expert word-by-word translations, since they often pick the wrong word senses—e.g., the German word sein may either be a personal pronoun (his) or the infinitive to be, which would be selected correctly only by chance; cf. Table 2.

Instead of reporting histograms of d, we define a "W2W" statistic that counts the relative number of times that d(x', y*) is positive, where x' denotes the literal translation of x into the target language:

    W2W := (1/N) Σ_{(x', y*)} I( d(x', y*) > 0 ).    (2)

Here N normalizes W2W to lie in [0, 1], and a high W2W score indicates that the metric prefers translationese over human-written references.
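The following sketch spells out Eqs. (1) and (2) for an arbitrary metric function; `metric` is a placeholder for any of the metrics above, assumed here to be oriented so that larger values mean a better translation (for a distance-oriented metric the comparison flips).

```python
from typing import Callable, Iterable, Tuple

Metric = Callable[[str, str], float]   # metric(x, y): score of translation y for source x

def d(metric: Metric, x: str, y_tilde: str, y_hat: str) -> float:
    """Eq. (1): d > 0 means the metric prefers y_tilde over y_hat, given x."""
    return metric(x, y_tilde) - metric(x, y_hat)

def w2w(metric: Metric, triples: Iterable[Tuple[str, str, str]]) -> float:
    """Eq. (2): fraction of (x, x_literal, y_reference) triples for which the metric
    prefers the word-by-word translation over the human reference."""
    triples = list(triples)
    hits = sum(1 for x, x_lit, y_ref in triples if d(metric, x, x_lit, y_ref) > 0)
    return hits / len(triples)
```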
Table 3 shows that reference-free metrics with the original embeddings (LASER and M-BERT) either still prefer literal over human translations (e.g., 70.2 in cs-en) or struggle to distinguish them. Re-mapping helps to a small degree. Only when combined with the LM scores do we obtain adequate W2W scores. Indeed, the LM is expected to capture unnatural word order in the target language and to penalize word-by-word translations by recognizing them as much less likely to appear in the language.

Note that for expert word-by-word translations, we would expect the metrics to perform even worse.

Metrics                        cs-en  de-en  fi-en
Cosine + LASER                  70.2   65.7   53.9
Cosine + CLP(LASER)             70.7   64.8   53.7
Cosine + UMD(LASER)             67.5   59.5   52.9
Cosine + UMD(LASER) ⊕ LM         7.0    7.1    6.4
Mover-2 + M-BERT                61.8   50.2   45.9
Mover-2 + CLP(M-BERT)           44.6   44.5   32.0
Mover-2 + UMD(M-BERT)           54.5   44.3   39.6
Mover-2 + CLP(M-BERT) ⊕ LM       7.3   10.2    6.4

Table 3: W2W statistics for selected language pairs.

5.2 Size of Parallel Corpora

Figure 4 compares sentence- and word-level re-mapping trained with a varying number of parallel sentences. Metrics based on M-BERT result in the highest correlations after re-mapping, even with a small amount of training data (1k). We observe that Cosine + CLP(LASER) and Mover-2 + CLP(M-BERT) show very similar trends, with a sharp increase for increasing amounts of parallel data which then quickly levels off. However, the M-BERT-based Mover-2 reaches its peak and outperforms the original baseline with only 1k data points, while LASER needs 2k before beating the corresponding original baseline.

[Figure 4: line plot ("Seven-languages-to-English") of Pearson correlation against the number of parallel sentences (2,000–10,000) for Mover-2 + CLP(M-BERT), Mover-2 + UMD(M-BERT), Mover-2 + M-BERT, Cosine + CLP(LASER), Cosine + UMD(LASER), and Cosine + LASER.]
Figure 4: Comparison of sentence- and word-level alignment for different sizes of the parallel corpus (x-axis). The results are averaged over seven language pairs in WMT17.

5.3 Human Judgments

The WMT datasets contain segment- and system-level human judgments that we use for evaluating the quality of our reference-free metrics. The segment-level judgments assign one direct assessment (DA) score to each pair of system and human translation, while system-level judgments associate each system with a single DA score averaged across all pairs in the dataset. We initially suspected the DA scores to be biased for our setup—which compares x with y—as they are based on comparing y* and y. Indeed, it is known that (especially) human professional translators "improve" y*, e.g., by making it more readable, relative to the original x (Rabinovich et al., 2017). We investigated the validity of the DA scores by collecting human assessments in the cross-lingual setting (CLDA), where annotators directly compare source and translation pairs (x, y) from the WMT17 dataset. This small-scale manual analysis hints that DA scores are a valid proxy for CLDA. Therefore, we decided to treat them as reliable scores for our setup and evaluate our proposed metrics by comparing their correlation with the DA scores.
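The segment-level numbers in Table 1 are Pearson correlations between per-segment metric scores and these DA judgments; a minimal sketch of that final evaluation step is given below (the two aligned score lists are assumed to be available per language pair).

```python
from scipy.stats import pearsonr

def segment_level_correlation(metric_scores, da_scores) -> float:
    """Pearson correlation between per-segment metric scores and human DA scores."""
    r, _p = pearsonr(metric_scores, da_scores)
    return r
```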
6 Conclusion

Existing semantically-motivated metrics for reference-free evaluation of MT systems have so far displayed rather poor correlation with human estimates of translation quality. In this work, we investigate a range of reference-free metrics based on cutting-edge models for inducing cross-lingual semantic representations: cross-lingual (contextualized) word embeddings and cross-lingual sentence embeddings. We have identified some scenarios in which these metrics fail, prominently their inability to punish literal word-by-word translations (the so-called "translationese"). We have investigated two different mechanisms for mitigating this undesired phenomenon: (1) an additional (weakly-supervised) cross-lingual alignment step, reducing the mismatch between representations of mutual translations, and (2) language modeling (LM) on the target side, which is inherently equipped to punish "unnatural" sentences in the target language. We show that the reference-free coupling of cross-lingual similarity scores with the target-side language model surpasses the reference-based BLEU in segment-level MT evaluation.

We believe our results have two relevant implications. First, they portray the viability of reference-free MT evaluation and warrant wider research efforts in this direction. Second, they indicate that reference-free MT evaluation may be the challenging ("adversarial") evaluation task for multilingual text encoders, as it uncovers some of their shortcomings (prominently, the inability to capture semantically non-sensical word-by-word translations or paraphrases) which remain hidden in their common evaluation settings.
Acknowledgments

We thank the anonymous reviewers for their insightful comments and suggestions, which greatly improved the final version of the paper. This work has been supported by the German Research Foundation as part of the Research Training Group Adaptive Preparation of Information from Heterogeneous Sources (AIPHES) at the Technische Universität Darmstadt under grant No. GRK 1994/1. The contribution of Goran Glavaš is supported by the Eliteprogramm of the Baden-Württemberg-Stiftung, within the scope of the grant AGREE.

References

Eneko Agirre, Carmen Banea, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2016. SemEval-2016 Task 1: Semantic Textual Similarity, Monolingual and Cross-Lingual Evaluation. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 497–511, San Diego, California. Association for Computational Linguistics.

Mikel Artetxe and Holger Schwenk. 2019. Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond. Transactions of the Association for Computational Linguistics, 7:597–610.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations.

Ondrej Bojar, Yvette Graham, and Amir Kamran. 2017a. Results of the WMT17 metrics shared task. In Proceedings of the Conference on Machine Translation (WMT).
References

Eneko Agirre, Carmen Banea, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2016. SemEval-2016 Task 1: Semantic Textual Similarity, Monolingual and Cross-Lingual Evaluation. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 497–511, San Diego, California. Association for Computational Linguistics.

Mikel Artetxe and Holger Schwenk. 2019. Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond. Transactions of the Association for Computational Linguistics, 7:597–610.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations.

Ondřej Bojar, Yvette Graham, and Amir Kamran. 2017a. Results of the WMT17 metrics shared task. In Proceedings of the Conference on Machine Translation (WMT).

Ondřej Bojar, Yvette Graham, and Amir Kamran. 2017b. Results of the WMT17 metrics shared task. In Proceedings of the Second Conference on Machine Translation, pages 489–513, Copenhagen, Denmark. Association for Computational Linguistics.

Steven Cao, Nikita Kitaev, and Dan Klein. 2020. Multilingual alignment of contextual word representations. In International Conference on Learning Representations.

Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14, Vancouver, Canada. Association for Computational Linguistics.

Elizabeth Clark, Asli Celikyilmaz, and Noah A. Smith. 2019. Sentence Mover's Similarity: Automatic Evaluation for Multi-Sentence Texts. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2748–2760, Florence, Italy. Association for Computational Linguistics.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of ACL.

Sunipa Dev and Jeff M. Phillips. 2019. Attenuating bias in word vectors. In The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, 16-18 April 2019, Naha, Okinawa, Japan, pages 879–887.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

George Doddington. 2002. Automatic Evaluation of Machine Translation Quality Using N-gram Co-occurrence Statistics. In Proceedings of the Second International Conference on Human Language Technology Research, HLT '02, pages 138–145, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Markus Dreyer and Daniel Marcu. 2012. HyTER: Meaning-equivalent semantics for translation evaluation. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 162–171. Association for Computational Linguistics.

Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A simple, fast, and effective reparameterization of IBM model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 644–648, Atlanta, Georgia. Association for Computational Linguistics.

Sergey Edunov, Myle Ott, Marc'Aurelio Ranzato, and Michael Auli. 2019. On the evaluation of machine translation systems trained with back-translation. CoRR, abs/1908.05204.

Marina Fomicheva and Lucia Specia. 2016. Reference bias in monolingual machine translation evaluation. In 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016 - Short Papers, pages 77–82. Association for Computational Linguistics.

Goran Glavaš, Robert Litschko, Sebastian Ruder, and Ivan Vulić. 2019. How to (properly) evaluate cross-lingual word embeddings: On strong baselines, comparative analyses, and some misconceptions. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 710–721.

Goran Glavaš and Ivan Vulić. 2020. Non-linear instance-based cross-lingual mapping for non-isomorphic embedding spaces. In Proceedings of ACL.

Yinuo Guo and Junfeng Hu. 2019. Meteor++ 2.0: Adopt Syntactic Level Paraphrase Knowledge into Machine Translation Evaluation. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 501–506, Florence, Italy. Association for Computational Linguistics.

Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. 2020. XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization. CoRR, abs/2003.11080.

Melvin Johnson, Mike Schuster, Quoc Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.

Martin Josifoski, Ivan S. Paskov, Hristo S. Paskov, Martin Jaggi, and Robert West. 2019. Crosslingual document embedding as reduced-rank ridge regression. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pages 744–752.

Armand Joulin, Piotr Bojanowski, Tomas Mikolov, Hervé Jégou, and Edouard Grave. 2018. Loss in translation: Learning bilingual word mapping with a retrieval criterion. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2979–2984, Brussels, Belgium. Association for Computational Linguistics.

Karthikeyan K, Zihan Wang, Stephen Mayhew, and Dan Roth. 2020. Cross-lingual ability of multilingual BERT: An empirical study. In International Conference on Learning Representations.

Alexandre Klementiev, Ivan Titov, and Binod Bhattarai. 2012. Inducing crosslingual distributed representations of words. In Proceedings of COLING 2012, pages 1459–1474, Mumbai, India. The COLING 2012 Organizing Committee.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit, volume 5, pages 79–86. Citeseer.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic. Association for Computational Linguistics.

Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. 2015. From word embeddings to document distances. In International Conference on Machine Learning, pages 957–966.

Anne Lauscher, Goran Glavaš, Simone Paolo Ponzetto, and Ivan Vulić. 2019. A general framework for implicit and explicit debiasing of distributional word vector spaces. arXiv preprint arXiv:1909.06092.

Alon Lavie and Abhaya Agarwal. 2007. METEOR: An automatic metric for MT evaluation with high levels of correlation with human judgments. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 228–231. Association for Computational Linguistics.

Chin-Yew Lin and Eduard Hovy. 2003. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 150–157.

Chi-kiu Lo. 2019. YiSi - a Unified Semantic MT Quality Evaluation and Estimation Metric for Languages with Different Levels of Available Resources. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 507–513, Florence, Italy. Association for Computational Linguistics.

Chi-kiu Lo, Meriem Beloucif, Markus Saers, and Dekai Wu. 2014. XMEANT: Better semantic MT evaluation without reference translations. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 765–771, Baltimore, Maryland. Association for Computational Linguistics.

Qingsong Ma, Ondřej Bojar, and Yvette Graham. 2018a. Results of the WMT18 metrics shared task. In Proceedings of the Third Conference on Machine Translation (WMT).

Qingsong Ma, Ondřej Bojar, and Yvette Graham. 2018b. Results of the WMT18 metrics shared task: Both characters and embeddings achieve good performance. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 671–688, Brussels, Belgium. Association for Computational Linguistics.

Qingsong Ma, Johnny Wei, Ondřej Bojar, and Yvette Graham. 2019. Results of the WMT19 Metrics Shared Task: Segment-Level and Strong MT Systems Pose Big Challenges. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 62–90, Florence, Italy. Association for Computational Linguistics.

Nitika Mathur, Timothy Baldwin, and Trevor Cohn. 2019. Putting Evaluation in Context: Contextual Embeddings Improve Machine Translation Evaluation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2799–2808, Florence, Italy. Association for Computational Linguistics.

Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013a. Exploiting similarities among languages for machine translation. CoRR, abs/1309.4168.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, Lake Tahoe, Nevada, United States, pages 3111–3119.

Tasnim Mohiuddin, M Saiful Bari, and Shafiq Joty. 2020. LNMap: Departures from isomorphic assumption in bilingual lexicon induction through non-linear mapping in latent space. CoRR, abs/2004.13889.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, pages 311–318, Stroudsburg, PA, USA. Association for Computational Linguistics.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL).

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001, Florence, Italy. Association for Computational Linguistics.

Maja Popović. 2012. Morpheme- and POS-based IBM1 and language model scores for translation quality estimation. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 133–137, Montréal, Canada. Association for Computational Linguistics.

Maja Popović. 2017. chrF++: Words Helping Character N-grams. In Proceedings of the Second Conference on Machine Translation, pages 612–618, Copenhagen, Denmark.

Maja Popović, David Vilar, Eleftherios Avramidis, and Aljoscha Burchardt. 2011. Evaluation without references: IBM1 scores as evaluation metrics. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 99–103, Edinburgh, Scotland. Association for Computational Linguistics.

Ella Rabinovich, Noam Ordan, and Shuly Wintner. 2017. Found in translation: Reconstructing phylogenetic language trees from translations. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 530–540, Vancouver, Canada. Association for Computational Linguistics.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL: https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf.

Andreas Rücklé, Steffen Eger, Maxime Peyrard, and Iryna Gurevych. 2018. Concatenated power mean word embeddings as universal cross-lingual sentence representations. arXiv preprint.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379–389. Association for Computational Linguistics.

Peter H. Schönemann. 1966. A generalized solution of the orthogonal Procrustes problem. Psychometrika, 31(1):1–10.

Tal Schuster, Ori Ram, Regina Barzilay, and Amir Globerson. 2019. Cross-lingual alignment of contextual word embeddings, with applications to zero-shot dependency parsing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1599–1613, Minneapolis, Minnesota. Association for Computational Linguistics.

Hiroki Shimanaka, Tomoyuki Kajiwara, and Mamoru Komachi. 2018. RUSE: Regressor using sentence embeddings for automatic machine translation evaluation. In Proceedings of the Third Conference on Machine Translation (WMT).

Anders Søgaard, Sebastian Ruder, and Ivan Vulić. 2018. On the limitations of unsupervised bilingual dictionary induction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 778–788, Melbourne, Australia. Association for Computational Linguistics.

Lucia Specia, Dhwaj Raj, and Marco Turchi. 2010. Machine translation evaluation versus quality estimation. Machine Translation, 24(1):39–50.

Lucia Specia, Kashif Shah, Jose G.C. de Souza, and Trevor Cohn. 2013. QuEst - a translation quality estimation framework. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 79–84, Sofia, Bulgaria. Association for Computational Linguistics.

Jiwei Tan, Xiaojun Wan, and Jianguo Xiao. 2017. Abstractive document summarization with a graph-based attentional neural model. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1171–1181. Association for Computational Linguistics.

Ivan Vulić, Goran Glavaš, Roi Reichart, and Anna Korhonen. 2019. Do we really need fully unsupervised cross-lingual embeddings? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4398–4409.

Chao Xing, Dong Wang, Chao Liu, and Yiye Lin. 2015. Normalized word embedding and orthogonal transform for bilingual word translation. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1006–1011, Denver, Colorado. Association for Computational Linguistics.

Yinfei Yang, Daniel Cer, Amin Ahmad, Mandy Guo, Jax Law, Noah Constant, Gustavo Hernandez Abrego, Steve Yuan, Chris Tar, Yun-Hsuan Sung, et al. 2019. Multilingual universal sentence encoder for semantic retrieval. arXiv preprint arXiv:1907.04307.

Elizaveta Yankovskaya, Andre Tättar, and Mark Fishel. 2019. Quality Estimation and Translation Metrics via Pre-trained Word and Sentence Embeddings. In Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2), pages 101–105, Florence, Italy. Association for Computational Linguistics.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2019. BERTScore: Evaluating text generation with BERT. CoRR, abs/1904.09675.

Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M. Meyer, and Steffen Eger. 2019. MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Hong Kong, China. Association for Computational Linguistics.
7     Appendix
7.1   Zero-shot Transfer to Resource-lean Languages
Our metric allows for estimating translation quality on new domains. However, the evaluation is limited to the languages covered by the multilingual embeddings. This is a major drawback for low-resource languages; for example, Gujarati is not included in LASER. To illustrate, we take multilingual USE (Yang et al., 2019), which covers only 16 languages (in our sample, Czech, Latvian, and Finnish are not covered by USE). We re-align the corresponding embedding spaces with our re-mapping functions to induce evaluation metrics even for these languages, using only 2k translation pairs. Table 4 shows that our metric with a composition of re-mapping functions raises the correlation from zero to 0.10 for cs-en and to 0.18 for lv-en. However, for one language pair, fi-en, the correlation only goes from negative to zero, indicating that this approach does not always work. This observation warrants further investigation.

 Metrics                       cs-en     fi-en     lv-en
 BLEU                          0.849     0.834     0.946
 COSINE + LASER               -0.001    -0.149     0.019
 COSINE + CLP(USE)             0.072    -0.068     0.109
 COSINE + UMD(USE)             0.056    -0.061     0.113
 COSINE + CLP ◦ UMD(USE)       0.089    -0.030     0.162
 COSINE + UMD ◦ CLP(USE)       0.102    -0.007     0.180

Table 4: Pearson correlation of metrics on the segment-level WMT17 dataset. '◦' marks the composition of two re-mapping functions.