Measuring sentence parallelism using Mahalanobis distances: The NRC unsupervised submissions to the WMT18 Parallel Corpus Filtering shared task

Patrick Littell, Samuel Larkin, Darlene Stewart, Michel Simard, Cyril Goutte, Chi-kiu Lo
National Research Council of Canada
1200 Montreal Road, Ottawa ON, K1A 0R6
Firstname.Lastname@cnrc-nrc.gc.ca

Abstract

The WMT18 shared task on parallel corpus filtering (Koehn et al., 2018b) challenged teams to score sentence pairs from a large high-recall, low-precision web-scraped parallel corpus (Koehn et al., 2018a). Participants could use existing sample corpora (e.g. past WMT data) as a supervisory signal to learn what a “clean” corpus looks like. However, in lower-resource situations it often happens that the target corpus of the language is the only sample of parallel text in that language. We therefore made several unsupervised entries, setting ourselves an additional constraint that we not utilize the additional clean parallel corpora. One such entry fairly consistently scored in the top ten systems in the 100M-word conditions, and for one task—translating the European Medicines Agency corpus (Tiedemann, 2009)—scored among the best systems even in the 10M-word conditions.

1 Introduction and motivation

The WMT18 shared task on parallel corpus filtering assumes (but does not require) a supervised learning approach. Given

  1. a set of “clean” German-English parallel corpora including past WMT data, Europarl (Koehn, 2005), etc., and

  2. a large, potentially “dirty” corpus (i.e., one that may contain non-parallel data, non-linguistic data, etc.) scraped from the internet (Koehn et al., 2018a),

can one identify which sentences from (2) are clean? Supervised learning is an obvious approach in well-resourced languages like German and English, in which there exist well-cleaned parallel corpora across various domains.

However, in much lower-resourced languages, we generally do not have multiple parallel corpora in a given language pair to assess the quality of the corpus at hand; the corpus to be evaluated is often the only one available.1 If we want to assess the quality of one corpus, we cannot rely on a supervisory signal derived from additional, cleaner corpora. We therefore do not utilize the additional parallel corpora (except as additional sources of monolingual data).

    1 We are thinking in particular of the English-Inuktitut translation pair, which is a long-standing research interest of NRC (e.g. Martin et al., 2003).

The systems described in this paper were inspired instead by anomaly detection approaches: can we instead attempt to identify sentence pairs that are, in some way, “strange” for this dataset? Considering each sentence pair as a draw from a distribution of high-dimensional vectors, we define an anomalous sentence pair as one whose draw was improbable compared to the probability of drawing its component sentences independently. The resulting measure, conceptually similar to pointwise mutual information albeit couched in terms of Mahalanobis distances rather than actual probabilities, is detailed in §3.

A submission based primarily on this one measurement (with some pre- and post-processing to avoid duplicate and near-duplicate sentences) performed consistently above the median in the 100M-word conditions, and for a few tasks (particularly EMEA translation) was among the top systems even for the 10M-word conditions. It was also the #2 system in one of the dev conditions (WMT newstest2017, NMT trained on 100M words), which is surprising given that it could not have overfit to the development set; it did not utilize the WMT17 development set in any way.

Proceedings of the Third Conference on Machine Translation (WMT), Volume 2: Shared Task Papers, pages 900–907
Belgium, Brussels, October 31 - November 1, 2018. © 2018 Association for Computational Linguistics
https://doi.org/10.18653/v1/W18-64107
2 Overall architecture

The highest-ranked of our unsupervised submissions, NRC-seve-bicov, shares the same general skeleton as NRC’s highest-ranked supervised submission, NRC-yisi-bicov (Lo et al., 2018); it differs primarily in the parallelism estimation component (§2.3).

2.1 Training sentence embeddings

We began by training monolingual sentence embeddings using sent2vec (Pagliardini et al., 2018), on all available monolingual data. This included the monolingual data available in the “clean” parallel training data. That is to say, we did not completely throw out the clean parallel data for this task; we simply used it as two unaligned monolingual corpora.

We trained sentence vectors of 10, 50, 100, 300, and 700 dimensions; our final submissions used the 300-dimensional vectors as a compromise between accuracy (lower-dimensional vectors had lower accuracy during sanity-checking) and efficiency (higher-dimensional vectors ended up exceeding our memory capacity in downstream components).

In a system such as this, which is looking for “strange” sentence pairs, training on additional monolingual data beyond the target corpus carries some risks. If the additional monolingual data were to have very different domain characteristics (say, mostly religious text in the first language and mostly medical text in the second), then the two vector spaces could encode different types of sentence as “normal”. On the other hand, not using additional monolingual data carries its own risks; monolingual data that is domain-balanced could help to mitigate domain mismatches in the target parallel data (say, newswire text erroneously misaligned to sequences of dates).

2.2 Pre-filtering

Although the input data had already been de-duplicated by the shared task organizers, we did an additional de-duplication step in which email addresses and URLs were replaced with a placeholder token and numbers were removed, before deciding which sentences were duplicates. We had noticed that large amounts of data consisted of short sentences that were largely numbers (for example, long lists of dates). Although these sentences were indeed unique, we noticed that several of our parallelism measurements ended up preferring such sentences to such an extent that the resulting MT training sets were disproportionately dates, and performed comparatively poorly when tasked with translating full sentences. To mitigate this, we ran an additional de-duplication step on the English side in which two sentences that differ only in numbers (e.g., “14 May 2017” and “19 May 1996”) were considered duplicates.

Without numerical de-duplication, we believe the parallelism estimation step in §2.3 would have had too much of a bias towards short numerical sentences. It is, after all, essentially just looking for sentence pairs that it considers likely given the distribution of sentence pairs in the target corpus; if the corpus has a large number of short numerical sentences (and it appears to), the measurement will come to prefer those, whether or not they are useful for the downstream task.

The additional de-duplication also had a practical benefit in that the resulting corpus was much smaller, allowing us to perform calculations in memory (e.g., that in §3.2) on the entire corpus at once rather than having to approximate them in mini-batches.

We also discarded sentence pairs that were exactly the same on each side, in which one sentence contained more than 150 tokens, in which the two sentences’ numbers did not match, or in which there were suspiciously non-German or non-English sentences according to the pyCLD2 language detector.2 When pyCLD2 believed a putatively German sentence to be something other than German with certainty greater than 0.5, or a putatively English sentence to be something other than English with certainty greater than 0.5, it was discarded.

    2 https://github.com/aboSamoor/pycld2

2.3 Parallelism estimation

With sentence vectors (§2.1) for the reduced corpus (§2.2) in hand, we set out to estimate the degree of parallelism of sentence pairs. A novel measure of parallelism, based on ratios of squared Mahalanobis distances, performed better on a synthetic dataset than some more obvious measurements, and the single-feature submission based on it was our best unsupervised submission.
We also made several other unsupervised measurements:

  1. Perplexity of the German sentence according to a 6-gram KenLM language model3 (Heafield, 2011)

  2. Perplexity of the English sentence according to a 6-gram KenLM language model

  3. The ratio between (1) and (2), to find sentence pairs that contain different amounts of information

  4. Cosine distances between German and English sentence vectors, in a bilingual sent2vec space trained only on the target corpus

    3 Although we assumed that high perplexity sentences would be worse—that they might be ungrammatical, for example—sanity checking suggested higher-perplexity sentences were actually better. Error analysis later suggested that many non-parallel (or parallel but non-informative) sentences were short, possibly explaining why taking perplexity as a positive feature resulted in higher scores in sanity-checking.

As we did not have a supervisory signal, we did not have a principled way of choosing weights for these features. Instead, we simply took an unweighted average of the above four features and the Mahalanobis feature in §3.2, after rescaling each to the interval [0.0, 1.0]. As seen in §5, systems based on this feature combination (NRC-mono-bicov and NRC-mono) were outperformed by our single-feature system in most conditions.

We also considered combinations of these unsupervised measurements with supervised measurements, but this attempt was also unsuccessful compared to a system that used only a single supervised measurement for sentence pair ranking (Lo et al., 2018).

2.4 Post-filtering

After scoring each sentence for parallelism, we performed another de-duplication step. In this step, we iterated over each target-language sentence in order of parallelism (that is, sentences assessed to have the highest parallelism were considered first), and removed pairs that only consisted of bigrams that had already been seen. (That is to say, a sentence pair was kept only if it contained a bigram that had not previously been seen.)

This step has to occur after quality assessment because, in contrast to regular de-duplication, the sentences in question are not identical; the sentence (and the pair it comes from) may differ in quality from the sentence(s) that make it a duplicate, so we want to keep the best such sentence, not just the one that happened to come first in the original corpus.

3 Mahalanobis ratios for parallelism assessment

As mentioned in §2.3, we performed several unsupervised measurements on each sentence pair; of these, the measurement that best predicted parallelism (on synthetic data and on our small 300-sentence annotated set) was a novel measurement based on squared Mahalanobis distances.

This measurement rests on two insights:

  • If sentence vectors (or in our case, sentence-pair vectors) are normally distributed, the probability that we draw a particular vector (or a more extreme vector) is related to the squared Mahalanobis distance via the χ2 distribution.

  • If the two sentences relate the same information, the probability of drawing the vector for that pair should not be much less than the probability of drawing the individual sentence vectors in isolation.

While Mahalanobis distance is a common statistical measurement, particularly in anomaly detection (e.g. Reed and Yu, 1990), it is not commonly used in machine translation, so we briefly introduce it below.4

    4 The following relies heavily on the explanation in Boggs (2014). Note that this explanation is also concerned with the square of the Mahalanobis distance rather than the Mahalanobis distance; it is typical for authors to describe both as “Mahalanobis distance” in prose (cf. Warren et al., 2011, p. 10). It is also typical to use “Mahalanobis distance” to specifically refer to Mahalanobis distance from a point to the mean, although this distance is defined for any two points.

3.1 Mahalanobis distance

The probability of a draw from a univariate normal distribution can be related to its distance to the mean in terms of standard deviations (the z-score). In a multivariate normal distribution, however, just measuring the Euclidean distance to the mean can lead to incorrect conclusions; visual inspection of Figure 1a illustrates that the red vector, despite being a clear outlier, is nonetheless closer to the mean than the blue vector.

Rather, the appropriate measurement for relating distance to probability is the square of the Mahalanobis distance (Mahalanobis, 1936); for a vector x from distribution X with correlation Σ and mean µ:
    d^2(x) = (x - \mu)^T \Sigma^{-1} (x - \mu)                        (1)

This is equivalent to decorrelating and rescaling to unit variance in all dimensions, via the inverse square root of the correlation matrix (“Mahalanobis whitening”), and then measuring the squared Euclidean distance to the mean in the resulting space.

    d^2(x) = (x - \mu)^T \Sigma^{-1/2} \Sigma^{-1/2} (x - \mu)        (2)
           = (\Sigma^{-1/2} (x - \mu))^T (\Sigma^{-1/2} (x - \mu))    (3)
           = \| \Sigma^{-1/2} (x - \mu) \|_2^2                        (4)

Figure 1b illustrates the same distribution transformed by Σ^{-1/2}; we can see that now the magnitude of the outlier red vector is greater than the magnitude of the blue vector.

[Figure 1: two panels, (a) Euclidean distance and (b) Mahalanobis distance. Euclidean distance to the mean in a multivariate normal distribution is not necessarily related to probability; in figure (a), the red vector, despite being an outlier, is closer to the mean. In figure (b), we have rescaled and decorrelated the distribution; Euclidean distance measured in the resulting space (the Mahalanobis distance) can be related to probability through the χ2 distribution.]

As mentioned above, the squared magnitudes can be used to calculate probabilities, but in practice the probabilities were so similar in higher-dimensional spaces as to be identical. There remains the possibility, however, that the magnitudes themselves remain sufficiently informative; this was borne out in practice.

3.2 Calculating the magnitude ratios

We have high-dimensional vectors, trained monolingually, of German and English sentences (§2.1). We consider their joint distribution by simply concatenating their vectors; there is no additional utility here in learning a translation between the monolingual spaces. We recenter the distribution to have zero mean—this simply makes the calculation and presentation easier—and transform the resulting matrix by Σ^{-1/2}.

For each sentence vector pair ⟨l1, l2⟩ (after recentering), we consider three vectors in the transformed space:

  • the vector e1 corresponding only to l1’s contribution to the concatenated and transformed vector (as if l2 = 0)
  • the vector e2 corresponding only to l2’s contribution (as if l1 = 0)
  • the vector e corresponding to the transformation of the concatenation of l1 and l2

    e_1 = \Sigma^{-1/2} (l_1, \vec{0})                                (5)
    e_2 = \Sigma^{-1/2} (\vec{0}, l_2)                                (6)
    e   = \Sigma^{-1/2} (l_1, l_2) = e_1 + e_2                        (7)

The measurement m we are interested in is the squared magnitude of the combined vector, divided by the sum of the squared magnitudes of e1 and e2 alone.

    m = \frac{\|e\|_2^2}{\|e_1\|_2^2 + \|e_2\|_2^2}                   (8)

Roughly speaking, does the sentence pair vector e in Mahalanobis space give more information (expressed in terms of its squared magnitude) than the component sentence vectors e1 and e2 do on their own? If so, we consider them unlikely to be parallel.
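The identities in equations (1)–(8) are easy to check numerically. A small sketch (the toy two-dimensional data and sample sizes are ours, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "sentence vectors": 2-dimensional German and English vectors,
# with the second set loosely dependent on the first.
L1 = rng.normal(size=(1000, 2))
L2 = 0.8 * L1 + 0.2 * rng.normal(size=(1000, 2))

X = np.concatenate([L1, L2], axis=1)
X -= X.mean(axis=0)
cov = np.cov(X, rowvar=False)

# Inverse square root of the covariance matrix ("Mahalanobis
# whitening"), via its eigendecomposition.
w, V = np.linalg.eigh(cov)
W = V @ np.diag(w ** -0.5) @ V.T

x = X[0]
# Squared Mahalanobis distance, eq. (1)...
d2_direct = x @ np.linalg.inv(cov) @ x
# ...equals the squared Euclidean norm after whitening, eq. (4).
d2_white = np.sum((W @ x) ** 2)
assert np.allclose(d2_direct, d2_white)

# Eqs. (5)-(7): by linearity, e decomposes as e1 + e2.
e1 = W @ np.concatenate([x[:2], np.zeros(2)])
e2 = W @ np.concatenate([np.zeros(2), x[2:]])
e = W @ x
assert np.allclose(e, e1 + e2)

# Eq. (8): the ratio m; lower values suggest a more "parallel" pair.
m = np.sum(e ** 2) / (np.sum(e1 ** 2) + np.sum(e2 ** 2))
```

The two assertions confirm the whitening equivalence of equations (2)–(4) and the decomposition of equation (7) on concrete data.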
We take the resulting value m to be the ranking (with lower values being better) for the post-filtering step described in §2.4.

Implementation-wise, we do not actually have to concatenate l2 or l1 with zeros in order to calculate (5) and (6); we can just multiply l1 and l2 by the relevant sub-matrix of Σ^{-1/2}. It is also unnecessary to actually transform the vector corresponding to the concatenation of ⟨l1, l2⟩; the result is just the element-wise sum of e1 and e2.

In code, this is a very simple calculation (only about 15 lines of Python+NumPy) and efficient (taking only a few minutes for millions of sentences), provided one has enough system memory to calculate it in one fell swoop. A sample implementation is given in Figure 2.

import numpy as np

def mahalanobis_whitening(X):
    # inverse square root of covariance
    cov = np.cov(X, rowvar=False)
    inv_cov = np.linalg.inv(cov)
    L, V = np.linalg.eig(inv_cov)
    diag = np.diag(np.sqrt(L))
    return V.dot(diag).dot(V.T)

def ssq(X):  # sum of squares
    return np.sum(X * X, axis=1)

def mahalanobis_ratio(L1, L2):
    L1 -= L1.mean(axis=0)
    L2 -= L2.mean(axis=0)
    L = np.concatenate([L1, L2], axis=1)
    whitener = mahalanobis_whitening(L)
    E1 = L1.dot(whitener[:L1.shape[1], :])
    E2 = L2.dot(whitener[L1.shape[1]:, :])
    return ssq(E1 + E2) / (ssq(E1) + ssq(E2))

Figure 2: Sample implementation of the Mahalanobis ratio calculation in Python, for two n×d NumPy arrays representing n samples of d-dimensional sentence vectors for two languages.

4 Internal results

4.1 Synthetic data

The unsupervised measurements on the sentence vectors were first tested on purely synthetic data: two sets of random normal vectors L1 and L2, in which some proportion p of vectors in L1 corresponded to L2 via a linear transformation T, and some proportion of vectors did not. We also added some Gaussian noise to each of L1 and L2, so that this transformation would not be perfect (as it would not be in real data). We varied the proportion of “true” pairs, and the proportion of additive noise, to test how robust these measurements would be in a variety of noise conditions.

Accuracy measurements on this data were made by thresholding scores so that the top p scores are set to 1.0 and the rest to 0.0.5 This is also how we evaluate accuracy during sanity checking, below.

    5 Since the overall task is a ranking task, rather than a classification task, we do not at any point have to set a particular threshold for keeping data; this is a way in which the task at hand is easier than a typical anomaly detection task. We therefore simply use the correct proportion to set the thresholds.

              p      0.1     0.2     0.3     0.4     0.5
  Mahalanobis      0.977   0.976   0.974   0.972   0.972
  Linear           0.944   0.930   0.920   0.914   0.913
  Nonlinear        0.871   0.871   0.897   0.900   0.905

Table 1: Accuracy of distinguishing parallel (i.e., related by a translation matrix T) vs. non-parallel (i.e., random) vectors, from a synthetic dataset of 100,000 pairs of 50-dimensional vectors, plus standard normal additive noise. p represents the proportion of parallel pairs in the dataset.

              σ      1.0     2.0     3.0     4.0     5.0
  Mahalanobis       .974    .778    .665    .617    .597
  Linear            .920    .722    .640    .606    .592
  Nonlinear         .897    .658    .600    .586    .582

Table 2: Accuracy of distinguishing parallel (i.e., related by a translation matrix T) vs. non-parallel (i.e., random) vectors, from a synthetic dataset of 100,000 pairs of 50-dimensional vectors and “true” proportion p = 0.3, with varying degrees of additive noise. σ represents the standard deviation of the additive noise added to each of L1 and L2.
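The synthetic setup above can be sketched as follows. The sizes are reduced from the paper’s 100,000 pairs of 50-dimensional vectors for speed, and the Figure 2 calculation is restated so the sketch is self-contained; these choices are ours.

```python
import numpy as np

def mahalanobis_ratio(L1, L2):
    # Same calculation as Figure 2, restated for self-containment.
    L1 = L1 - L1.mean(axis=0)
    L2 = L2 - L2.mean(axis=0)
    L = np.concatenate([L1, L2], axis=1)
    w, V = np.linalg.eigh(np.cov(L, rowvar=False))
    whitener = V @ np.diag(w ** -0.5) @ V.T
    E1 = L1 @ whitener[:L1.shape[1], :]
    E2 = L2 @ whitener[L1.shape[1]:, :]
    ssq = lambda X: np.sum(X * X, axis=1)
    return ssq(E1 + E2) / (ssq(E1) + ssq(E2))

rng = np.random.default_rng(0)
n, d, p, sigma = 10000, 20, 0.3, 1.0    # reduced from 100,000 x 50

T = rng.normal(size=(d, d))             # "translation" matrix
L1 = rng.normal(size=(n, d))
L2 = rng.normal(size=(n, d))            # non-parallel by default
parallel = np.arange(n) < int(p * n)    # first p*n pairs are parallel
L2[parallel] = L1[parallel] @ T
L1 = L1 + sigma * rng.normal(size=(n, d))   # additive noise on both sides
L2 = L2 + sigma * rng.normal(size=(n, d))

scores = mahalanobis_ratio(L1, L2)      # lower = more "parallel"
# Threshold so that the top p proportion (lowest scores) count as parallel.
cutoff = np.quantile(scores, p)
accuracy = np.mean((scores <= cutoff) == parallel)
```

At these reduced sizes the accuracy will not match Table 1 exactly, but parallel pairs should still receive markedly lower scores than random pairs.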
Table 1 contrasts three systems:

  1. (Mahalanobis) We perform the Mahalanobis ratio calculation described in §3.2.

  2. (Linear) We learn a linear regression between L1 and L2, transform L1 according to the resulting matrix, and measure the cosine similarity between the result and L2.

  3. (Nonlinear) System (2), but instead of a linear regression we construct a simple two-layer perceptron with a ReLU nonlinearity.6

In each condition, the Mahalanobis measurement outperformed the other measurements. It may, of course, be that the conditions of this synthetic data are unlike real data—the relationship between the German and English sentence vectors might, for example, be better approximated with a nonlinear relationship—but, given the comparatively robust performance of the Mahalanobis measurement against a variety of noise conditions, we prioritized our development time to exploring it further.

4.2 Sanity checking

We also annotated about 300 random sentence pairs from the target corpus, according to whether we judged them to be parallel or not. We did not tune any parameters to this set, except to make sure that one hyperparameter, the dimensionality of the sentence vectors, did not lead to a numerical underflow condition as dimensionality increased.

Many of our initial attempts at measuring probabilities (and log probabilities) of sentence draws in higher dimensions (e.g. higher than 50) led to the differences between probabilities being so small that they could not be distinguished by floating-point representations, leading to a situation in which almost all probabilities were equivalent and no meaningful comparisons could be made, and thus to random performance when ranking sentence pairs. Keeping the measurements in terms of distances, and not converting them to probabilities, meant that increasing dimensionality did not lead to indistinguishable measurements again.

Sanity checking (Table 3) confirmed that higher dimensionality does not necessarily lead to poorer discrimination: while 10-dimensional vectors only led to 44.1% accuracy in discriminating parallel from non-parallel pairs, 300-dimensional vectors gave 63.4% accuracy.

  Dimensionality      10      50     100     300
  Accuracy          .441    .548    .483    .634

Table 3: Sanity-checking results on 300 annotated sentences, for the Mahalanobis calculation (§3.2) on 10-, 50-, 100-, and 300-dimensional sentence vectors.

It is unclear why 100-dimensional vectors perform more poorly than both 50- and 300-dimensional vectors, but in any case this dataset only has 300 samples and we do not want to put too much stock in the results. The real purpose of this trial was to determine if the curse of dimensionality affects the Mahalanobis measurement adversely, and it does not appear to do so. We therefore used 300-dimensional vectors in our final submissions.

5 Official Results

Table 4 presents the results of the official evaluation, on seven corpora in four conditions. To help navigate the wall of numbers, keep in mind that we are mostly interested in the top unsupervised system NRC-seve-bicov, and that each table also presents average scores across the seven corpora, in the bottom right corner of each.

In the 100M-word conditions (that is to say, in the conditions where a statistical or neural machine translation system was trained on the top 100M words, as ranked by our filters), we find generally strong performance, with
them to probabilities, did appear to allow fine-                           NRC-seve-bicov always performing above the
grained comparison in higher dimensions, but we                            median system and with most results in the top 10
wanted to ensure that continuing to increase the                           (among 48 submissions).
                                                                              However, we generally observe weaker down-
    6
      We did not expect this to outperform the linear version—             stream MT performance in 10M conditions, com-
after all, there is no actual nonlinearity in the relationship be-
tween L1 and L2—but nonetheless wanted to see how a non-                   pared to other competitors; performing roughly
linear regression would perform in different noise conditions.             near the median system in the NMT 10M con-
We observe, for example, that it does unsurprisingly poorly                dition and frequently below the median in the
when only a low proportion p of sentences are related, a con-
dition in which a linear regression performs comparatively                 SMT 10M condition. This suggests to us that
well.                                                                      the unsupervised systems are adequate in finding

                                                                     905
SMT, 10M-word
                   dev.                                          test
 domain         news          news         speech      laws        medical          news            IT
 corpus         newstest17    newstest18   iwslt17     Acquis      EMEA             GlobalVoices    KDE          average
 top score      23.23 (1)     29.59 (1)    22.16 (1)   21.45 (1)   28.70 (1)        22.67 (1)       25.51 (1)    24.58 (1)
 seve-bicov     19.66 (33)    25.96 (32)   18.64 (35) 18.78 (23) 27.94 (5)          20.05 (28)      21.38 (41)   22.13 (29)
 mono-bicov     19.61 (35)    25.13 (36)   17.86 (39) 16.59 (35) 24.21 (37)         19.97 (34)      22.07 (37)   20.97 (38)
 mono           17.98 (41)    23.49 (41)   16.63 (41) 15.49 (40) 23.09 (40)         18.65 (40)      21.39 (40)   19.79 (41)
                                                    SMT, 100M-word
 top score      25.80 (1)     31.35 (1)    23.17 (1)   22.51 (1)   31.45 (1)        24.00 (1)       26.93 (1)    26.49 (1)
 seve-bicov     25.61 (11)    31.11 (8)    22.84 (10) 22.19 (15) 31.20 (3)          23.67 (10)      26.47 (18)   26.25 (9)
 mono-bicov     25.65 (5)     31.12 (5)    22.84 (10) 22.37 (8)    31.11 (7)        23.75 (7)       26.19 (30)   26.23 (10)
 mono           25.45 (14)    30.63 (21)   22.72 (20) 22.06 (21) 30.74 (20)         23.70 (9)       26.20 (28)   26.01 (19)
                                                      NMT, 10M-word
                   dev.                                           test
  domain        news          news          speech      laws        medical          news           IT
  corpus        newstest17    newstest18    iwslt17     Acquis      EMEA             GlobalVoices   KDE          average
  top score     29.44 (1)     36.04 (1)     25.64 (1)   25.57 (1)   32.72 (1)        26.72 (1)      28.25 (1)    28.62 (1)
  seve-bicov    24.49 (27)    30.32 (27)    21.47 (24) 22.57 (15) 31.71 (2)          23.08 (27)     22.89 (27)   25.34 (21)
  mono-bicov    23.38 (30)    28.86 (32)    19.33 (34) 19.03 (29) 26.45 (32)         22.03 (32)     23.72 (23)   23.07 (30)
  mono          20.83 (35)    24.97 (37)    17.19 (37) 16.57 (38) 23.79 (38)         19.75 (35)     21.85 (31)   20.69 (35)
                                                     NMT, 100M-word
  top score     32.41 (1)     39.85 (1)     27.43 (1)   28.43 (1)   36.72 (1)        29.26 (1)      30.92 (1)    32.06 (1)
  seve-bicov    32.10 (2)     39.39 (7)     27.09 (6)   28.31 (5)   36.30 (10)       28.94 (9)      30.12 (16)   31.69 (8)
  mono-bicov    31.67 (9)     38.86 (15)    27.10 (5)   28.15 (9)   35.96 (15)       28.87 (11)     30.41 (11)   31.56 (11)
  mono          31.39 (16)    38.42 (21)    26.80 (12) 27.94 (12) 35.71 (21)         28.00 (27)     30.32 (14)   31.20 (19)

Table 4: BLEU scores (and ranking, out of 48 submissions) of NRC’s unsupervised submissions: “seve”
indicates single-feature (Mahalanobis ratio) parallelism assessment, “mono” indicates parallelism as-
sessment using an unweighted ensemble of unsupervised features, “bicov” indicates that the final bigram
coverage step (§2.4) was performed. Results in the top 10 performers are bolded.

a 100M word training set7 but relatively poor at                     does NRC-seve-bicov perform notably well
sub-selecting higher-quality sentences from that                     on other out-of-domain corpora in the 10M con-
set. We think this may be because our system                         ditions.
might have a bias towards picking relatively sim-
ilar sentences, rather than the more diverse set of                  6   Future research
sentences that an MT training set needs, which is
amplified in the 10M condition.                                      The unsupervised methods described here seem
   A surprising exception to this weakness is                        promising in distinguishing parallel from non-
the European Medicines Agency (EMEA) cor-                            parallel sentence pairs, but we interpret the 10M-
pus, in which NRC-seve-bicov is the #5                               word results as suggesting they are comparatively
and #2 system in the SMT 10M and NMT                                 poor at distinguishing other MT-relevant features
10M conditions, respectively. This could sug-                        of sentence-pair quality. Considering bigram cov-
gest that competitors are overfitting to the do-                     erage (§2.4) appears to help somewhat, but more
main(s) of the training data, and performing cor-                    research is needed into mitigating the tendency of
respondingly poorly on the out-of-domain EMEA,                       these measurements to prefer an uninteresting se-
whereas NRC-seve-bicov cannot overfit in                             lection of sentences.
this manner. However, the other NRC unsu-                               Also, it is likely that a sentence-vector, even
pervised submissions, which also cannot overfit,                     a high-dimensional one, is not sufficiently fine-
have no special advantage on EMEA, and nor                           grained to choose the highest-quality pairs; the
                                                                     process described in this paper essentially says
    7
      Spot-checking a random sample of sentences suggested           that two sentences with sufficiently similar topics
to us that there were indeed roughly 100M words worth of
genuinely parallel data, but much of it would not have been          are to be considered parallel, even if there is lit-
particularly informative for machine translation. We there-          tle word-level correlation between the sentences.
fore interpret 100M results as representing one’s success at         We therefore intend to investigate a word-level
identifying parallel data, and the 10M results as represent-
ing how well one assesses usefulness-for-MT beyond paral-            analogue of the sentence-level Mahalanobis ratio
lelism.                                                              measurement.
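The floating-point behaviour described in §4.2 is easy to reproduce. The sketch below (our illustration, not the code behind the submissions; the dimensionalities and the random "sentence vectors" are arbitrary stand-ins) draws points from a standard Gaussian and shows that their probability densities collapse to identical zeros at high dimensionality, while their Mahalanobis distances remain finely comparable:

```python
import numpy as np

rng = np.random.default_rng(0)

def density_and_distance(d, n=5):
    """Draw n points from a d-dimensional standard Gaussian and return
    their probability densities and their Mahalanobis distances from the
    mean (with identity covariance, Mahalanobis = Euclidean distance)."""
    x = rng.standard_normal((n, d))
    maha_sq = np.sum(x ** 2, axis=1)                   # squared distances
    log_density = -0.5 * d * np.log(2.0 * np.pi) - 0.5 * maha_sq
    return np.exp(log_density), np.sqrt(maha_sq)

# At moderate dimensionality, densities are tiny but still distinguishable.
dens_50, dist_50 = density_and_distance(50)

# At high dimensionality, every density underflows to exactly 0.0, so
# ranking sentence pairs by probability degenerates to a random ranking.
dens_800, dist_800 = density_and_distance(800)
print(dens_800)       # [0. 0. 0. 0. 0.]

# The distances themselves stay in a comfortable floating-point range and
# remain finely comparable, which is why ranking by distance still works.
print(dist_800)
```

Working in log probabilities only postpones the problem, since the log densities still differ by amounts that are dwarfed by the dimension-dependent constant; keeping raw distances sidesteps it entirely.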

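As background for the Mahalanobis calculation of §3.2, the distance can be computed by whitening (the approach discussed by Boggs, 2014): transform the space so that a reference sample has identity covariance, then take ordinary Euclidean distances. The following is a minimal sketch assuming NumPy; the function name and toy data are ours, and the paper's actual measurement is a ratio of such Mahalanobis measurements rather than a single distance:

```python
import numpy as np

def mahalanobis_distances(samples, reference):
    """Mahalanobis distance of each row of `samples` from the mean of
    `reference`, computed by whitening: map vectors through the inverse
    Cholesky factor of the reference covariance, then measure ordinary
    Euclidean distances in the whitened space."""
    mu = reference.mean(axis=0)
    cov = np.cov(reference, rowvar=False)
    L = np.linalg.cholesky(cov)                     # cov = L @ L.T
    whitened = np.linalg.solve(L, (samples - mu).T).T
    return np.linalg.norm(whitened, axis=1)

# Toy reference set with correlated dimensions, standing in for the
# concatenated source/target sentence vectors used in the paper.
rng = np.random.default_rng(0)
reference = rng.standard_normal((1000, 4))
reference[:, 1] = 0.8 * reference[:, 0] + 0.6 * reference[:, 1]

# Low distance = typical (parallel-looking) pair; high distance = anomaly.
print(mahalanobis_distances(reference[:2], reference))
```

Unlike raw Euclidean distance, this measurement discounts directions in which the reference data naturally varies, which is what makes it usable as an anomaly score over sentence-pair vectors.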
References

Thomas Boggs. 2014. Whitening characteristics of the Mahalanobis distance. http://blog.bogatron.net/blog/2014/03/11/mahalanobis-whitening/.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, WMT '11, pages 187–197, Stroudsburg, PA, USA. Association for Computational Linguistics.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. MT Summit 2005.

Philipp Koehn, Kenneth Heafield, Mikel L. Forcada, Miquel Esplà-Gomis, Sergio Ortiz-Rojas, Gema Ramírez Sánchez, Víctor M. Sánchez Cartagena, Barry Haddow, Marta Bañón, Marek Střelec, Anna Samiotou, and Amir Kamran. 2018a. ParaCrawl corpus version 1.0. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.

Philipp Koehn, Huda Khayrallah, Kenneth Heafield, and Mikel Forcada. 2018b. Findings of the WMT 2018 shared task on parallel corpus filtering. In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers, Brussels, Belgium. Association for Computational Linguistics.

Chi-kiu Lo, Michel Simard, Darlene Stewart, Samuel Larkin, Cyril Goutte, and Patrick Littell. 2018. Accurate semantic textual similarity for cleaning noisy parallel corpora using semantic machine translation evaluation metric: The NRC supervised submissions to the parallel corpus filtering task. In Proceedings of the Third Conference on Machine Translation (WMT 2018).

Prasanta Chandra Mahalanobis. 1936. On the generalised distance in statistics. Proceedings of the National Institute of Sciences of India, 2:49–55.

Joel Martin, Howard Johnson, Benoît Farley, and Anna Maclachlan. 2003. Aligning and using an English-Inuktitut parallel corpus. In Proceedings of the HLT-NAACL 2003 Workshop on Building and Using Parallel Texts: Data Driven Machine Translation and Beyond, Volume 3, pages 115–118. Association for Computational Linguistics.

Matteo Pagliardini, Prakhar Gupta, and Martin Jaggi. 2018. Unsupervised learning of sentence embeddings using compositional n-gram features. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 528–540. Association for Computational Linguistics.

Irving S. Reed and Xiaoli Yu. 1990. Adaptive multiple-band CFAR detection of an optical pattern with unknown spectral distribution. IEEE Transactions on Acoustics, Speech, and Signal Processing, 38(10):1760–1770.

Jörg Tiedemann. 2009. News from OPUS: A collection of multilingual parallel corpora with tools and interfaces. In N. Nicolov, K. Bontcheva, G. Angelova, and R. Mitkov, editors, Recent Advances in Natural Language Processing, volume V, pages 237–248. John Benjamins, Amsterdam/Philadelphia, Borovets, Bulgaria.

Rik Warren, Robert F. Smith, and Anne K. Cybenko. 2011. Use of Mahalanobis distance for detecting outliers and outlier clusters in markedly non-normal data: A vehicular traffic example. Technical report, SRA International Inc., Dayton, OH.