Scalable Cross Lingual Pivots to Model Pronoun Gender for Translation


Kellie Webster and Emily Pitler
Google Research
{websterk|epitler}@google.com

Abstract

Machine translation systems with inadequate document understanding can make errors when translating dropped or neutral pronouns into languages with gendered pronouns (e.g., English). Predicting the underlying gender of these pronouns is difficult since it is not marked textually and must instead be inferred from coreferent mentions in the context. We propose a novel cross-lingual pivoting technique for automatically producing high-quality gender labels, and show that this data can be used to fine-tune a BERT classifier with 92% F1 for Spanish dropped feminine pronouns, compared with 30-51% for neural machine translation models and 54-71% for a non-fine-tuned BERT model. We augment a neural machine translation model with labels from our classifier to improve pronoun translation, while still having parallelizable translation models that translate a sentence at a time.

Spanish: Britney Jean Spears... ∅ Adquirió fama durante su niñez al participar en el programa de televisión The Mickey Mouse Club (1992).

Spanish→English: He gained fame during his childhood by participating in the television program The Mickey Mouse Club (1992).

Figure 1: Typical example of pronoun gender errors when translating from pronouns without textual gender marking (e.g., dropped or possessive pronouns in Spanish) to a language with gendered pronouns.
                                                                                             prediction tested multiple language pairs and
1 Introduction

While neural modeling has solved many challenges in machine translation, a number remain, notably document-level consistency. Since standard architectures are sentence-based, they can misrepresent pronoun gender since cues for this are often extra-sentential (Läubli et al., 2018). For example, in Figure 1,¹ there is a dropped subject and a neutral possessive pronoun in the Spanish, which are both translated with masculine pronouns in English despite context indicating they refer to Britney Jean Spears. Recommendations for the use of discourse in translation are developed in Sim Smith (2017).

Such errors are worrying for their prevalence. An oracle study found that augmenting an NMT system with perfect pronoun information could give up to 6 BLEU points improvement (Stojanovski and Fraser, 2018). Further, the DiscoMT shared task on pronoun prediction tested multiple language pairs and found that the most difficult language pair by far was Spanish→English and that “[t]his result reflects the difficulty of translating null subjects” (Loáiciga et al., 2017).

We address document-level modeling for dropped and neutral personal pronouns in Spanish. We significantly improve the ability of a strong sentence-level translator to produce correctly gendered pronouns in three stages. First, we automatically collect high-quality pronoun gender-tagged examples at scale using a novel cross-lingual pivoting technique (Section 3). This technique easily scales to produce large amounts of data without explicit human labeling, while remaining largely language agnostic. This means our technique may also be easily scaled to produce data for languages other than Spanish; we present results on Spanish as a proof of concept.

¹ Source: https://es.wikipedia.org/wiki/Britney_Spears, Translation: Retrieved from translate.google.com, 27 Feb 2019.
Next, we show how to fine-tune BERT (Devlin et al., 2019), a state-of-the-art pretrained language model, to classify pronoun gender using our new dataset (Section 4). Finally, we add context tags from our BERT model into standard MT sentence input, yielding an 8.8% F1 improvement in feminine pronoun generation (Section 5).

2 Background

In this section, we overview particular nuances associated with pronoun understanding, the resulting challenges for translation, and existing relevant datasets.

Dropped (Implicit) Pronouns. Spanish personal pronouns have less textual marking of gender than their English counterparts. Where they may be inferred from context, Spanish drops the subject pronouns él/ella (he/she). Several other languages, such as Chinese, Japanese, Greek, and Italian, also have dropped pronouns.

Possessive Pronouns. Additionally, Spanish possessive pronouns do not mark the gender of the possessor, with su(s) used where English uses his and her. In other Romance languages such as French, the possessive pronoun agrees with the gender of the possessed object, rather than the possessor as in English. These usages are summarized for Spanish in Table 1 and illustrated in Figure 1.

  Properties           Surface Form (En)   Surface Form (Es)
  Nominative, Masc.    he                  él or ∅
  Nominative, Fem.     she                 ella or ∅
  Possessive, Masc.    his                 su
  Possessive, Fem.     her                 su

Table 1: Masculine and feminine pronouns cannot be distinguished by their surface forms in Spanish in both the dropped and possessive cases.

MT of Ambiguous Pronouns. Document-level consistency is an outstanding challenge for machine translation (Läubli et al., 2018). One reason is that standard techniques translate sentences independently of one another, meaning that context outside a given sentence cannot be used to make translation decisions. In a language like Spanish, this means that critical cues for pronoun gender are simply not present at decoding time.

There have been a number of directions exploring how to introduce general contextual awareness into translation. Tiedemann and Scherrer (2017) investigated a simple technique of providing multiple sentences, concatenated together, to a translator and learning a model to translate the final sentence in the stream using the preceding one(s) for contextualization. Unfortunately, no substantial performance gain was seen. Next, the Transformer model (Vaswani et al., 2017) was extended by introducing a second, context decoder (Wang et al., 2017; Zhang et al., 2018). Complementary to this, Kuang et al. (2018) introduce two caches, one lexical and the other topical, to store information from context pertinent to translating future sentences.

Specific to the translation of implicit or dropped pronouns, Wang et al. (2018) show that modeling pronoun prediction jointly with translation improves quality, while Miculicich Werlen and Popescu-Belis (2017a) present a proof of concept where coreference links are available at decoding time. Most similar to our current work, Habash et al. (2019), Moryossef et al. (2019), and Vanmassenhove et al. (2018) improve gender fidelity in translations of first-person singular pronouns in languages with grammatical gender inflection in this position. They show promising results from either injecting explicit information about speaker gender, or predicting it with a classifier. We extend this line of work without requiring human labels to train our classifier.

Prior Work on Aligned Corpora. Translation of sentences has been a task at the Workshop on Machine Translation (WMT) since 2006. Over this period, more and larger datasets have been released for training statistical models. Typically, released data is given in sentence-pair form, meaning that for each source-language sentence (e.g., Spanish), there is a target-language sentence (e.g., English) which may be considered its true translation. For a discussion of the adequacy of this approach, see Guillou and Hardmeier (2018).
(A) Page Alignment
Title: "Mitsuko Shiga"
https://en.wikipedia.org/wiki/Mitsuko_Shiga
https://es.wikipedia.org/wiki/Mitsuko_Shiga

(B) Sentence Alignment
Spanish: Publicó numerosas antologías de su poesía durante su vida, incluyendo Fuji no Mi, Asa Tsuki, Asa Ginu, y Kamakura Zakki.
Spanish→English: He published numerous anthologies of his poetry during his life, including Fuji no Mi, Asa Tsuki, Asa Ginu, and Kamakura Zakki.
English: She published numerous anthologies of her poetry during her lifetime, including Fuji no Mi ("Wisteria Beans"), Asa Tsuki ("Morning Moon"), Asa Ginu ("Linen Silk"), and Kamakura Zakki ("Kamakura Miscellany").

(C) Pronoun Tagging
∅/She Publicó numerosas antologías de su/her poesía durante su/her vida, incluyendo Fuji no Mi, Asa Tsuki, Asa Ginu, y Kamakura Zakki.

Table 2: Example of how pronoun translations are extracted from comparable texts. The English and Spanish sentences were extracted from Wikipedia articles about the same subject, the sentences were aligned due to the similarity of their translations, and the labels were extracted by matching the Spanish dropped and ambiguous pronouns with the English pronouns.

Translation of Spanish sentences to English was most recently contested at WMT'13² (Bojar et al., 2013), which offered participants 14,980,513 sentence pairs from Europarl³ (Habash et al., 2017), Common Crawl (Smith et al., 2013), the United Nations corpus⁴ (Ziemski et al., 2016), and news commentary. While large, this dataset wasn't designed for document-level understanding. Document boundaries are available, but correspond loosely to what a user of a translation system might consider a document: whole days of European parliament, for instance.

DiscoMT was introduced as a workshop in 2013⁵ to target the translation of pronouns in document context. As well as Europarl, TED talk subtitles⁶ (Tiedemann, 2012) were included for both training and evaluation. Also available for the translation of extended speech is OpenSubtitles⁷ (Lison and Tiedemann, 2016; Müller et al., 2018), which contains utterance pairs which are aligned and contextualized according to their temporal appearance in a film. Although subtitle datasets make very wide context available, they are not ideal for this study due to their heavy use of first and second person pronouns and noisy alignment. High-quality targeted datasets for discourse phenomena were explored in Bawden et al. (2018); however, the released corpus was a small French→English test set. In this paper, we introduce a novel cross-lingual technique for gathering a large-scale targeted dataset without explicit human annotation.

² http://www.statmt.org/wmt13/translation-task.html
³ http://www.statmt.org/europarl/
⁴ https://cms.unov.org/UNCorpus/
⁵ http://aclweb.org/anthology/W13-3300
⁶ http://opus.nlpl.eu/TED2013.php
⁷ http://opus.nlpl.eu/OpenSubtitles.php

3 Cross-Lingual Pivoting for Pronoun Gender Data Creation

Despite document-level context being critical for pronoun translation quality, little of the available aligned data is directly suitable for training models. In this section, we exploit the special properties of Wikipedia to automate the collection of high-quality gender-tagged pronouns in document context.

3.1 Cross-Lingual Pivoting

Wikipedia has a number of desirable properties which we exploit to automatically tag pronoun gender. Firstly, it is very large for head languages like English and Spanish. More importantly, many non-English Wikipedia pages express similar content to their English counterpart, given the resource's style and formatting requirements and the ease of using translation systems to generate base text for a missing non-English page.
                              Prodrop                          Possessive
                      Articles       he      she       Articles      his      her
 Spanish Wikipedia   1,326,469        -        -      1,326,469        -        -
 Page Alignment        420,850        -        -        420,850        -        -
 Pronoun Tagging        45,210  118,911   41,524         52,559  266,886  111,411

Table 3: Extraction yield in each stage of the multi-lingual pivot extraction pipeline.

Based on these properties, we develop the following three-stage extraction pipeline, illustrated in Table 2, with pipeline yield summarized in Table 3.

Page Alignment. Identify as pairs those pages with the same title in English and Spanish. We prioritize precision and require an exact title match.
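To make this stage concrete, the following minimal Python sketch pairs pages by exact title match; the per-language dictionaries (title to page text) are assumed to have been extracted from Wikipedia dumps beforehand, and all names are illustrative rather than from the paper's codebase.

def align_pages(en_pages, es_pages):
    """Pair English and Spanish Wikipedia pages whose titles match exactly.

    en_pages, es_pages: dicts mapping page title -> page text.
    Returns a list of (english_text, spanish_text) pairs.
    """
    shared_titles = en_pages.keys() & es_pages.keys()  # exact match only
    return [(en_pages[t], es_pages[t]) for t in sorted(shared_titles)]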
Sentence Alignment. Identify sentence pairs in aligned pages which express nearly the same meaning. To do this, we compare English translations of sentences⁸ from the Spanish page to sentences from the English page. We do bipartite matching over these English sentence pairs, selecting pairings with the smallest edit distance. We require that the edit distance is at most half the sentence length, that paired sentences share either a noun or verb, and that the mapping is at most one-to-one. Table 2, step (B) shows an aligned sentence pair.
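The matching can be sketched as below. Edit distance is computed at the token level, and a greedy best-first selection stands in for full bipartite matching; translated_es holds English translations of the Spanish sentences, and content_words is a stand-in for a POS tagger returning a sentence's nouns and verbs (all names are illustrative).

def edit_distance(a, b):
    """Token-level Levenshtein distance via the classic DP recurrence."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, 1):
        curr = [i]
        for j, tok_b in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                      # deletion
                            curr[j - 1] + 1,                  # insertion
                            prev[j - 1] + (tok_a != tok_b)))  # substitution
        prev = curr
    return prev[-1]

def align_sentences(translated_es, en_sents, content_words):
    """Greedily select one-to-one sentence pairs with smallest edit distance."""
    candidates = []
    for i, src in enumerate(translated_es):
        for j, tgt in enumerate(en_sents):
            dist = edit_distance(src, tgt)
            if dist > max(len(src), len(tgt)) / 2:   # at most half the length
                continue
            if not (content_words(src) & content_words(tgt)):  # share a noun/verb
                continue
            candidates.append((dist, i, j))
    used_src, used_tgt, pairs = set(), set(), []
    for dist, i, j in sorted(candidates):            # best-scoring pairs first
        if i not in used_src and j not in used_tgt:  # enforce one-to-one mapping
            pairs.append((i, j))
            used_src.add(i)
            used_tgt.add(j)
    return pairs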
Pronoun Tagging. Perform alignment⁹ over the tokens in detected sentence pairs, identifying where English uses a gendered pronoun and Spanish has either a dropped or possessive pronoun. Copy the gender of the English pronoun as a label onto the ambiguous target pronoun: masculine for he and his, feminine for she and her. Step (C) in Table 2 shows the resulting pronoun tagging in the example.
3.2 Final Dataset

Gender. We can see from Table 3 that approximately three times as many masculine examples are extracted as feminine examples. This is not surprising given the disparity in representation by gender found in Wikipedia (Wagner et al., 2016). For the final dataset, we down-sample masculine examples to achieve a 1:1 gender balance, yielding a final dataset with 79,240 prodrop and 187,224 possessive examples. We split the 79,240 prodrop examples into training, development, and test sets of size 72,120, 2,968, and 4,152, respectively; the 187,224 possessive examples are similarly split into sets of size 167,222, 8,862, and 11,140.
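The down-sampling step is straightforward; a minimal sketch, assuming each example carries a gender field:

import random

def balance_genders(examples, seed=0):
    """Down-sample masculine examples to a 1:1 ratio with feminine ones."""
    fem = [e for e in examples if e["gender"] == "feminine"]
    masc = [e for e in examples if e["gender"] == "masculine"]
    random.Random(seed).shuffle(masc)
    return fem + masc[:len(fem)]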
Quality. To assess the quality of our examples, we sent 993 examples for human rating. In-house bilingual speakers were shown a paragraph of Spanish text containing a labeled pronoun and asked whether, given the context, the pronoun referred to a masculine person or a feminine person, or whether it was unclear. Agreement between these human labels and those produced by our automatic pipeline is high: 84% of prodrop and 80% of possessive pronouns could be disambiguated with the provided context and, of those, 98% and 92%, respectively, of automatically generated gender tags agreed with human judgment. Where the provided paragraph was not enough, it was typically the case that the relevant context was in the directly preceding paragraph. Confusion matrices are given in Tables 4 and 5.

                     Human
 Pipeline      he    she   unclear
       he     221      3        36
       she      7    179        43

Table 4: Agreement between automatic labels and human judgment for prodrop examples.

                     Human
 Pipeline     his    her   unclear
      his     198      6        48
      her      25    179        55

Table 5: Agreement between automatic labels and human judgment for possessive pronoun examples.

⁸ From an in-house tool.
⁹ We use an implementation of Vogel et al. (1996).
                                               Masculine             Feminine
                                             P     R     F1       P     R     F1
 Prodrop       Baseline MT (Sentences)     56.2  81.3  66.5     91.3  18.6  30.9
               Baseline MT (Contexts)      62.4  60.5  61.4     93.0  25.1  39.5
               Context MT (Contexts)       65.1  65.3  65.2     88.9  35.8  51.0
               BERT (Sentences)            66.6  61.6  64.0     87.0  40.2  54.9
               BERT (Contexts)             81.6  58.5  68.1     92.6  58.5  71.7
               BERT (Contexts) + Data      90.4  95.2  92.7     95.0  89.8  92.3
 Possessives   Baseline MT (Sentences)     58.2  77.8  66.6     86.7  30.4  45.0
               Baseline MT (Contexts)      63.6  63.4  63.5     87.2  35.0  50.0
               Context MT (Contexts)       64.2  67.6  65.9     86.4  38.0  52.8
               BERT (Contexts) + Data      87.4  91.3  89.3     90.9  86.8  88.8

Table 6: Intrinsic evaluation of pronoun gender classification over our new test sets.

        Count   Prodrop    Rate
 he     1,220     1,106   90.7%
 she      314       251   79.9%

Table 7: Even accounting for differences in the raw number of pronouns by gender, prodrop is more likely for masculine entities than feminine entities in WMT'13 (development).

4 Classifying Pronoun Gender

Language model (LM) pretraining (Peters et al., 2018; Devlin et al., 2019) is emerging as an essential technique for human-quality language understanding. However, LMs are limited to learning from what is explicitly expressed in text, which seems insufficient for Spanish given that gender is often not textually realized for pronouns. We show that BERT, a state-of-the-art pretrained LM (Devlin et al., 2019), already models some aspects of pronoun gender, achieving scores above a strong NMT baseline. We then show that fine-tuning this LM using our new data improves pronoun gender discrimination by over 20% absolute F1 for both genders.

4.1 Neural Machine Translation

We take as our NMT baseline a Transformer (Vaswani et al., 2017) with vocabulary size 32,000, trained for up to 250k steps. We set the input dimension to 512 and use 6 layers with 8 attention heads, hidden dimension 2048, dropout probabilities of 0.1, and learning rate 2.0. We train this model on WMT'13 Spanish→English data,¹⁰ preprocessed using the Moses toolkit.¹¹ Rows designated Baseline in Table 6 are trained with the standard sentence-level formulation of the task and yield a best test set BLEU score (Papineni et al., 2002) of 34.02. Context MT is trained using the 2+1 strategy of Tiedemann and Scherrer (2017), where an extended context, in our case full sentences up to 128 subtokens,¹² is used to predict the translation of the contextualized target sentence. This modeling yields 33.99 BLEU, which is not substantially different from Baseline MT, consistent with Tiedemann and Scherrer.

To evaluate how well these NMT systems model pronoun gender, we use them to translate the Spanish text from our new dataset into English. For each target Spanish pronoun, we use an implementation of Vogel et al. (1996) to find the corresponding English pronoun and count the translation as correct if the gender tag on the Spanish agrees with the English pronoun gender. Table 6 shows the results when the input units are single sentences (Sentences) or full sentences up to 128 subtokens (Contexts). All context sentences precede the one containing the pronoun (no lookahead).

¹⁰ http://www.statmt.org/wmt13/translation-task.html
¹¹ https://github.com/moses-smt/mosesdecoder
¹² Given this very local context, we make the assumption that adjacent sentences in the pre-shuffled dataset are document-continuous. This assumption introduces noise at document boundaries, which occur much less frequently than document continuation transitions.
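The counting behind Table 6 can be sketched as follows, where translate wraps one of the NMT systems and word_align stands in for the Vogel et al. (1996) style aligner (both names hypothetical):

MASCULINE, FEMININE = {"he", "his"}, {"she", "her"}

def pronoun_translation_correct(es_tokens, target_index, gold_gender,
                                translate, word_align):
    """Check whether the translated pronoun matches the gold gender tag."""
    en_tokens = translate(es_tokens)
    alignment = word_align(es_tokens, en_tokens)  # Spanish idx -> English idx
    en_i = alignment.get(target_index)
    if en_i is None:
        return False                  # no aligned English pronoun produced
    produced = en_tokens[en_i].lower()
    if gold_gender == "masculine":
        return produced in MASCULINE
    return produced in FEMININE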
Performance is weak overall, but especially on feminine examples of prodrop. The tendency is for masculine pronouns to be overused: feminine pronouns are used by Baseline MT (Sentences) less than one time in five when they should be. To understand why this effect is so strong, we analyzed the development set of WMT'13 (Table 7). Specifically, we counted the number of times the English pronouns he and she occurred (first column). Not surprisingly, there was an over-representation of masculine pronouns; for each feminine example in the data, there were four masculine examples. Next, we looked in the aligned Spanish sentences for the corresponding pronouns él and ella and counted a case of prodrop each time they were missing (second column). Even accounting for differences in raw frequency by gender, prodrop is more frequent for masculine entities than feminine ones. This is the first time this difference has been reported, and future work should control for the prior it induces.
Introducing context in the input to an MT model, whether trained on sentences or contexts, improves overall performance by enhancing feminine recall and masculine precision at the expense of masculine recall. One possible reason is that the added context includes the relevant cues for breaking out of the prior for masculine pronouns. The magnitude of the change in F1 score is perhaps beyond what might be expected given that Baseline MT and Context MT achieve similar BLEU scores. However, this finding builds on the arguments in Cífka and Bojar (2018), which show that BLEU score does not correlate perfectly with semantics.
4.2 Language Model Pretraining

We initialize our LM experiments from a BERTBASE model (110M parameters; see Devlin et al. (2019)) trained over Wikipedia from 102 languages including Spanish. To assess how much this model knows about pronoun gender out-of-the-box, we extend the masked LM approach from the original BERT work, inserting a mask token in each dropped subject position (see Figure 2 for an example).¹³ To generate gender predictions, we deduce masculine if él appears before ella in the list of mask fills and vice versa; we make no gender prediction where neither pronoun appears in the top ten mask fills. The results of this probing are shown in the BERT rows of Table 6, which use input feeds of one sentence (Sentences) and up to 128 subtokens (Contexts).

Despite scanning the full top ten mask fills, we can see that the strength of BERT is in its strong precision rather than boosts to recall. While not directly comparable to the MT values, which are generated single-shot, we can see that the BERT values are stronger, particularly for feminine examples, which are handled poorly by machine translation.

Input: Britney Jean Spears (McComb, Misisipi; 2 de diciembre de 1981), conocida como Britney Spears, es una cantante, bailarina, compositora, modelo, actriz, diseñadora de modas y empresaria estadounidense. Comenzó a actuar desde niña, a través de papeles en producciones teatrales. *** adquirió fama durante su niñez al participar en el programa de televisión The Mickey Mouse Club (1992).
MaskLM Output: spears (0.61), britney (0.23), tambien (0.06), ella (0.03), rapidamente (0.01)...
Prediction: Feminine

Figure 2: Example of how the BERT pretrained model (Devlin et al., 2019) is evaluated on gendered pronoun prediction. The MaskLM predicts a word for the wildcard position. The feminine ella is found near the top of the k-best list, so the prediction is Feminine.

¹³ This is applied to prodrop only. No analogous probing may be done for the neutral Spanish possessive pronouns without potentially semantic-breaking re-writing.
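A minimal sketch of this probe, using the public HuggingFace multilingual BERT checkpoint as a stand-in for the paper's 102-language model (the *** marker and the top-k convention follow Figure 2):

import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertForMaskedLM.from_pretrained("bert-base-multilingual-cased")
model.eval()

def predict_dropped_pronoun_gender(context, k=10):
    """Mask the dropped-subject position (marked ***) and deduce gender
    from whichever of él/ella ranks higher in the top-k fills; return
    None when neither appears (no prediction is made)."""
    text = context.replace("***", tokenizer.mask_token)
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_index.item()]
    fills = tokenizer.convert_ids_to_tokens(logits.topk(k).indices.tolist())
    for fill in (f.lower() for f in fills):   # first hit decides
        if fill == "él":
            return "masculine"
        if fill == "ella":
            return "feminine"
    return None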
4.3 BERT Classification Task

We extend this understanding of pronoun gender using the fine-tuning recipe given in the original BERT paper. Specifically, we generate classification examples where the output is a gender tag (masculine or feminine) and the input text is up to 128 subtokens, including special tokens to mark the target pronoun position (as in Table 8). We train over both the prodrop and possessive examples from the training portion of our new dataset.

Table 6 shows that this approach is extraordinarily powerful for modeling pronoun gender in our dataset. We note that the performance we see here is remarkably similar to the human ratings of quality for the dataset.
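A sketch of how such an example might be built; the marker glyph for the pronoun-position tokens is not legible in this copy of the paper, so <p> ... </p> below is purely illustrative, and the recipe is otherwise the standard BERT single-sentence classification setup.

def build_classification_example(tokens, target_index, gender, window=100):
    """Wrap the target pronoun position in marker tokens and keep the
    preceding context (a rough stand-in for the 128-subtoken budget).
    For prodrop, target_index points at the verb whose subject dropped."""
    marked = (tokens[:target_index]
              + ["<p>", tokens[target_index], "</p>"]
              + tokens[target_index + 1:])
    start = max(0, target_index - window)
    return {"text": " ".join(marked[start:]), "label": gender}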
(1)  Comenzó a actuar desde niña, a través de papeles en producciones teatrales.
     S/he started to act as a child.FEM, through roles in theatrical productions.
     Adquirió fama durante su niñez.
     S/he acquired fame in his/her childhood.

        Before Fine-tuning                After Fine-tuning
        Class     Token     Weight        Class    Token      Weight
        *Masc.    durante     0.01        Fem.     niña         0.29
                  traves     -0.01                 Comenzo      -0.10
                  actuar     -0.01                 niñez         0.09
                  Comenzo     0.01                 papeles      -0.01

Table 8: LIME analysis for a representative Spanish text in which all third-person singular pronouns are either dropped or genderless. An English gloss, including grammatical gender marking, is given above.

On all metrics and in both genders, we see leaps of 20 points and higher. There is minimal performance difference across the two genders; precision and recall do not need to trade off against one another. Our novel method, which allows implicit pronoun gender to be made explicit and amenable to text-level modeling, is very well suited to pronoun gender classification.
4.4 Analysis

We use the LIME toolkit¹⁴ (Ribeiro et al., 2016) to understand how the fine-tuning recipe yields such a strong model of pronoun gender. LIME is built for interpretability, learning linear models which approximate heavy-weight architectures over features which are intuitive to humans. By analyzing the highly-weighted features in these approximations, we can learn what factors are important in the classification decisions of the heavy-weight models.

¹⁴ https://github.com/marcotcr/lime
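A sketch of the probe with the released LIME package, where predict_proba is assumed to wrap the fine-tuned classifier, mapping a list of texts to rows of [P(masculine), P(feminine)]:

from lime.lime_text import LimeTextExplainer

explainer = LimeTextExplainer(class_names=["Masc.", "Fem."])

def top_weighted_tokens(text, predict_proba, k=4):
    """Fit LIME's local linear approximation around one input and return
    the k most influential (token, weight) pairs, as in Table 8."""
    explanation = explainer.explain_instance(text, predict_proba,
                                             num_features=k)
    return explanation.as_list()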
Table 8 shows the top tokens in the LIME analysis of our gender classifier, both at the start and end of fine-tuning, for a representative Spanish text, shown with its English gloss. Before fine-tuning, the classification decision is incorrect, and importance is low and spread fairly evenly across all terms in context. After fine-tuning, a correct classification is made, primarily with reference to the niña token, which is the feminine form of child and refers to a young Britney Spears, the dropped subject of Adquirió. That is, just as the original Transformer model implicitly models coreference for translation (Voita et al., 2018; Webster et al., 2018), a fine-tuned BERT classifier appears to implicitly learn coreference to make gender decisions here.

5 Improving Machine Translation with Classification Tags

Given the strong intrinsic evaluation, we use the gender classification tags directly in a standard NMT architecture to introduce implicit contextual awareness while still having translation parallelizable across sentences. This follows precedent set in Johnson et al. (2017) and Zhou et al. (2019) for controlling translations via tag-based annotations. Our simple approach yields substantial improvements in pronoun translation quality.

Baseline MT + Gender Tags. We use the Transformer model specification above but train with WMT'13 sentence pairs augmented with gender tags. For each Spanish sentence, we classify whether it contains a third-person singular dropped subject or possessive pronoun using heuristics over morphological analysis, part-of-speech tags, and syntactic dependency structure (Andor et al., 2016).¹⁵ For those containing a target pronoun, we concatenate the gender tag predicted by BERT (Contexts) + Data after the sentence,¹⁶ following a special context token. For example:

(2) Adquirió fama durante su niñez al participar en el programa de televisión The Mickey Mouse Club (1992).

¹⁵ Dropped pronouns occur where a ROOT verb has no nsubj dependent and no noun to its left; possessive pronouns are tagged with a label ending in $.
¹⁶ To preserve positional encoding.
 Model               Translation   Classifier            Masculine             Feminine
                     Input         Input       BLEU    P     R     F1       P     R     F1
 Baseline MT         Sentences     -           34.02  95.2  97.1  96.2    69.7  57.5  63.0
   + Gender Tags     Sentences     Contexts    34.12  96.8  96.2  96.5    70.0  73.7  71.8
 Context MT          Contexts      -           33.99  97.6  94.7  96.1    63.2  80.0  70.6

Table 9: BLEU score and pronoun accuracy by gender on the WMT'13 Spanish-to-English test set.

For robustness, we introduce noise by randomly flipping whether the sentence contains a target pronoun 2% of the time¹⁷ and generating a random gender tag 5% of the time.

We choose this strategy as a simple extension of WMT which is fit for our purposes. Specifically, we seek to show that our gender classification model captures implicit pronoun gender well and that its understanding is useful for the translation of ambiguous pronouns.

¹⁷ In which case, the input contains no special context token.
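A sketch of this input augmentation, with noise rates as in the text; the <S> marker is illustrative, since the special context token's glyph is not legible in this copy of the paper:

import random

rng = random.Random(0)

def augment_source(source, has_target_pronoun, predicted_tag):
    """Append the classifier's gender tag after a special context token.

    With probability 0.02, flip whether the sentence is treated as
    containing a target pronoun; with probability 0.05, replace the
    predicted tag with a random one.
    """
    if rng.random() < 0.02:
        has_target_pronoun = not has_target_pronoun
    if not has_target_pronoun:
        return source                    # no special token appended
    tag = predicted_tag
    if rng.random() < 0.05:
        tag = rng.choice(["masculine", "feminine"])
    return f"{source} <S> {tag}"         # appended to preserve positions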
Results. We evaluate translation performance using BLEU score and pronoun generation quality using F1 score disaggregated by gender over gendered pronouns. We do not report the APT score¹⁸ (Miculicich Werlen and Popescu-Belis, 2017b) since we target few pronouns and make conclusions at the level of gender, not pronoun.

¹⁸ https://github.com/idiap/APT
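For concreteness, the disaggregated scoring can be computed as below, where each scored pronoun carries a gold and a produced gender label (field names illustrative):

def f1_by_gender(examples):
    """Precision, recall, and F1 per gender over gendered pronouns."""
    scores = {}
    for g in ("masculine", "feminine"):
        tp = sum(e["gold"] == g and e["produced"] == g for e in examples)
        fp = sum(e["gold"] != g and e["produced"] == g for e in examples)
        fn = sum(e["gold"] == g and e["produced"] != g for e in examples)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[g] = {"P": precision, "R": recall, "F1": f1}
    return scores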
Table 9 shows that incorporating contextual awareness, either from concatenating full sentences or from gender tags in the translation input, improves pronoun quality markedly. Masculine performance is uniformly strong, and our new model, Baseline MT + Gender Tags, gives the best performance on feminine pronouns, achieving an 8.8% F1 score improvement compared to Baseline MT. As for the contextual MT models analyzed in the previous section, the improvement comes from addressing the low recall over feminine pronouns in sentence-level translation (while retaining precision). We manually reviewed error cases and noted that the major class was due to domain shift: where antecedents were close to pronouns in Wikipedia, there were more frequent cases in WMT of antecedents being outside the BERT context window, and these were misclassified. Future work could look at methods for learning pretrained language models from extended document context.

6 Conclusion

In this paper, we introduce a cross-lingual pivoting technique which generates large volumes of labeled data for pronoun classification without explicit human annotation. We demonstrate improvements on translating pronouns from Spanish to English using a BERT-based classifier fine-tuned on this data, which achieves an F1 of 92% for feminine pronouns, compared with 30-51% for neural machine translation models and 54-71% for non-fine-tuned BERT. We show that our classifier can be incorporated into a standard sentence-level neural machine translation system to yield an 8.8% F1 score improvement in feminine pronoun generation.

Our data creation method is largely language-agnostic and may be extended in future work to new language pairs, or even different grammatical features. Further in this direction, we are excited by the prospect of zero-shot pronoun gender classification, particularly in low-resource languages.

Acknowledgements

We would like to thank Melvin Johnson, Romina Stella, and Macduff Hughes for assistance with the machine translation work, as well as Mike Collins, Slav Petrov, and other members of the Google Research team for their feedback and suggestions on earlier versions of this manuscript.

References

Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. 2016. Globally normalized transition-based neural networks. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2442-2452, Berlin, Germany. Association for Computational Linguistics.
Rachel Bawden, Rico Sennrich, Alexandra Birch, and Barry Haddow. 2018. Evaluating discourse phenomena in neural machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1304-1313, New Orleans, Louisiana. Association for Computational Linguistics.

Ondřej Bojar, Christian Buck, Chris Callison-Burch, Christian Federmann, Barry Haddow, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2013. Findings of the 2013 Workshop on Statistical Machine Translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 1-44, Sofia, Bulgaria. Association for Computational Linguistics.

Ondřej Cífka and Ondřej Bojar. 2018. Are BLEU and meaning representation in opposition? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1362-1371, Melbourne, Australia. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186.

Liane Guillou and Christian Hardmeier. 2018. Automatic reference-based evaluation of pronoun translation misses the point. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4797-4802, Brussels, Belgium. Association for Computational Linguistics.

Nizar Habash, Houda Bouamor, and Christine Chung. 2019. Automatic gender identification and reinflection in Arabic. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pages 155-165, Florence, Italy. Association for Computational Linguistics.

Nizar Habash, Nasser Zalmout, Dima Taji, Hieu Hoang, and Maverick Alzate. 2017. A parallel corpus for evaluating machine translation between Arabic and European languages. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 235-241, Valencia, Spain. Association for Computational Linguistics.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339-351.

Shaohui Kuang, Deyi Xiong, Weihua Luo, and Guodong Zhou. 2018. Modeling coherence for neural machine translation with dynamic and topic caches. In Proceedings of the 27th International Conference on Computational Linguistics, pages 596-606, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Samuel Läubli, Rico Sennrich, and Martin Volk. 2018. Has machine translation achieved human parity? A case for document-level evaluation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4791-4796.

Pierre Lison and Jörg Tiedemann. 2016. OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16).

Sharid Loáiciga, Sara Stymne, Preslav Nakov, Christian Hardmeier, Jörg Tiedemann, Mauro Cettolo, and Yannick Versley. 2017. Findings of the 2017 DiscoMT shared task on cross-lingual pronoun prediction. In The Third Workshop on Discourse in Machine Translation.

Lesly Miculicich Werlen and Andrei Popescu-Belis. 2017a. Using coreference links to improve Spanish-to-English machine translation. In Proceedings of the 2nd Workshop on Coreference Resolution Beyond OntoNotes (CORBON 2017), pages 30-40, Valencia, Spain. Association for Computational Linguistics.

Lesly Miculicich Werlen and Andrei Popescu-Belis. 2017b. Validation of an automatic metric for the accuracy of pronoun translation (APT). In Proceedings of the Third Workshop on Discourse in Machine Translation (DiscoMT). Association for Computational Linguistics.

Amit Moryossef, Roee Aharoni, and Yoav Goldberg. 2019. Filling gender & number gaps in neural machine translation with black-box context injection. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pages 49-54, Florence, Italy. Association for Computational Linguistics.

Mathias Müller, Annette Rios, Elena Voita, and Rico Sennrich. 2018. A large-scale test set for the evaluation of context-aware pronoun translation in neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 61-72, Brussels, Belgium. Association for Computational Linguistics.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311-318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the North American Chapter of the Association for Computational Linguistics.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, pages 1135-1144.

Karin Sim Smith. 2017. On integrating discourse in machine translation. In Proceedings of the Third Workshop on Discourse in Machine Translation, pages 110-121, Copenhagen, Denmark. Association for Computational Linguistics.

Jason R. Smith, Herve Saint-Amand, Magdalena Plamada, Philipp Koehn, Chris Callison-Burch, and Adam Lopez. 2013. Dirt cheap web-scale parallel text from the Common Crawl. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1374-1383, Sofia, Bulgaria. Association for Computational Linguistics.

Dario Stojanovski and Alexander Fraser. 2018. Coreference and coherence in neural machine translation: A study using oracle experiments. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 49-60.

Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey. European Language Resources Association (ELRA).

Jörg Tiedemann and Yves Scherrer. 2017. Neural machine translation with extended context. In Proceedings of the Third Workshop on Discourse in Machine Translation, pages 82-92, Copenhagen, Denmark. Association for Computational Linguistics.

Eva Vanmassenhove, Christian Hardmeier, and Andy Way. 2018. Getting gender right in neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3003-3008, Brussels, Belgium. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of NIPS, pages 6000-6010, Long Beach, California.

Stephan Vogel, Hermann Ney, and Christoph Tillmann. 1996. HMM-based word alignment in statistical translation. In Proceedings of the 16th Conference on Computational Linguistics. Association for Computational Linguistics.

Elena Voita, Pavel Serdyukov, Rico Sennrich, and Ivan Titov. 2018. Context-aware neural machine translation learns anaphora resolution. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1264-1274, Melbourne, Australia. Association for Computational Linguistics.

Claudia Wagner, Eduardo Graells-Garrido, David Garcia, and Filippo Menczer. 2016. Women through the glass ceiling: Gender asymmetries in Wikipedia. EPJ Data Science, 5.

Longyue Wang, Zhaopeng Tu, Andy Way, and Qun Liu. 2017. Exploiting cross-sentence context for neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2826-2831, Copenhagen, Denmark. Association for Computational Linguistics.

Longyue Wang, Zhaopeng Tu, Andy Way, and Qun Liu. 2018. Learning to jointly translate and predict dropped pronouns with a shared reconstruction mechanism. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2997-3002, Brussels, Belgium. Association for Computational Linguistics.

Kellie Webster, Marta Recasens, Vera Axelrod, and Jason Baldridge. 2018. Mind the GAP: A balanced corpus of gendered ambiguous pronouns. In Transactions of the Association for Computational Linguistics, page to appear.

Jiacheng Zhang, Huanbo Luan, Maosong Sun, Feifei Zhai, Jingfang Xu, Min Zhang, and Yang Liu. 2018. Improving the Transformer translation model with document-level context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 533-542, Brussels, Belgium. Association for Computational Linguistics.
Pei Zhou, Weijia Shi, Jieyu Zhao, Kuan-Hao Huang, Muhao Chen, Ryan Cotterell, and Kai-Wei Chang. 2019. Examining gender bias in languages with grammatical gender. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5276-5284, Hong Kong, China. Association for Computational Linguistics.

Michał Ziemski, Marcin Junczys-Dowmunt, and Bruno Pouliquen. 2016. The United Nations Parallel Corpus v1.0. In Proceedings of the 10th edition of the Language Resources and Evaluation Conference.