Scalable Cross Lingual Pivots to Model Pronoun Gender for Translation
Kellie Webster and Emily Pitler
Google Research
{websterk|epitler}@google.com
arXiv:2006.08881v1 [cs.CL] 16 Jun 2020

Abstract

Machine translation systems with inadequate document understanding can make errors when translating dropped or neutral pronouns into languages with gendered pronouns (e.g., English). Predicting the underlying gender of these pronouns is difficult since it is not marked textually and must instead be inferred from coreferent mentions in the context. We propose a novel cross-lingual pivoting technique for automatically producing high-quality gender labels, and show that this data can be used to fine-tune a BERT classifier with 92% F1 for Spanish dropped feminine pronouns, compared with 30-51% for neural machine translation models and 54-71% for a non-fine-tuned BERT model. We augment a neural machine translation model with labels from our classifier to improve pronoun translation, while still having parallelizable translation models that translate a sentence at a time.

Spanish: Britney Jean Spears... ∅ Adquirió fama durante su niñez al participar en el programa de televisión The Mickey Mouse Club (1992).

Spanish→English: He gained fame during his childhood by participating in the television program The Mickey Mouse Club (1992).

Figure 1: Typical example of pronoun gender errors when translating from pronouns without textual gender marking (e.g., dropped or possessive pronouns in Spanish) to a language with gendered pronouns.

1 Introduction

While neural modeling has solved many challenges in machine translation, a number remain, notably document-level consistency. Since standard architectures are sentence-based, they can misrepresent pronoun gender, since cues for gender are often extra-sentential (Läubli et al., 2018). For example, in Figure 1,[1] there is a dropped subject and a neutral possessive pronoun in the Spanish, both of which are translated with masculine pronouns in English despite context indicating that they refer to Britney Jean Spears. Recommendations for the use of discourse in translation are developed in Sim Smith (2017).

Such errors are worrying for their prevalence. An oracle study found that augmenting an NMT system with perfect pronoun information could give up to 6 BLEU points of improvement (Stojanovski and Fraser, 2018). Further, the DiscoMT shared task on pronoun prediction tested multiple language pairs and found that the most difficult language pair by far was Spanish→English, and that "[t]his result reflects the difficulty of translating null subjects" (Loáiciga et al., 2017).

We address document-level modeling for dropped and neutral personal pronouns in Spanish. We significantly improve the ability of a strong sentence-level translator to produce correctly gendered pronouns in three stages. First, we automatically collect high-quality pronoun gender-tagged examples at scale using a novel cross-lingual pivoting technique (Section 3). This technique easily scales to produce large amounts of data without explicit human labeling, while remaining largely language agnostic. This means our technique may also be easily scaled to produce data for languages other than Spanish; we present results on Spanish as a proof of concept.

Next, we show how to fine-tune BERT (Devlin et al., 2019), a state-of-the-art pretrained language model, to classify pronoun gender using our new dataset (Section 4). Finally, we add context tags from our BERT model into standard MT sentence input, yielding an 8.8% F1 improvement in feminine pronoun generation (Section 5).

[1] Source: https://es.wikipedia.org/wiki/Britney_Spears. Translation: Retrieved from translate.google.com, 27 Feb 2019.

Properties           Surface Form
                     En     Es
Nominative, Masc.    he     él or ∅
Nominative, Fem.     she    ella or ∅
Possessive, Masc.    his    su
Possessive, Fem.     her    su

Table 1: Masculine and feminine pronouns cannot be distinguished by their surface forms in Spanish in both the dropped and the possessive case.

2 Background

In this section, we overview particular nuances associated with pronoun understanding, the resulting challenges for translation, and existing relevant datasets.

Dropped (Implicit) Pronouns
Spanish personal pronouns have less textual marking of gender than their English counterparts. Where they may be inferred from context, Spanish drops the subject pronouns él/ella (he/she). Several other languages, such as Chinese, Japanese, Greek, and Italian, also have dropped pronouns.

Possessive Pronouns
Additionally, Spanish possessive pronouns do not mark the gender of the possessor, with su(s) used where English uses his and her. In other Romance languages such as French, the possessive pronoun agrees with the gender of the possessed object, rather than with the possessor as in English. These usages are summarized for Spanish in Table 1 and illustrated in Figure 1.

MT of Ambiguous Pronouns
Document-level consistency is an outstanding challenge for machine translation (Läubli et al., 2018). One reason is that standard techniques translate sentences independently of one another, meaning that context outside a given sentence cannot be used to make translation decisions. In a language like Spanish, this means that critical cues for pronoun gender are simply not present at decoding time.

There have been a number of directions exploring how to introduce general contextual awareness into translation. Tiedemann and Scherrer (2017) investigated a simple technique of providing multiple sentences, concatenated together, to a translator, and learning a model to translate the final sentence in the stream using the preceding one(s) for contextualization. Unfortunately, no substantial performance gain was seen. Next, the Transformer model (Vaswani et al., 2017) was extended by introducing a second, context decoder (Wang et al., 2017; Zhang et al., 2018). Complementary to this, Kuang et al. (2018) introduce two caches, one lexical and the other topical, to store information from context pertinent to translating future sentences.

Specific to the translation of implicit or dropped pronouns, Wang et al. (2018) show that modeling pronoun prediction jointly with translation improves quality, while Miculicich Werlen and Popescu-Belis (2017a) present a proof of concept where coreference links are available at decoding time. Most similar to our current work, Habash et al. (2019), Moryossef et al. (2019), and Vanmassenhove et al. (2018) improve gender fidelity in translations of first-person singular pronouns in languages with grammatical gender inflection in this position. They show promising results from either injecting explicit information about speaker gender or predicting it with a classifier. We extend this line of work without requiring human labels to train our classifier.

Prior Work on Aligned Corpora
Translation of sentences has been a task at the Workshop on Machine Translation (WMT) since 2006. Over this period, more and larger datasets have been released for training statistical models. Typically, released data is given in sentence-pair form, meaning that for each source-language sentence (e.g. Spanish) there is a target-language sentence (e.g. English) which may be considered its true translation. For a discussion of the adequacy of this approach, see Guillou and Hardmeier (2018).

Translation of Spanish sentences to English was most recently contested at WMT'13[2] (Bojar et al., 2013), which offered participants 14,980,513 sentence pairs from Europarl[3] (Habash et al., 2017), Common Crawl (Smith et al., 2013), the United Nations corpus[4] (Ziemski et al., 2016), and news commentary. While large, this dataset was not designed for document-level understanding. Document boundaries are available, but correspond loosely to what a user of a translation system might consider a document (whole days of European parliament, for instance).

DiscoMT was introduced as a workshop in 2013[5] to target the translation of pronouns in document context. As well as Europarl, TED talk subtitles[6] (Tiedemann, 2012) were included for both training and evaluation. Also available for the translation of extended speech is OpenSubtitles[7] (Lison and Tiedemann, 2016; Müller et al., 2018), which contains utterance pairs that are aligned and contextualized according to their temporal appearance in a film. Although subtitle datasets make very wide context available, they are not ideal for this study due to their heavy use of first- and second-person pronouns and noisy alignment. High-quality targeted datasets for discourse phenomena were explored in Bawden et al. (2018); however, the released corpus was a small French→English test set. In this paper, we introduce a novel cross-lingual technique for gathering a large-scale targeted dataset without explicit human annotation.

[2] http://www.statmt.org/wmt13/translation-task.html
[3] http://www.statmt.org/europarl/
[4] https://cms.unov.org/UNCorpus/
[5] http://aclweb.org/anthology/W13-3300
[6] http://opus.nlpl.eu/TED2013.php
[7] http://opus.nlpl.eu/OpenSubtitles.php

3 Cross-Lingual Pivoting for Pronoun Gender Data Creation

Despite document-level context being critical for pronoun translation quality, little of the available aligned data is directly suitable for training models. In this section, we exploit the special properties of Wikipedia to automate the collection of high-quality gender-tagged pronouns in document context.

3.1 Cross-Lingual Pivoting

Wikipedia has a number of desirable properties which we exploit to automatically tag pronoun gender. Firstly, it is very large for head languages like English and Spanish. More importantly, many non-English Wikipedia pages express content similar to their English counterparts, given the resource's style and formatting requirements and the ease of using translation systems to generate base text for a missing non-English page. Based on these properties, we develop the following three-stage extraction pipeline, illustrated in Table 2, with pipeline yield summarized in Table 3.

(A) Page Alignment
    Title: "Mitsuko Shiga"
    https://en.wikipedia.org/wiki/Mitsuko_Shiga
    https://es.wikipedia.org/wiki/Mitsuko_Shiga

(B) Sentence Alignment
    Spanish: Publicó numerosas antologías de su poesía durante su vida, incluyendo Fuji no Mi, Asa Tsuki, Asa Ginu, y Kamakura Zakki.
    Spanish→English: He published numerous anthologies of his poetry during his life, including Fuji no Mi, Asa Tsuki, Asa Ginu, and Kamakura Zakki.
    English: She published numerous anthologies of her poetry during her lifetime, including Fuji no Mi ("Wisteria Beans"), Asa Tsuki ("Morning Moon"), Asa Ginu ("Linen Silk"), and Kamakura Zakki ("Kamakura Miscellany").

(C) Pronoun Tagging
    ∅/She Publicó numerosas antologías de su/her poesía durante su/her vida, incluyendo Fuji no Mi, Asa Tsuki, Asa Ginu, y Kamakura Zakki.

Table 2: Example of how pronoun translations are extracted from comparable texts. The English and Spanish sentences were extracted from Wikipedia articles about the same subject, the sentences were aligned due to the similarities in their translations, and the labels were extracted by matching the Spanish dropped and ambiguous pronouns with the English pronouns.

Page Alignment
Identify as pairs pages with the same title in English and Spanish. We prioritize precision and require an exact match.

Sentence Alignment
Identify sentence pairs in aligned pages which express nearly the same meaning. To do this, we compare English translations of sentences[8] from the Spanish page to sentences from the English page. We do bipartite matching over these English sentence pairs, selecting pairings with the smallest edit distance. We require that the edit distance is at most one half of the sentence length, that paired sentences share either a noun or a verb, and that the mapping is at most one-to-one. Table 2, step (B), shows an aligned sentence pair.

Pronoun Tagging
Perform alignment[9] over the tokens in detected sentence pairs, identifying where English uses a gendered pronoun and Spanish has either a dropped or a possessive pronoun. Copy the gender of the English pronoun as a label onto the ambiguous target pronoun: masculine for he and his, feminine for she and her. Step (C) in Table 2 shows the resulting pronoun tagging for the example.

[8] From an in-house tool.
[9] We use an implementation of Vogel et al. (1996).

                     Prodrop                        Possessive
                     Articles    he       she       Articles    his      her
Spanish Wikipedia    1,326,469   -        -         1,326,469   -        -
Page Alignment       420,850     -        -         420,850     -        -
Pronoun Tagging      45,210      118,911  41,524    52,559      266,886  111,411

Table 3: Extraction yield in each stage of the multi-lingual pivot extraction pipeline.

3.2 Final Dataset

Gender
We can see from Table 3 that approximately three times as many masculine examples are extracted as feminine examples. This is not surprising given the disparity in representation by gender found in Wikipedia (Wagner et al., 2016). For the final dataset, we down-sample masculine examples to achieve a 1:1 gender balance, yielding a final dataset with 79,240 prodrop and 187,224 possessive examples. We split the 79,240 prodrop examples into training, development, and test sets of size 72,120, 2,968, and 4,152, respectively; the 187,224 possessive examples are similarly split into sets of size 167,222, 8,862, and 11,140.

Quality
To assess the quality of our examples, we sent 993 examples for human rating. In-house bilingual speakers were shown a paragraph of Spanish text containing a labeled pronoun and asked whether, given the context, the pronoun referred to a masculine person, a feminine person, or whether it was unclear. Agreement between these human labels and those produced by our automatic pipeline is high: 84% of prodrop and 80% of possessive pronouns could be disambiguated with the provided context and, of those, 98% and 92% of automatically generated gender tags agreed with human judgment. Where the provided paragraph was not enough, it was typically the case that the relevant context was in the directly preceding paragraph. Confusion matrices are given in Tables 4 and 5.

                 Human
Pipeline   he    she   unclear
he         221   3     36
she        7     179   43

Table 4: Agreement between automatic labels and human judgment for prodrop examples.

                 Human
Pipeline   his   her   unclear
his        198   6     48
her        25    179   55

Table 5: Agreement between automatic labels and human judgment for possessive pronoun examples.

                                         Masculine            Feminine
                                         P     R     F1       P     R     F1
Prodrop      Baseline MT (Sentences)     56.2  81.3  66.5     91.3  18.6  30.9
             Baseline MT (Contexts)      62.4  60.5  61.4     93.0  25.1  39.5
             Context MT (Contexts)       65.1  65.3  65.2     88.9  35.8  51.0
             BERT (Sentences)            66.6  61.6  64.0     87.0  40.2  54.9
             BERT (Contexts)             81.6  58.5  68.1     92.6  58.5  71.7
             BERT (Contexts) + Data      90.4  95.2  92.7     95.0  89.8  92.3
Possessives  Baseline MT (Sentences)     58.2  77.8  66.6     86.7  30.4  45.0
             Baseline MT (Contexts)      63.6  63.4  63.5     87.2  35.0  50.0
             Context MT (Contexts)       64.2  67.6  65.9     86.4  38.0  52.8
             BERT (Contexts) + Data      87.4  91.3  89.3     90.9  86.8  88.8

Table 6: Intrinsic evaluation of pronoun gender classification over our new test sets.
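The sentence-alignment stage of the pipeline in Section 3.1 can be sketched in a few lines. This is a minimal illustration under stated assumptions (word-level edit distance, a greedy approximation of the bipartite matching, and a caller-supplied set of content words standing in for the noun/verb check), not the authors' implementation:

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance via dynamic programming."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,          # deletion
                         cur[j - 1] + 1,       # insertion
                         prev[j - 1] + (a[i - 1] != b[j - 1]))  # substitution
        prev = cur
    return prev[n]

def align_sentences(translated_es, en_sentences, content_words):
    """Greedily match machine-translated Spanish sentences to English
    sentences one-to-one, keeping only pairs whose word-level edit
    distance is at most half the (longer) sentence length and which
    share at least one content word (standing in for the noun/verb
    constraint). Returns (spanish_index, english_index) pairs."""
    candidates = []
    for i, src in enumerate(translated_es):
        for j, tgt in enumerate(en_sentences):
            s, t = src.split(), tgt.split()
            d = edit_distance(s, t)
            if d <= max(len(s), len(t)) / 2 and set(s) & set(t) & content_words:
                candidates.append((d, i, j))
    used_i, used_j, pairs = set(), set(), []
    for d, i, j in sorted(candidates):  # smallest edit distance first
        if i not in used_i and j not in used_j:
            pairs.append((i, j))
            used_i.add(i)
            used_j.add(j)
    return pairs
```

On a pair like the Table 2 example, the machine-translated Spanish sentence still matches the English sentence despite the he/she mismatch, because the edit distance between the two English strings is small relative to their length.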
4 Classifying Pronoun Gender

Language model (LM) pretraining (Peters et al., 2018; Devlin et al., 2019) is emerging as an essential technique for human-quality language understanding. However, LMs are limited to learning from what is explicitly expressed in text, which seems insufficient for Spanish, given that gender is often not textually realized for pronouns. We show that BERT, a state-of-the-art pretrained LM (Devlin et al., 2019), already models some aspects of pronoun gender, achieving scores above a strong NMT baseline. We then show that fine-tuning this LM using our new data improves pronoun gender discrimination by over 20% absolute F1 for both genders.

4.1 Neural Machine Translation

We take as our NMT baseline a Transformer (Vaswani et al., 2017) with vocabulary size 32,000, trained for up to 250k steps. We set the input dimension to 512 and use 6 layers with 8 attention heads, hidden dimension 2048, dropout probabilities of 0.1, and learning rate 2.0. We train this model on WMT'13 Spanish→English data,[10] preprocessed using the Moses toolkit.[11] Rows designated Baseline in Table 6 are trained with the standard sentence-level formulation of the task and yield a best test-set BLEU score (Papineni et al., 2002) of 34.02. Context MT is trained using the 2+1 strategy of Tiedemann and Scherrer (2017), where an extended context, in our case full sentences up to 128 subtokens,[12] is used to predict the translation of the contextualized target sentence. This modeling yields 33.99 BLEU, which is not substantially different from Baseline MT, consistent with Tiedemann and Scherrer.

To evaluate how well these NMT systems model pronoun gender, we use them to translate the Spanish text from our new dataset into English. For each target Spanish pronoun, we use an implementation of Vogel et al. (1996) to find the corresponding English pronoun, and count the translation as correct if the gender tag on the Spanish agrees with the English pronoun gender. Table 6 shows the results when the input units are single sentences (Sentences) or full sentences up to 128 subtokens (Contexts). All context sentences precede the one containing the pronoun (no lookahead).

Performance is weak overall, but especially so on feminine examples of prodrop. The tendency is for masculine pronouns to be over-used: feminine pronouns are used by Baseline MT (Sentences) less than one time in five when they should be. To understand why this effect is so strong, we analyzed the development set of WMT'13 (Table 7). Specifically, we counted the number of times the English pronouns he and she occurred (first column). Not surprisingly, there was an over-representation of masculine pronouns; for each feminine example in the data, there were four masculine examples. Next, we looked in the aligned Spanish sentences for the corresponding pronouns él and ella and counted a case of prodrop each time they were missing (second column). Even accounting for differences in raw frequency by gender, prodrop is more frequent for masculine entities than for feminine ones. This is the first time this difference has been reported, and future work should control for the prior it induces.

       Count   Prodrop   Rate
he     1,220   1,106     90.7%
she    314     251       79.9%

Table 7: Even accounting for the difference in the raw number of pronouns by gender, prodrop is more likely for masculine entities than feminine entities in WMT'13 (development).

Introducing context in the input to an MT model, whether trained on sentences or contexts, improves overall performance by enhancing feminine recall and masculine precision at the expense of masculine recall. One possible reason is that the added context includes the relevant cues for breaking out of the prior for masculine pronouns. The magnitude of the change in F1 score is perhaps beyond what would be expected given that Baseline MT and Context MT achieve similar BLEU scores. However, this finding builds on the arguments in Cifka and Bojar (2018), which show that BLEU score does not correlate perfectly with semantics.

[10] http://www.statmt.org/wmt13/translation-task.html
[11] https://github.com/moses-smt/mosesdecoder
[12] Given this very local context, we make the assumption that adjacent sentences in the pre-shuffled dataset are document-continuous. This assumption introduces noise at document boundaries, which occur much less frequently than document-continuation transitions.

4.2 Language Model Pretraining

We initialize our LM experiments from a BERT-Base model (110M parameters; see Devlin et al. (2019)) trained over Wikipedia in 102 languages, including Spanish. To assess how much this model knows about pronoun gender out of the box, we extend the masked-LM approach from the original BERT work, inserting a mask token in each dropped-subject position (see Figure 2 for an example).[13] To generate gender predictions, we deduce masculine if él appears before ella in the list of mask fills and vice versa; we make no gender prediction where neither pronoun appears in the top ten mask fills. The results of this probing are shown in the BERT rows of Table 6, which use input feeds of one sentence (Sentences) and up to 128 subtokens (Contexts).

Input: Britney Jean Spears (McComb, Misisipi; 2 de diciembre de 1981), conocida como Britney Spears, es una cantante, bailarina, compositora, modelo, actriz, diseñadora de modas y empresaria estadounidense. Comenzó a actuar desde niña, a través de papeles en producciones teatrales. *** adquirió fama durante su niñez al participar en el programa de televisión The Mickey Mouse Club (1992).
MaskLM Output: spears (0.61), britney (0.23), tambien (0.06), ella (0.03), rapidamente (0.01)...
Prediction: Feminine

Figure 2: Example of how the BERT pretrained model (Devlin et al., 2019) is evaluated on gendered pronoun prediction. The MaskLM predicts a word for the wildcard position. The feminine ella is found near the top of the k-best list, so the prediction is Feminine.

Despite scanning the full top ten mask fills, we can see that the strength of BERT lies in its strong precision rather than in boosts to recall. While not directly comparable to the MT values, which are generated single-shot, we can see that the BERT values are stronger, particularly for feminine examples, which are handled poorly by machine translation.

[13] This is applied to prodrop only. No analogous probing may be done for the neutral Spanish possessive pronouns without potentially semantic-breaking rewriting.

4.3 BERT Classification Task

We extend this understanding of pronoun gender using the fine-tuning recipe given in the original BERT paper. Specifically, we generate classification examples where the output is a gender tag (masculine or feminine) and the input text is up to 128 subtokens, including special tokens to mark the target pronoun position (as in Table 8). We train over both the prodrop and the possessive examples from the training portion of our new dataset.

Table 6 shows that this approach is extraordinarily powerful for modeling pronoun gender in our dataset. We note that the performance we see here is remarkably similar to the human ratings of quality for the dataset.

(1) Comenzó a actuar desde niña, a través de papeles en producciones teatrales.
    S/he started to act as a child.FEM, through roles in theatrical productions.
    Adquirió fama durante su niñez.
    S/he acquired fame in his/her childhood.

Before Fine-tuning              After Fine-tuning
Class    Token      Weight      Class    Token      Weight
*Masc.   durante     0.01       Fem.     niña        0.29
         través     -0.01                Comenzó    -0.10
         actuar     -0.01                niñez       0.09
         Comenzó     0.01                papeles    -0.01

Table 8: LIME analysis for a representative Spanish text in which all third-person singular pronouns are either dropped or genderless. An English gloss, including grammatical gender marking, is given above.
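The abstention-aware decision rule from Section 4.2 reduces to a comparison of ranks in the k-best list of mask fills. A minimal sketch follows; the bare pronoun strings and the flat token list are simplifications, since the real probe operates over entries of BERT's multilingual wordpiece vocabulary:

```python
def gender_from_mask_fills(kbest, k=10):
    """Predict pronoun gender from a ranked list of masked-LM fills:
    masculine if 'él' outranks 'ella', feminine if 'ella' outranks
    'él', and no prediction when neither appears in the top k."""
    top = kbest[:k]
    masc = top.index("él") if "él" in top else None
    fem = top.index("ella") if "ella" in top else None
    if masc is None and fem is None:
        return None  # abstain: neither pronoun in the top-k fills
    if fem is None or (masc is not None and masc < fem):
        return "masculine"
    return "feminine"
```

Applied to the Figure 2 example, the fill list contains ella but not él, so the probe predicts Feminine.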
On all metrics and in both genders, we see leaps of 20 points and higher. There is minimal performance difference across the two genders; precision and recall do not need to trade off against one another. Our novel method, which allows implicit pronoun gender to be made explicit and amenable to text-level modeling, is very well suited to pronoun gender classification.

4.4 Analysis

We use the LIME toolkit[14] (Ribeiro et al., 2016) to understand how the fine-tuning recipe yields such a strong model of pronoun gender. LIME is built for interpretability, learning linear models which approximate heavy-weight architectures over features which are intuitive to humans. By analyzing the highly weighted features in these approximations, we can learn what factors are important in the classification decisions of the heavy-weight models.

Table 8 shows the top tokens in the LIME analysis of our gender classifier, both at the start and at the end of fine-tuning, for a representative Spanish text, shown with its English gloss. Before fine-tuning, the classification decision is incorrect, and importance is low and spread fairly evenly across all terms in context. After fine-tuning, a correct classification is made, primarily with reference to the niña token, which is the feminine form of child and refers to a young Britney Spears, the dropped subject of Adquirió. That is, just as the original Transformer model implicitly models coreference for translation (Voita et al., 2018; Webster et al., 2018), a fine-tuned BERT classifier appears to implicitly learn coreference to make gender decisions here.

[14] https://github.com/marcotcr/lime

5 Improving Machine Translation with Classification Tags

Given the strong intrinsic evaluation, we use the gender classification tags directly in a standard NMT architecture to introduce implicit contextual awareness while still having translation parallelizable across sentences. This follows precedent set in Johnson et al. (2017) and Zhou et al. (2019) for controlling translations via tag-based annotations. Our simple approach yields substantial improvements in pronoun translation quality.

Baseline MT + Gender Tags
We use the Transformer model specification above, but train with WMT'13 sentence pairs augmented with gender tags. For each Spanish sentence, we classify whether it contains a third-person singular dropped subject or possessive pronoun using heuristics over morphological analysis, part-of-speech tags, and syntactic dependency structure (Andor et al., 2016).[15] For those containing a target pronoun, we concatenate the gender tag predicted by BERT (Contexts) + Data after the sentence,[16] following a special context token. For example:

(2) Adquirió fama durante su niñez al participar en el programa de televisión The Mickey Mouse Club (1992).

For robustness, we introduce noise by randomly flipping whether the sentence contains a target pronoun 2% of the time[17] and generating a random gender tag 5% of the time. We choose this strategy as a simple extension of WMT which is fit for our purposes. Specifically, we seek to show that our gender classification model captures implicit pronoun gender well, and that its understanding is useful for the translation of ambiguous pronouns.

[15] Dropped pronouns occur where a ROOT verb has no nsubj dependent and no noun to its left; possessive pronouns are tagged with a label ending in $.
[16] To preserve positional encoding.
[17] In which case, the input contains no special token.

Results
We evaluate translation performance using BLEU score, and pronoun generation quality using F1 score disaggregated by gender over gendered pronouns. We do not report the APT score[18] (Miculicich Werlen and Popescu-Belis, 2017b) since we target few pronouns and draw conclusions at the level of gender, not pronoun.

[18] https://github.com/idiap/APT

                Translation  Classifier          Masculine           Feminine
Model           Input        Input       BLEU    P     R     F1      P     R     F1
Baseline MT     Sentences    -           34.02   95.2  97.1  96.2    69.7  57.5  63.0
+ Gender Tags   Sentences    Contexts    34.12   96.8  96.2  96.5    70.0  73.7  71.8
Context MT      Contexts     -           33.99   97.6  94.7  96.1    63.2  80.0  70.6

Table 9: BLEU score and pronoun accuracy by gender on the WMT'13 Spanish-to-English test set.

Table 9 shows that incorporating contextual awareness, whether from concatenating full sentences or gender tags in the translation input, improves pronoun quality markedly. Masculine performance is uniformly strong, and our new model, Baseline MT + Gender Tags, gives the best performance on feminine pronouns, achieving an 8.8% F1 score improvement compared to Baseline MT. As for the contextual MT models analyzed in the previous section, the improvement comes from addressing the low recall over feminine pronouns in sentence-level translation (while retaining precision). We manually reviewed error cases and noted that the major class arose from domain shift: where antecedents were close to pronouns in Wikipedia, there were more frequent cases in WMT of antecedents being outside the BERT context window, and these were misclassified. Future work could look at methods for learning pretrained language models from extended document context.

6 Conclusion

In this paper, we introduce a cross-lingual pivoting technique which generates large volumes of labeled data for pronoun classification without explicit human annotation. We demonstrate improvements on translating pronouns from Spanish to English using a BERT-based classifier fine-tuned on this data, which achieves an F1 of 92% for feminine pronouns, compared with 30-51% for neural machine translation models and 54-71% for non-fine-tuned BERT. We show that our classifier can be incorporated into a standard sentence-level neural machine translation system to yield an 8.8% F1 score improvement in feminine pronoun generation.

Our data creation method is largely language-agnostic and may be extended in future work to new language pairs, or even to different grammatical features. Further in this direction, we are excited by the prospect of zero-shot pronoun gender classification, particularly in low-resource languages.

Acknowledgements

We would like to thank Melvin Johnson, Romina Stella, and Macduff Hughes for assistance with machine translation work, as well as Mike Collins, Slav Petrov, and other members of the Google Research team for their feedback and suggestions on earlier versions of this manuscript.

References

Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. 2016. Globally normalized transition-based neural networks. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2442-2452, Berlin, Germany. Association for Computational Linguistics.

Rachel Bawden, Rico Sennrich, Alexandra Birch, and Barry Haddow. 2018. Evaluating discourse phenomena in neural machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1304-1313, New Orleans, Louisiana. Association for Computational Linguistics.

Ondřej Bojar, Christian Buck, Chris Callison-Burch, Christian Federmann, Barry Haddow, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2013. Findings of the 2013 Workshop on Statistical Machine Translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 1-44, Sofia, Bulgaria. Association for Computational Linguistics.

Ondřej Cífka and Ondřej Bojar. 2018. Are BLEU and meaning representation in opposition? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1362-1371, Melbourne, Australia. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186.

Liane Guillou and Christian Hardmeier. 2018. Automatic reference-based evaluation of pronoun translation misses the point. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4797-4802, Brussels, Belgium. Association for Computational Linguistics.

Nizar Habash, Houda Bouamor, and Christine Chung. 2019. Automatic gender identification and reinflection in Arabic. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pages 155-165, Florence, Italy. Association for Computational Linguistics.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339-351.

Shaohui Kuang, Deyi Xiong, Weihua Luo, and Guodong Zhou. 2018. Modeling coherence for neural machine translation with dynamic and topic caches. In Proceedings of the 27th International Conference on Computational Linguistics, pages 596-606, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Samuel Läubli, Rico Sennrich, and Martin Volk. 2018. Has machine translation achieved human parity? A case for document-level evaluation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4791-4796.

Pierre Lison and Jörg Tiedemann. 2016. OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16).

Sharid Loáiciga, Sara Stymne, Preslav Nakov, Christian Hardmeier, Jörg Tiedemann, Mauro Cettolo, and Yannick Versley. 2017. Findings of the 2017 DiscoMT shared task on cross-lingual pronoun prediction. In The Third Workshop on Discourse in Machine Translation.

Lesly Miculicich Werlen and Andrei Popescu-Belis. 2017a. Using coreference links to improve Spanish-to-English machine translation. In Proceedings of the 2nd Workshop on Coreference Resolution Beyond OntoNotes (CORBON 2017), pages 30-40, Valencia, Spain. Association for Computational Linguistics.

Lesly Miculicich Werlen and Andrei Popescu-Belis. 2017b. Validation of an automatic metric for the accuracy of pronoun translation (APT). In Proceedings of the Third Workshop on Discourse in Machine Translation (DiscoMT), EPFL-CONF-229974. Association for Computational Linguistics.

Amit Moryossef, Roee Aharoni, and Yoav Goldberg. 2019. Filling gender & number gaps in neural machine translation with black-box context injection. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, Florence, Italy. Association for Computational Linguistics.
der Bias in Natural Language Processing,
Nizar Habash, Nasser Zalmout, Dima Taji, pages 49–54, Florence, Italy. Association for
Hieu Hoang, and Maverick Alzate. 2017. Computational Linguistics.
A Parallel Corpus for Evaluating Machine Translation between Arabic and European Languages.
In Proceedings of the 15th Conference of the Mathias Muller, Annette Rios, Elena
European Chapter of the Association for Com- Voita, and Rico Sennrich. 2018.
putational Linguistics: Volume 2, Short Papers, A large-scale test set for the evaluation of context-aware pronou
pages 235–241, Valencia, Spain. Association for In Proceedings of the Third Conference on Ma-
Computational Linguistics. chine Translation: Research Papers, pages61–72, Belgium, Brussels. Association for Eva Vanmassenhove, Christian Hard-
Computational Linguistics. meier, and Andy Way. 2018.
Getting gender right in neural machine translation.
Kishore Papineni, Salim Roukos, Todd In Proceedings of the 2018 Conference on Em-
Ward, and Wei-Jing Zhu. 2002. pirical Methods in Natural Language Processing,
Bleu: a method for automatic evaluation of machine translation.
pages 3003–3008, Brussels, Belgium. Associa-
In Proceedings of 40th Annual Meeting of the tion for Computational Linguistics.
Association for Computational Linguistics,
pages 311–318, Philadelphia, Pennsylvania, Ashish Vaswani, Noam Shazeer, Niki Parmar,
USA. Association for Computational Linguis- Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
tics. ukasz Kaiser, and Illia Polosukhin. 2017. Atten-
tion is all you need. In Proceedings of NIPS,
Matthew E. Peters, Mark Neumann, Mohit Iyyer, pages 6000–6010, Long Beach, California.
Matt Gardner, Christopher Clark, Kenton Lee,
and Luke Zettlemoyer. 2018. Deep contextual- Stephan Vogel, Hermann Ney, and Christoph Till-
ized word representations. In Proceedings of the mann. 1996. HMM-based word alignment in sta-
North American Chapter of the Association of tistical translation. In Proceedings of the 16th
Computational Linguistics. conference on Computational linguistics. Asso-
ciation for Computational Linguistics.
Marco Tulio Ribeiro, Sameer Singh, and Carlos
Guestrin. 2016. ”why should I trust you?”: Ex- Elena Voita, Pavel Serdyukov, Rico
plaining the predictions of any classifier. In Pro- Sennrich, and Ivan Titov. 2018.
ceedings of the 22nd ACM SIGKDD Interna- Context-aware neural machine translation learns anaphora resol
tional Conference on Knowledge Discovery and In Proceedings of the 56th Annual Meeting of
Data Mining, San Francisco, CA, USA, August the Association for Computational Linguistics
13-17, 2016, pages 1135–1144. (Volume 1: Long Papers), pages 1264–1274,
Melbourne, Australia. Association for Compu-
Karin Sim Smith. 2017. tational Linguistics.
On integrating discourse in machine translation.
In Proceedings of the Third Workshop on Dis- Claudia Wagner, Eduardo Graells-Garrido, David
course in Machine Translation, pages 110–121, Garcia, and Filippo Menczer. 2016. Women
Copenhagen, Denmark. Association for Com- through the glass ceiling: Gender assymetrics
putational Linguistics. in Wikipedia. EPJ Data Science, 5.
Jason R. Smith, Herve Saint-Amand, Mag- Longyue Wang, Zhaopeng Tu,
dalena Plamada, Philipp Koehn, Chris Andy Way, and Qun Liu. 2017.
Callison-Burch, and Adam Lopez. 2013. Exploiting cross-sentence context for neural machine translation
Dirt Cheap Web-Scale Parallel Text from the Common InCrawl.
Proceedings of the 2017 Conference on Em-
In Proceedings of the 51st Annual Meeting of pirical Methods in Natural Language Processing,
the Association for Computational Linguistics pages 2826–2831, Copenhagen, Denmark. Asso-
(Volume 1: Long Papers), pages 1374–1383, ciation for Computational Linguistics.
Sofia, Bulgaria. Association for Computational
Linguistics. Longyue Wang, Zhaopeng Tu,
Andy Way, and Qun Liu. 2018.
Dario Stojanovski and Alexander Fraser. 2018. Learning to jointly translate and predict dropped pronouns with
Coreference and coherence in neural machine In Proceedings of the 2018 Conference on Em-
translation: A study using oracle experiments. pirical Methods in Natural Language Processing,
In Proceedings of the Third Conference on Ma- pages 2997–3002, Brussels, Belgium. Associa-
chine Translation: Research Papers, pages 49– tion for Computational Linguistics.
60.
Kellie Webster, Marta Recasens, Vera Axelrod,
Jörg Tiedemann. 2012. Parallel data, tools and and Jason Baldridge. 2018. Mind the GAP: A
interfaces in OPUS. In Proceedings of the Balanced Corpus of Gendered Ambiguous Pro-
Eight International Conference on Language Re- nouns. In Transactions of the ACL, page to ap-
sources and Evaluation (LREC’12), Istanbul, pear.
Turkey. European Language Resources Associ-
ation (ELRA). Jiacheng Zhang, Huanbo Luan, Maosong
Sun, Feifei Zhai, Jingfang Xu,
Jörg Tiedemann and Yves Scherrer. 2017. Min Zhang, and Yang Liu. 2018.
Neural machine translation with extended context. Improving the transformer translation model with document-lev
In Proceedings of the Third Workshop on Dis- In Proceedings of the 2018 Conference on Em-
course in Machine Translation, pages 82–92, pirical Methods in Natural Language Processing,
Copenhagen, Denmark. Association for Com- pages 533–542, Brussels, Belgium. Association
putational Linguistics. for Computational Linguistics.Pei Zhou, Weijia Shi, Jieyu Zhao, Kuan- Hao Huang, Muhao Chen, Ryan Cot- terell, and Kai-Wei Chang. 2019. Examining gender bias in languages with grammatical gender. In Proceedings of the 2019 Conference on Em- pirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP- IJCNLP), pages 5276–5284, Hong Kong, China. Association for Computational Linguistics. Micha Ziemski, Marcin Junczys-Dowmunt, and Bruno Pouliquen. 2016. The United Nations Parallel Corpus v1.0. In Proceedings of the 10th edition of the Language Resources and Evalua- tion Conference.