First the worst: Finding better gender translations during beam search

Danielle Saunders and Rosie Sallis and Bill Byrne
Department of Engineering, University of Cambridge, UK
{ds636, rs965, wjb31}@cam.ac.uk

Abstract

Neural machine translation inference procedures like beam search generate the most likely output under the model. This can exacerbate any demographic biases exhibited by the model. We focus on gender bias resulting from systematic errors in grammatical gender translation, which can lead to human referents being misrepresented or misgendered.

Most approaches to this problem adjust the training data or the model. By contrast, we experiment with simply adjusting the inference procedure: we rerank nbest lists using gender features obtained automatically from the source sentence, and apply gender constraints while decoding to improve nbest list gender diversity. We find that a combination of these techniques allows large gains in WinoMT accuracy without requiring additional bilingual data or an additional NMT model.

1 Introduction

Neural language generation models optimized by likelihood have a tendency towards 'safe' word choice. This lack of output diversity has been noted in machine translation, both statistical (Gimpel et al., 2013) and neural (Vanmassenhove et al., 2019), as well as in other areas of language generation such as dialog modelling (Li et al., 2016) and question generation (Sultan et al., 2020). Model-generated language may be predictable, repetitive or stilted. More insidiously, generating the most likely output based on corpus statistics can amplify any existing biases in the corpus (Zhao et al., 2017).

Potential harms arise when these biases around e.g. word choice or grammatical gender inflections reflect demographic or social biases (Sun et al., 2019). In the context of machine translation, the resulting gender mistranslations could involve implicit misgendering of a user or other human referent, or perpetuation of social stereotypes about the 'typical' gender of a referent in a given context.

The current dominant approaches to the problem involve model retraining (Vanmassenhove et al., 2018; Escudé Font and Costa-jussà, 2019; Stafanovičs et al., 2020) or tuning on targeted gendered bilingual data (Saunders and Byrne, 2020; Basta et al., 2020) – all of which risk introducing new sources of bias (Shah et al., 2020).

By contrast, we seek to find more diverse and correct translations from existing models, biased as they may be. Recently, Roberts et al. (2020) have implicated beam search as a driver towards gender mistranslation and gender bias amplification – we aim to mitigate this by guiding NMT inference towards finding better gender translations.

Our contributions are as follows: first, we analyze the nbest lists of NMT models exhibiting gender bias to determine whether more gender-diverse hypotheses can be found in the beam, and conclude that they often cannot. We then generate new nbest lists subject to gendered inflection constraints, and show that in this case the nbest lists have greater gender diversity. Finally, we demonstrate that it is possible to extract translations with the correct gendered entities from the original model's beam.

Our approach involves no changes to the model or training data, and requires only monolingual resources for the source language (identifying gendered information in the source sentence) and the target language (gendered inflections).

1.1 Related work

Various approaches have been explored for mitigating gender bias in natural language processing generally, and in machine translation specifically. Most involve some adjustment of the training data, either directly (Rudinger et al., 2018) or by controlling word embeddings (Bolukbasi et al., 2016). We view such approaches as complementary to our own investigation, which seeks a better approach to generating diverse translations during machine translation inference. The approaches most relevant to this work are gender control, and the treatment of gender errors as a correction problem.
Controlling gender translation generally involves introducing external information into the model. Miculicich Werlen and Popescu-Belis (2017) integrate cross-sentence coreference information into hypothesis reranking to improve pronoun translation. Vanmassenhove et al. (2018) incorporate sentence-level 'speaker gender' tags into training data, while Moryossef et al. (2019) incorporate a sentence-level gender feature during inference. Stafanovičs et al. (2020) include factored annotations of target grammatical gender for all source words, while Saunders et al. (2020) incorporate inline source sentence gender tags for specific entities. As for much of this prior work, our focus is not on the discovery of the correct gender tag, but on possible applications for this information once it is obtained.

A separate line of work treats the presence of gender-related inconsistencies or inaccuracies as a search and correction problem. Roberts et al. (2020) find that a sampling search procedure results in less misgendering than beam search. Saunders and Byrne (2020) rescore translations with a fine-tuned model that has incorporated additional gender sensitivity, while constraining rescoring to a gender-inflection lattice. A related line of work, less directly applicable to translation, focuses on reinflecting monolingual sentences to reduce the likelihood of bias (Habash et al., 2019; Alhafni et al., 2020; Sun et al., 2021). An important distinction between our work and earlier reinflection approaches is that we use only a single model for each language pair, significantly reducing computation and complexity.

2 Finding consistent gender in the beam

There are two elements to our proposed approach. First, we produce an nbest list using our baseline model. We either use simple beam search or a two-pass approach where the second pass searches for differently-gendered versions of the best first-pass hypothesis. Next, once we have our set of translation hypotheses, we rerank them according to gender criteria and extract the translation which best fulfils our requirements.

2.1 Gender-constrained beam search

We follow the gendered-inflection lattice search scheme as described in Saunders and Byrne (2020). To summarize briefly, a 'gender inflection lattice' is constructed for the target language, defining grammatically gendered reinflection options for words in a large vocabulary list. For example, in the Spanish lattice, the definite articles la and el can optionally be interchanged, as can the profession nouns médica and médico. Composing the inflection lattice with the original translation provides gendered vocabulary constraints that can be applied during a second-pass beam search, as illustrated in Figure 1. In practice the lattices are constructed using heuristics that could result in spurious arcs like 'médic', but these are expected to have low likelihood under a translation model.

[Figure 1 (not reproduced): Toy illustration of gender constraints for first-pass translation 'el médico' or 'la médica'.]

Unlike the original authors, we do not rescore the translations with a separate gender-sensitive translation model. Instead, we use the baseline model to produce gendered variations of its own biased hypotheses. This removes the additional experimental complexity and risk of bias amplification resulting from tuning a model for effective gender translation, as well as the computational load of storing the separate model.

While we carry out two full inference passes for simplicity of implementation, we note that further efficiency improvements are possible. For example, the source sentence encoding could be saved and reused for the reinflection pass. In principle, a more complex beam search implementation could apply constraints during the first inference pass, negating the need for two passes. These efficiency gains would not be possible if using a separate NMT model to rescore or reinflect the translation.
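To make the constraint scheme concrete, the sketch below enumerates the gendered variants a toy inflection map permits for the Figure 1 example. The INFLECTIONS entries and helper names are illustrative stand-ins for the heuristically-built lattices of Saunders and Byrne (2020); a real implementation would apply the per-token options as constraints inside the second-pass beam search rather than enumerating full strings.

# Minimal sketch of gender reinflection constraints, assuming a toy
# hand-written inflection map rather than the full heuristic lattices.
from itertools import product

# Each token maps to the set of grammatically gendered alternatives the
# second-pass search may choose between (toy entries for illustration).
INFLECTIONS = {
    "el": {"el", "la"}, "la": {"el", "la"},
    "médico": {"médico", "médica"}, "médica": {"médico", "médica"},
}

def reinflection_options(translation):
    """Per-token constraint sets derived from the first-pass translation."""
    return [sorted(INFLECTIONS.get(tok, {tok})) for tok in translation.split()]

def all_reinflections(translation):
    """Enumerate every translation reachable under the constraints; the
    second pass rescores these with the baseline model itself."""
    return [" ".join(toks) for toks in product(*reinflection_options(translation))]

print(all_reinflections("el médico"))
# ['el médica', 'el médico', 'la médica', 'la médico']

Note that spurious combinations such as 'el médica' are reachable, mirroring the spurious lattice arcs mentioned above; as in the lattice case, these are expected to score poorly under the model.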
2.2 Reranking gendered translations

We focus on the translation of sentences containing entities that are coreferent with gendered pronouns. Our method for reranking gendered translations uses information about the gender of entities in the source sentence. In the WinoMT challenge set, each sentence's primary entity is labelled – we refer to the use of this information as 'oracle' reranking. Similar known-gender information could be defined by a system user. As an alternative, we can use an off-the-shelf coreference resolution tool on the source language designed for pronoun disambiguation challenges (e.g. Rahman and Ng, 2012).

Once the pronoun-coreferent source word is identified, we can find the aligned target-language hypothesis word, similar to the approach in Stafanovičs et al. (2020). We determine the grammatical gender of the hypothesis by morphological analysis, either using hard-coded rules or off-the-shelf tools depending on the language, as in Stanovsky et al. (2019). We proceed to rerank the list of hypotheses first by article gender agreement with the source pronoun, and then by log likelihood. If there are no gender-agreeing hypotheses, we default to log likelihood alone [1].
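A minimal sketch of this reranking heuristic follows. The helpers (TOY_GENDER, get_grammatical_gender, aligned_index) are hypothetical stand-ins: in our setup the alignment comes from fast_align and the gender analysis from the language-specific morphological tools described in Section 3. For simplicity the sketch assumes the aligned word sits at the same position in every hypothesis; in practice the alignment is computed per hypothesis.

# Sketch of gender-based nbest reranking, assuming each hypothesis is a
# (tokens, log_likelihood) pair and the target word aligned to the
# gendered source entity has already been identified.

# Toy morphological analyser for the German examples in this paper; a
# hypothetical stand-in for language-specific rules or tools.
TOY_GENDER = {"Koch": "male", "Köchin": "female",
              "Lehrer": "male", "Lehrerin": "female"}

def get_grammatical_gender(token):
    return TOY_GENDER.get(token, "unknown")

def rerank(nbest, source_gender, aligned_index):
    """Prefer hypotheses whose aligned word agrees with the source
    pronoun's gender; break ties by log likelihood. Fall back to log
    likelihood alone if no hypothesis agrees."""
    agreeing = [h for h in nbest
                if get_grammatical_gender(h[0][aligned_index]) == source_gender]
    pool = agreeing if agreeing else nbest
    return max(pool, key=lambda h: h[1])[0]

nbest = [("Der Koch bereitete ein Gericht vor".split(), -1.2),
         ("Die Köchin bereitete ein Gericht vor".split(), -1.9)]
print(" ".join(rerank(nbest, "female", aligned_index=1)))
# Die Köchin bereitete ein Gericht vor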
3 Experimental setup

3.1 Baseline models and data

We assess translation from English into German and Spanish. All models are Transformer 'base' models with 6 layers (Vaswani et al., 2017), trained using Tensor2Tensor with batch size 4K (Vaswani et al., 2018). The en-de model is trained on 17.6M sentences from WMT news tasks for 300K iterations (Barrault et al., 2020), and the en-es model on 10M sentences from the UN Corpus for 150K iterations (Ziemski et al., 2016).

We assess translation quality in terms of BLEU on a separate, untagged general test set. We report cased BLEU on the test sets from WMT18 (en-de), WMT13 (en-es) and IWSLT14 (en-he) using SacreBLEU [2] (Post, 2018).

3.2 Gender-constrained search and reranking

At inference time we perform beam search with and without reranking and gender-inflected lattice constraints. We construct gender reinflection lattices for rescoring as proposed by Saunders and Byrne (2020) [3]. For reranking, we use fast_align (Dyer et al., 2013) to link the source sentence gender information to the target sentence. When analysing the gender of target-language words, we use the same morphological analysis as for the corresponding languages in WinoMT evaluation.

When obtaining gender information automatically for reranking, we use the RoBERTa model (Liu et al., 2019) tuned for pronoun disambiguation on Winograd Schema Challenge data [4].

[1] Code to be made available.
[2] SacreBLEU signature: BLEU+case.mixed+numrefs.1+smooth.exp+tok.13a+v.1.4.8
[3] Scripts and data for lattice construction made available by the authors of Saunders and Byrne (2020) at https://github.com/DCSaunders/gender-debias
[4] Model from https://github.com/pytorch/fairseq/tree/master/examples/roberta/wsc

We assess gender translation on WinoMT (Stanovsky et al., 2019), as it is the only gender challenge set we are currently aware of for both language pairs. Our metrics are overall accuracy (higher is better) and ∆G, a metric measuring the F1 score difference between male- and female-labelled sentences (closer to 0 is better).
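For reference, the sketch below shows one way these metrics can be computed, assuming one 'male'/'female' gold label and one predicted label per WinoMT sentence, and taking ∆G as F1 with 'male' as the positive class minus F1 with 'female' as the positive class; the function names are illustrative rather than part of the WinoMT tooling.

# Sketch of the evaluation metrics: WinoMT-style accuracy and ∆G.
def f1(gold, pred, positive):
    """F1 score treating `positive` as the positive class."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def winomt_metrics(gold, pred):
    """Overall accuracy (higher is better) and ∆G (closer to 0 is better)."""
    accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    delta_g = f1(gold, pred, "male") - f1(gold, pred, "female")
    return accuracy, delta_g

acc, dg = winomt_metrics(["male", "male", "female", "female"],
                         ["male", "male", "male", "female"])
print(f"accuracy={acc:.2f}, deltaG={dg:.2f}")
# accuracy=0.75, deltaG=0.13  (positive ∆G indicates a male skew)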
4 Results and discussion

Beam   Gender search   en-de BLEU   en-es BLEU
4      ×               42.7         27.5
4      ✓               42.7         27.8
20     ×               42.3         27.3
20     ✓               42.7         27.8

Table 1: Cased BLEU on the generic test sets under different beam sizes and search procedures. Gender-constrained rescoring by the original model does not hurt BLEU, and can improve it.

4.1 Normal beam search has some gender diversity

We first evaluate beam search without any gender reranking. In Table 1 we assess BLEU for beam search with beam sizes 4 and 20, with and without gender constraints. Beam search with the larger beam of 20 and without gender constraints degrades BLEU, in line with the findings of Stahlberg and Byrne (2019). In Table 2 we see that WinoMT gender accuracy also degrades with a larger beam. However, our reranking procedure is able to find better gender translations for WinoMT sentences in the unconstrained beam. With beam size 4, WinoMT accuracy improves by 10% relative to the original scores, and ∆G improves by an even larger margin. Improvements are much greater when reranking even an unconstrained 20-best list, suggesting that standard beam search may exhibit gender diversity, if not necessarily in the highest-likelihood hypotheses.

4.2 We can find better gender translations in the constrained beam

We next experiment with applying gender constraints to a second-pass beam search. We note that constrained beam search mitigates BLEU degradation (Table 1) – a useful side-effect.

Beam   Oracle reranking   Gender search   en-de Acc   en-de ∆G   en-es Acc   en-es ∆G
4      ×                  ×               60.1        18.6       47.8        38.4
4      ✓                  ×               66.8        10.0       53.9        27.8
4      ✓                  ✓               77.4         0.0       56.2        22.8
20     ×                  ×               59.0        20.1       46.4        40.7
20     ✓                  ×               75.2         1.8       63.6        13.8
20     ✓                  ✓               84.5        -3.1       68.0         7.6

Table 2: WinoMT primary-entity accuracy (Acc) and male/female F1 difference ∆G under various search and ranking procedures. Gender search without reranking would be similar to the baseline, as gender constraints are constructed on the highest-likelihood hypothesis from normal beam-4 search. Table 1 demonstrates BLEU under constrained search without reranking: the scores are extremely close to the baselines.

Beam   RoBERTa reranking   Gender search   en-de Acc   en-de ∆G   en-es Acc   en-es ∆G
4      ×                   ×               60.1        18.6       47.8        38.4
4      ✓                   ×               66.4        10.4       53.0        28.6
4      ✓                   ✓               76.1         1.0       55.2        24.1
20     ×                   ×               59.0        20.1       46.4        40.7
20     ✓                   ×               73.9         2.7       60.9        15.6
20     ✓                   ✓               82.3        -2.2       65.6         9.8

Table 3: WinoMT primary-entity accuracy (Acc) and male/female WinoMT sentence F1 difference ∆G under various search and ranking procedures. Baseline scores with no reranking or gender search are duplicated from Table 2. Gender labels for reranking are determined automatically from the source sentence by a RoBERTa model.

For gender translation, Table 2 shows WinoMT accuracy improving dramatically relative to the baseline scores after gender-constrained search and reranking. In the appendix we give examples of 4-best lists generated with and without gender constraints, highlighting why this is the case.

4.3 Controlling gender output with RoBERTa

So far we have shown accuracy improvements using oracle gender, which indicate that the nbest lists often contain a gender-accurate hypothesis. In Table 3, we explore improvements when reranking with source gender determined automatically by a RoBERTa model. There is surprisingly little degradation in WinoMT accuracy when using RoBERTa coreference resolution compared to the oracle (Table 2). We hypothesise that sentences which are challenging for pronoun disambiguation will also be among the most challenging to translate, so the sentences where RoBERTa disambiguates wrongly are already 'wrong' even with oracle information.

We note that Saunders and Byrne (2020) achieved 81.7% and 68.4% WinoMT accuracy on German and Spanish respectively after rescoring with a separate model adapted to a synthetic occupations dataset. With beam size 20 we are able to achieve very similar results without any change to our single translation model, even using automatic coreference resolution.

One benefit of this approach that we have not explored is its application to incorporating context: the pronoun disambiguation information could potentially come from a previous or subsequent sentence. Another benefit is the approach's flexibility for introducing new gendered vocabulary: anything that can be both defined for reranking and disambiguated in the source sentence can be searched for in the beam without requiring retraining.
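A usage sketch for the automatic setting follows, assuming the pronoun-disambiguation interface documented in the fairseq RoBERTa WSC example linked in the footnote above; the checkpoint paths are illustrative, and mapping the returned span plus the pronoun's gender to a reranking label is glue code of our own.

# Sketch of automatic source gender labelling, assuming the fine-tuned
# RoBERTa WSC model and interface from the fairseq example
# (https://github.com/pytorch/fairseq/tree/master/examples/roberta/wsc).
from fairseq.models.roberta import RobertaModel

# Illustrative paths to a WSC-tuned checkpoint and its data directory.
roberta = RobertaModel.from_pretrained('checkpoints', 'checkpoint_best.pt', 'WSC/')
roberta.eval()

# Square brackets mark the pronoun to resolve; the model returns the
# coreferent span from the sentence.
entity = roberta.disambiguate_pronoun(
    'The cook prepared a dish for the teacher because [she] just learned a new dish.')

# The pronoun's gender then labels the returned entity for reranking,
# e.g. she -> 'female' for whichever entity was returned.
print(entity, '-> female')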
5 Conclusions

We have explored improving gender translation by correcting beam search. Our experiments demonstrate that applying target language constraints at inference time can encourage models to produce more gender-diverse hypotheses. Moreover, we show that simple reranking heuristics can extract more accurate gender translations from these nbest lists using either oracle information or automatic source coreference resolution.

Unlike other approaches to this problem, we do not attempt to counter unidentified and potentially intractable sources of bias in the training data, or produce new models. However, we view our schemes as orthogonal to model or data bias mitigation. If a model ranks diverse gender translations higher in the beam initially, finding better gender translations during beam search becomes simpler.
Impact statement

Where machine translation is used in people's lives, mistranslations have the potential to misrepresent people. This is the case when personal characteristics like social gender conflict with model biases towards certain forms of grammatical gender. As mentioned in the introduction, the result can involve implicit misgendering of a user or other human referent, or perpetuation of social biases about gender roles as represented in the translation. A user whose words are translated with gender defaults that imply they hold such biased views will also be misrepresented.

We attempt to avoid these failure modes by identifying translations which are at least consistent within the translation and with the source sentence. This is dependent on identifying grammatically gendered terms in the target language – however, this element is very flexible and can be updated for new gendered terminology. We note that models which do not account for variety in gender expression, such as neopronoun use, may not be capable of generating appropriate gender translations, but in principle a variety of gender translations could be extracted from the beam.

By removing the data augmentation, tuning and retraining elements from approaches that address gender translation, we seek to simplify the process and remove additional stages during which bias could be introduced or amplified (Shah et al., 2020).

References

Bashar Alhafni, Nizar Habash, and Houda Bouamor. 2020. Gender-aware reinflection using linguistically enhanced neural models. In Proceedings of the Second Workshop on Gender Bias in Natural Language Processing, pages 139–150, Barcelona, Spain (Online). Association for Computational Linguistics.

Loïc Barrault, Magdalena Biesialska, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Matthias Huck, Eric Joanis, Tom Kocmi, Philipp Koehn, Chi-kiu Lo, Nikola Ljubešić, Christof Monz, Makoto Morishita, Masaaki Nagata, Toshiaki Nakazawa, Santanu Pal, Matt Post, and Marcos Zampieri. 2020. Findings of the 2020 conference on machine translation (WMT20). In Proceedings of the Fifth Conference on Machine Translation, pages 1–55, Online. Association for Computational Linguistics.

Christine Basta, Marta R. Costa-jussà, and José A. R. Fonollosa. 2020. Towards mitigating gender bias in a decoder-based neural machine translation model by adding contextual information. In Proceedings of the Fourth Widening Natural Language Processing Workshop, pages 99–102, Seattle, USA. Association for Computational Linguistics.

Tolga Bolukbasi, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam T. Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Advances in Neural Information Processing Systems, pages 4349–4357.

Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A simple, fast, and effective reparameterization of IBM Model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 644–648, Atlanta, Georgia. Association for Computational Linguistics.

Joel Escudé Font and Marta R. Costa-jussà. 2019. Equalizing gender bias in neural machine translation with word embeddings techniques. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pages 147–154, Florence, Italy. Association for Computational Linguistics.

Kevin Gimpel, Dhruv Batra, Chris Dyer, and Gregory Shakhnarovich. 2013. A systematic exploration of diversity in machine translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1100–1111, Seattle, Washington, USA. Association for Computational Linguistics.

Nizar Habash, Houda Bouamor, and Christine Chung. 2019. Automatic gender identification and reinflection in Arabic. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pages 155–165, Florence, Italy. Association for Computational Linguistics.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 110–119, San Diego, California. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Lesly Miculicich Werlen and Andrei Popescu-Belis. 2017. Using coreference links to improve Spanish-to-English machine translation. In Proceedings of the 2nd Workshop on Coreference Resolution Beyond OntoNotes (CORBON 2017), pages 30–40, Valencia, Spain. Association for Computational Linguistics.

Amit Moryossef, Roee Aharoni, and Yoav Goldberg. 2019. Filling gender & number gaps in neural machine translation with black-box context injection. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pages 49–54, Florence, Italy. Association for Computational Linguistics.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.

Altaf Rahman and Vincent Ng. 2012. Resolving complex cases of definite pronouns: The Winograd Schema Challenge. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 777–789, Jeju Island, Korea. Association for Computational Linguistics.

Nicholas Roberts, Davis Liang, Graham Neubig, and Zachary C. Lipton. 2020. Decoding and diversity in machine translation. arXiv preprint arXiv:2011.13477.

Rachel Rudinger, Jason Naradowsky, Brian Leonard, and Benjamin Van Durme. 2018. Gender bias in coreference resolution. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 8–14, New Orleans, Louisiana. Association for Computational Linguistics.

Danielle Saunders and Bill Byrne. 2020. Reducing gender bias in neural machine translation as a domain adaptation problem. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7724–7736, Online. Association for Computational Linguistics.

Danielle Saunders, Rosie Sallis, and Bill Byrne. 2020. Neural machine translation doesn't translate gender coreference right unless you make it. In Proceedings of the Second Workshop on Gender Bias in Natural Language Processing, pages 35–43, Barcelona, Spain (Online). Association for Computational Linguistics.

Deven Santosh Shah, H. Andrew Schwartz, and Dirk Hovy. 2020. Predictive biases in natural language processing models: A conceptual framework and overview. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5248–5264, Online. Association for Computational Linguistics.

Artūrs Stafanovičs, Toms Bergmanis, and Mārcis Pinnis. 2020. Mitigating gender bias in machine translation with target gender annotations. In Proceedings of the Fifth Conference on Machine Translation (WMT).

Felix Stahlberg and Bill Byrne. 2019. On NMT search errors and model errors: Cat got your tongue? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3356–3362, Hong Kong, China. Association for Computational Linguistics.

Gabriel Stanovsky, Noah A. Smith, and Luke Zettlemoyer. 2019. Evaluating gender bias in machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1679–1684, Florence, Italy. Association for Computational Linguistics.

Md Arafat Sultan, Shubham Chandel, Ramón Fernandez Astudillo, and Vittorio Castelli. 2020. On the importance of diversity in question generation for QA. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5651–5656, Online. Association for Computational Linguistics.

Tony Sun, Andrew Gaut, Shirlyn Tang, Yuxin Huang, Mai ElSherief, Jieyu Zhao, Diba Mirza, Elizabeth Belding, Kai-Wei Chang, and William Yang Wang. 2019. Mitigating gender bias in natural language processing: Literature review. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1630–1640, Florence, Italy. Association for Computational Linguistics.

Tony Sun, Kellie Webster, Apu Shah, William Yang Wang, and Melvin Johnson. 2021. They, them, theirs: Rewriting with gender-neutral English. arXiv preprint arXiv:2102.06788.

Eva Vanmassenhove, Christian Hardmeier, and Andy Way. 2018. Getting gender right in neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3003–3008, Brussels, Belgium. Association for Computational Linguistics.

Eva Vanmassenhove, Dimitar Shterionov, and Andy Way. 2019. Lost in translation: Loss and decay of linguistic richness in machine translation. In Proceedings of Machine Translation Summit XVII Volume 1: Research Track, pages 222–232, Dublin, Ireland. European Association for Machine Translation.

Ashish Vaswani, Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan Gomez, Stephan Gouws, Llion Jones, Łukasz Kaiser, Nal Kalchbrenner, Niki Parmar, Ryan Sepassi, Noam Shazeer, and Jakob Uszkoreit. 2018. Tensor2Tensor for neural machine translation. In Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Papers), pages 193–199, Boston, MA. Association for Machine Translation in the Americas.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010.

Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2017. Men also like shopping: Reducing gender bias amplification using corpus-level constraints. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2979–2989, Copenhagen, Denmark. Association for Computational Linguistics.

Michał Ziemski, Marcin Junczys-Dowmunt, and Bruno Pouliquen. 2016. The United Nations Parallel Corpus v1.0. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pages 3530–3534, Portorož, Slovenia. European Language Resources Association (ELRA).

A Constrained vs unconstrained beams
The difference in gender diversity between constrained and unconstrained beam search can be observed by examining the beams. Table 4 (below) gives examples of 4-best lists for two WinoMT sentences translated into German. The lists are produced with and without gender constraints, and are ordered by model likelihood in both cases.

In both cases the best-scoring hypothesis fails to resolve gender coreference correctly, motivating the reranking scheme we describe in this paper. However, the effects of gender-constrained beam search become evident if we compare the 4-best lists. With regular unconstrained beam search, the model simply switches out more common lexical alternatives, such as different forms of the past tense. However, the grammatical genders of both entities in both examples are unchanged throughout the beam.

By contrast, using gender-constrained beam search constrains changes in the nbest list to gendered inflections of the original best-scoring translation, as we expect. The changes made under beam search vary the grammatical genders of articles and entities, resulting in more gender-diverse beams. We note that in the second example, the definite articles vary in terms of gender but the translated profession terms only vary in one case out of eight. This behaviour is mitigated with larger beam sizes.
English source: The cook prepared a dish for the teacher because she just learned a new dish.
Unconstrained:
  Der Koch bereitete ein Gericht für die Lehrerin, weil sie gerade ein neues Gericht gelernt.
  Der Koch bereitete ein Gericht für die Lehrerin vor, weil sie gerade ein neues Gericht gelernt hatte.
  Der Koch bereitete ein Gericht für die Lehrerin vor, weil sie gerade ein neues Gericht gelernt hat.
  Der Koch bereitete ein Gericht für die Lehrerin vor, weil sie gerade ein neues Gericht erlernte.
Constrained:
  Der Koch bereitete ein Gericht für die Lehrerin, weil sie gerade ein neues Gericht gelernt.
  Der Koch bereitete ein Gericht für den Lehrer, weil sie gerade ein neues Gericht gelernt.
  Die Köchin bereitete ein Gericht für die Lehrerin, weil sie gerade ein neues Gericht gelernt.
  Die Köchin bereitete ein Gericht für die Lehrerin, weil sie gerade eine neues Gericht gelernt.

English source: The cook prepared a dish for the teacher because he is hungry.
Unconstrained:
  Der Koch bereitete ein Gericht für den Lehrer, weil er hungrig ist.
  Der Koch bereitete dem Lehrer ein Gericht vor, weil er hungrig ist.
  Der Koch bereitete ein Gericht für den Lehrer vor, weil er hungrig ist.
  Der Koch bereitete ein Gericht für den Lehrer, weil er hungrig.
Constrained:
  Der Koch bereitete ein Gericht für den Lehrer, weil er hungrig ist.
  Die Koch bereitete ein Gericht für den Lehrer, weil er hungrig ist.
  Der Koch bereitete ein Gericht für die Lehrer, weil er hungrig ist.
  Die Köchin bereitete ein Gericht für den Lehrer, weil er hungrig ist.

Table 4: Results of beam-4 search for English-German translation for two WinoMT sentences, with and without gender constraints. 4-best lists are ordered by likelihood under the NMT model.