First the worst: Finding better gender translations during beam search

Danielle Saunders and Rosie Sallis and Bill Byrne
Department of Engineering, University of Cambridge, UK
{ds636, rs965, wjb31}@cam.ac.uk

Abstract

Neural machine translation inference procedures like beam search generate the most likely output under the model. This can exacerbate any demographic biases exhibited by the model. We focus on gender bias resulting from systematic errors in grammatical gender translation, which can lead to human referents being misrepresented or misgendered.

Most approaches to this problem adjust the training data or the model. By contrast, we experiment with simply adjusting the inference procedure: we rerank nbest lists using gender features obtained automatically from the source sentence, and apply gender constraints while decoding to improve nbest list gender diversity. We find that a combination of these techniques allows large gains in WinoMT accuracy without requiring additional bilingual data or an additional NMT model.

1 Introduction

Neural language generation models optimized by likelihood have a tendency towards 'safe' word choice. This lack of output diversity has been noted in machine translation, both statistical (Gimpel et al., 2013) and neural (Vanmassenhove et al., 2019), as well as in other areas of language generation such as dialog modelling (Li et al., 2016) and question generation (Sultan et al., 2020). Model-generated language may be predictable, repetitive or stilted. More insidiously, generating the most likely output based on corpus statistics can amplify any existing biases in the corpus (Zhao et al., 2017).

Potential harms arise when these biases around e.g. word choice or grammatical gender inflections reflect demographic or social biases (Sun et al., 2019). In the context of machine translation, the resulting gender mistranslations could involve implicit misgendering of a user or other human referent, or perpetuation of social stereotypes about the 'typical' gender of a referent in a given context.

The current dominant approaches to the problem involve model retraining (Vanmassenhove et al., 2018; Escudé Font and Costa-jussà, 2019; Stafanovičs et al., 2020) or tuning on targeted gendered bilingual data (Saunders and Byrne, 2020; Basta et al., 2020) – all of which risk introducing new sources of bias (Shah et al., 2020).

By contrast, we seek to find more diverse and correct translations from existing models, biased as they may be. Recently, Roberts et al. (2020) have implicated beam search as a driver towards gender mistranslation and gender bias amplification – we aim to mitigate this by guiding NMT inference towards finding better gender translations.

Our contributions are as follows: first, we analyze the nbest lists of NMT models exhibiting gender bias to determine whether more gender-diverse hypotheses can be found in the beam, and conclude that they often cannot. We then generate new nbest lists subject to gendered inflection constraints, and show that in this case the nbest lists have greater gender diversity. Finally, we demonstrate that it is possible to extract translations with the correct gendered entities from the original model's beam.

Our approach involves no changes to the model or training data, and requires only monolingual resources for the source language (identifying gendered information in the source sentence) and the target language (gendered inflections).

1.1 Related work

Various approaches have been explored for mitigating gender bias in natural language processing generally, and in machine translation specifically. Most involve some adjustment of the training data, either directly (Rudinger et al., 2018) or by controlling word embeddings (Bolukbasi et al., 2016). We view such approaches as complementary to our own investigation, which seeks a better approach to generating diverse translations during machine translation inference. The approaches most relevant to this work are gender control, and the treatment of gender errors as a correction problem.
Controlling gender translation generally involves introducing external information into the model. Miculicich Werlen and Popescu-Belis (2017) integrate cross-sentence coreference information into hypothesis reranking to improve pronoun translation. Vanmassenhove et al. (2018) incorporate sentence-level 'speaker gender' tags into training data, while Moryossef et al. (2019) incorporate a sentence-level gender feature during inference. Stafanovičs et al. (2020) include factored annotations of target grammatical gender for all source words, while Saunders et al. (2020) incorporate inline source sentence gender tags for specific entities. As for much of this prior work, our focus is not on the discovery of the correct gender tag, but on possible applications for this information once it is obtained.

A separate line of work treats the presence of gender-related inconsistencies or inaccuracies as a search and correction problem. Roberts et al. (2020) find that a sampling search procedure results in less misgendering than beam search. Saunders and Byrne (2020) rescore translations with a fine-tuned model that has incorporated additional gender sensitivity, while constraining rescoring to a gender-inflection lattice. A related line of work, less directly applicable to translation, focuses on reinflecting monolingual sentences to reduce the likelihood of bias (Habash et al., 2019; Alhafni et al., 2020; Sun et al., 2021). An important distinction between our work and earlier reinflection approaches is that we use only a single model for each language pair, significantly reducing computation and complexity.

2 Finding consistent gender in the beam

There are two elements to our proposed approach. First, we produce an nbest list using our baseline model. We either use simple beam search or a two-pass approach where the second pass searches for differently-gendered versions of the best first-pass hypothesis. Next, once we have our set of translation hypotheses, we rerank them according to gender criteria and extract the translation which best fulfils our requirements.

2.1 Gender-constrained beam search

We follow the gendered-inflection lattice search scheme as described in Saunders and Byrne (2020). To summarize briefly, a 'gender inflection lattice' is constructed for the target language, defining grammatically gendered reinflection options for words in a large vocabulary list. For example, in the Spanish lattice, the definite articles la and el can optionally be interchanged, as can the profession nouns médica and médico. Composing the inflection lattice with the original translation provides gendered vocabulary constraints that can be applied during a second-pass beam search, as illustrated in Figure 1. In practice the lattices are constructed using heuristics that could result in spurious arcs like 'médic', but these are expected to have low likelihood under a translation model.

[Figure 1 (not reproduced): Toy illustration of gender constraints for first-pass translation 'el médico' or 'la médica'.]

Unlike the original authors, we do not rescore the translations with a separate gender-sensitive translation model. Instead, we use the baseline model to produce gendered variations of its own biased hypotheses. This removes the additional experimental complexity and risk of bias amplification resulting from tuning a model for effective gender translation, as well as the computational load of storing the separate model.

While we carry out two full inference passes for simplicity of implementation, we note that further efficiency improvements are possible. For example, the source sentence encoding could be saved and reused for the reinflection pass. In principle, a more complex beam search implementation could apply constraints during the first inference pass, negating the need for two passes. These efficiency gains would not be possible if using a separate NMT model to rescore or reinflect the translation.
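To make the constraint scheme concrete, the sketch below enumerates the gendered variants a toy inflection map permits for the Figure 1 example. The INFLECTIONS entries and helper names are illustrative stand-ins for the heuristically-built lattices of Saunders and Byrne (2020); a real implementation would apply the per-token options as constraints inside the second-pass beam search rather than enumerating full strings.

# Minimal sketch of gender reinflection constraints, assuming a toy
# hand-written inflection map rather than the full heuristic lattices.
from itertools import product

# Each token maps to the set of grammatically gendered alternatives the
# second-pass search may choose between (toy entries for illustration).
INFLECTIONS = {
    "el": {"el", "la"}, "la": {"el", "la"},
    "médico": {"médico", "médica"}, "médica": {"médico", "médica"},
}

def reinflection_options(translation):
    """Per-token constraint sets derived from the first-pass translation."""
    return [sorted(INFLECTIONS.get(tok, {tok})) for tok in translation.split()]

def all_reinflections(translation):
    """Enumerate every translation reachable under the constraints; the
    second pass rescores these with the baseline model itself."""
    return [" ".join(toks) for toks in product(*reinflection_options(translation))]

print(all_reinflections("el médico"))
# ['el médica', 'el médico', 'la médica', 'la médico']

Note that spurious combinations such as 'el médica' are reachable, mirroring the spurious lattice arcs mentioned above; as in the lattice case, these are expected to score poorly under the model.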
2.2 Reranking gendered translations

We focus on the translation of sentences containing entities that are coreferent with gendered pronouns. Our method for reranking gendered translations uses information about the gender of entities in the source sentence. In the WinoMT challenge set, each sentence's primary entity is labelled – we refer to the use of this information as 'oracle' reranking. Similar known-gender information could be defined by a system user. As an alternative, we can use an off-the-shelf coreference resolution tool on the source language designed for pronoun disambiguation challenges (e.g. Rahman and Ng, 2012).

Once the pronoun-coreferent source word is identified, we can find the aligned target-language hypothesis word, similar to the approach in Stafanovičs et al. (2020). We determine the grammatical gender of the hypothesis by morphological analysis, either using hard-coded rules or off-the-shelf tools depending on the language, as in Stanovsky et al. (2019). We proceed to rerank the list of hypotheses first by article gender agreement with the source pronoun, and then by log likelihood. If there are no gender-agreeing hypotheses, we default to log likelihood alone [1].
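A minimal sketch of this reranking heuristic follows. The helpers (TOY_GENDER, get_grammatical_gender, aligned_index) are hypothetical stand-ins: in our setup the alignment comes from fast_align and the gender analysis from the language-specific morphological tools described in Section 3. For simplicity the sketch assumes the aligned word sits at the same position in every hypothesis; in practice the alignment is computed per hypothesis.

# Sketch of gender-based nbest reranking, assuming each hypothesis is a
# (tokens, log_likelihood) pair and the target word aligned to the
# gendered source entity has already been identified.

# Toy morphological analyser for the German examples in this paper; a
# hypothetical stand-in for language-specific rules or tools.
TOY_GENDER = {"Koch": "male", "Köchin": "female",
              "Lehrer": "male", "Lehrerin": "female"}

def get_grammatical_gender(token):
    return TOY_GENDER.get(token, "unknown")

def rerank(nbest, source_gender, aligned_index):
    """Prefer hypotheses whose aligned word agrees with the source
    pronoun's gender; break ties by log likelihood. Fall back to log
    likelihood alone if no hypothesis agrees."""
    agreeing = [h for h in nbest
                if get_grammatical_gender(h[0][aligned_index]) == source_gender]
    pool = agreeing if agreeing else nbest
    return max(pool, key=lambda h: h[1])[0]

nbest = [("Der Koch bereitete ein Gericht vor".split(), -1.2),
         ("Die Köchin bereitete ein Gericht vor".split(), -1.9)]
print(" ".join(rerank(nbest, "female", aligned_index=1)))
# Die Köchin bereitete ein Gericht vor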
3 Experimental setup

3.1 Baseline models and data

We assess translation from English into German and Spanish. All models are Transformer 'base' models with 6 layers (Vaswani et al., 2017), trained using Tensor2Tensor with batch size 4K (Vaswani et al., 2018). The en-de model is trained on 17.6M sentences from WMT news tasks for 300K iterations (Barrault et al., 2020), and the en-es model on 10M sentences from the UN Corpus for 150K iterations (Ziemski et al., 2016).

We assess translation quality in terms of BLEU on a separate, untagged general test set. We report cased BLEU on the test sets from WMT18 (en-de), WMT13 (en-es) and IWSLT14 (en-he) using SacreBLEU [2] (Post, 2018).

3.2 Gender-constrained search and reranking

At inference time we perform beam search with and without reranking and gender-inflected lattice constraints. We construct gender reinflection lattices for rescoring as proposed by Saunders and Byrne (2020) [3]. For reranking, we use fast_align (Dyer et al., 2013) to link the source sentence gender information to the target sentence. When analysing the gender of target-language words, we use the same morphological analysis as for the corresponding languages in WinoMT evaluation.

When obtaining gender information automatically for reranking, we use the RoBERTa model (Liu et al., 2019) tuned for pronoun disambiguation on Winograd Schema Challenge data [4].

[1] Code to be made available.
[2] SacreBLEU signature: BLEU+case.mixed+numrefs.1+smooth.exp+tok.13a+v.1.4.8
[3] Scripts and data for lattice construction made available by the authors of Saunders and Byrne (2020) at https://github.com/DCSaunders/gender-debias
[4] Model from https://github.com/pytorch/fairseq/tree/master/examples/roberta/wsc

We assess gender translation on WinoMT (Stanovsky et al., 2019), as it is the only gender challenge set we are currently aware of for both language pairs. Our metrics are overall accuracy (higher is better) and ∆G, a metric measuring the F1 score difference between male- and female-labelled sentences (closer to 0 is better).
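For reference, the sketch below shows one way these metrics can be computed, assuming one 'male'/'female' gold label and one predicted label per WinoMT sentence, and taking ∆G as F1 with 'male' as the positive class minus F1 with 'female' as the positive class; the function names are illustrative rather than part of the WinoMT tooling.

# Sketch of the evaluation metrics: WinoMT-style accuracy and ∆G.
def f1(gold, pred, positive):
    """F1 score treating `positive` as the positive class."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def winomt_metrics(gold, pred):
    """Overall accuracy (higher is better) and ∆G (closer to 0 is better)."""
    accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    delta_g = f1(gold, pred, "male") - f1(gold, pred, "female")
    return accuracy, delta_g

acc, dg = winomt_metrics(["male", "male", "female", "female"],
                         ["male", "male", "male", "female"])
print(f"accuracy={acc:.2f}, deltaG={dg:.2f}")
# accuracy=0.75, deltaG=0.13  (positive ∆G indicates a male skew)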
4 Results and discussion

Beam   Gender search   en-de BLEU   en-es BLEU
4      ×               42.7         27.5
4      ✓               42.7         27.8
20     ×               42.3         27.3
20     ✓               42.7         27.8

Table 1: Cased BLEU on the generic test sets under different beam sizes and search procedures. Gender-constrained rescoring by the original model does not hurt BLEU, and can improve it.

4.1 Normal beam search has some gender diversity

We first evaluate beam search without any gender reranking. In Table 1 we assess BLEU for beam search with beam sizes 4 and 20, with and without gender constraints. Beam search with the larger beam of 20 and without gender constraints degrades BLEU, in line with the findings of Stahlberg and Byrne (2019). In Table 2 we see that WinoMT gender accuracy also degrades with a larger beam. However, our reranking procedure is able to find better gender translations for WinoMT sentences in the unconstrained beam. With beam size 4, WinoMT accuracy improves by 10% relative to the original scores, and ∆G improves by an even larger margin. Improvements are much greater when reranking even an unconstrained 20-best list, suggesting that standard beam search may exhibit gender diversity, if not necessarily in the highest-likelihood hypotheses.

4.2 We can find better gender translations in the constrained beam

We next experiment with applying gender constraints to a second-pass beam search. We note that constrained beam search mitigates BLEU degradation (Table 1) – a useful side-effect.

Beam   Oracle reranking   Gender search   en-de Acc   en-de ∆G   en-es Acc   en-es ∆G
4      ×                  ×               60.1        18.6       47.8        38.4
4      ✓                  ×               66.8        10.0       53.9        27.8
4      ✓                  ✓               77.4         0.0       56.2        22.8
20     ×                  ×               59.0        20.1       46.4        40.7
20     ✓                  ×               75.2         1.8       63.6        13.8
20     ✓                  ✓               84.5        -3.1       68.0         7.6

Table 2: WinoMT primary-entity accuracy (Acc) and male/female F1 difference ∆G under various search and ranking procedures. Gender search without reranking would be similar to the baseline, as gender constraints are constructed on the highest-likelihood hypothesis from normal beam-4 search. Table 1 demonstrates BLEU under constrained search without reranking: the scores are extremely close to the baselines.

Beam   RoBERTa reranking   Gender search   en-de Acc   en-de ∆G   en-es Acc   en-es ∆G
4      ×                   ×               60.1        18.6       47.8        38.4
4      ✓                   ×               66.4        10.4       53.0        28.6
4      ✓                   ✓               76.1         1.0       55.2        24.1
20     ×                   ×               59.0        20.1       46.4        40.7
20     ✓                   ×               73.9         2.7       60.9        15.6
20     ✓                   ✓               82.3        -2.2       65.6         9.8

Table 3: WinoMT primary-entity accuracy (Acc) and male/female WinoMT sentence F1 difference ∆G under various search and ranking procedures. Baseline scores with no reranking or gender search are duplicated from Table 2. Gender labels for reranking are determined automatically from the source sentence by a RoBERTa model.

For gender translation, Table 2 shows WinoMT accuracy improving dramatically relative to the baseline scores after gender-constrained search and reranking. In the appendix we give examples of 4-best lists generated with and without gender constraints, highlighting why this is the case.

4.3 Controlling gender output with RoBERTa

So far we have shown accuracy improvements using oracle gender, which indicate that the nbest lists often contain a gender-accurate hypothesis. In Table 3, we explore improvements when reranking with source gender determined automatically by a RoBERTa model. There is surprisingly little degradation in WinoMT accuracy when using RoBERTa coreference resolution compared to the oracle (Table 2). We hypothesise that sentences which are challenging for pronoun disambiguation will also be among the most challenging to translate, so the sentences where RoBERTa disambiguates wrongly are already 'wrong' even with oracle information.

We note that Saunders and Byrne (2020) achieved 81.7% and 68.4% WinoMT accuracy on German and Spanish respectively after rescoring with a separate model adapted to a synthetic occupations dataset. With beam size 20 we are able to achieve very similar results without any change to our single translation model, even using automatic coreference resolution.

One benefit of this approach that we have not explored is its application to incorporating context: the pronoun disambiguation information could potentially come from a previous or subsequent sentence. Another benefit is the approach's flexibility for introducing new gendered vocabulary: anything that can be both defined for reranking and disambiguated in the source sentence can be searched for in the beam without requiring retraining.
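A usage sketch for the automatic setting follows, assuming the pronoun-disambiguation interface documented in the fairseq RoBERTa WSC example linked in the footnote above; the checkpoint paths are illustrative, and mapping the returned span plus the pronoun's gender to a reranking label is glue code of our own.

# Sketch of automatic source gender labelling, assuming the fine-tuned
# RoBERTa WSC model and interface from the fairseq example
# (https://github.com/pytorch/fairseq/tree/master/examples/roberta/wsc).
from fairseq.models.roberta import RobertaModel

# Illustrative paths to a WSC-tuned checkpoint and its data directory.
roberta = RobertaModel.from_pretrained('checkpoints', 'checkpoint_best.pt', 'WSC/')
roberta.eval()

# Square brackets mark the pronoun to resolve; the model returns the
# coreferent span from the sentence.
entity = roberta.disambiguate_pronoun(
    'The cook prepared a dish for the teacher because [she] just learned a new dish.')

# The pronoun's gender then labels the returned entity for reranking,
# e.g. she -> 'female' for whichever entity was returned.
print(entity, '-> female')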
5 Conclusions

We have explored improving gender translation by correcting beam search. Our experiments demonstrate that applying target language constraints at inference time can encourage models to produce more gender-diverse hypotheses. Moreover, we show that simple reranking heuristics can extract more accurate gender translations from these nbest lists using either oracle information or automatic source coreference resolution.

Unlike other approaches to this problem, we do not attempt to counter unidentified and potentially intractable sources of bias in the training data, or produce new models. However, we view our schemes as orthogonal to model or data bias mitigation. If a model ranks diverse gender translations higher in the beam initially, finding better gender translations during beam search becomes simpler.
Impact statement

Where machine translation is used in people's lives, mistranslations have the potential to misrepresent people. This is the case when personal characteristics like social gender conflict with model biases towards certain forms of grammatical gender. As mentioned in the introduction, the result can involve implicit misgendering of a user or other human referent, or perpetuation of social biases about gender roles as represented in the translation. A user whose words are translated with gender defaults that imply they hold such biased views will also be misrepresented.

We attempt to avoid these failure modes by identifying translations which are at least consistent within the translation and with the source sentence. This is dependent on identifying grammatically gendered terms in the target language – however, this element is very flexible and can be updated for new gendered terminology. We note that models which do not account for variety in gender expression, such as neopronoun use, may not be capable of generating appropriate gender translations, but in principle a variety of gender translations could be extracted from the beam.

By removing the data augmentation, tuning and retraining elements from approaches that address gender translation, we seek to simplify the process and remove additional stages during which bias could be introduced or amplified (Shah et al., 2020).

References

Bashar Alhafni, Nizar Habash, and Houda Bouamor. 2020. Gender-aware reinflection using linguistically enhanced neural models. In Proceedings of the Second Workshop on Gender Bias in Natural Language Processing, pages 139–150, Barcelona, Spain (Online). Association for Computational Linguistics.

Loïc Barrault, Magdalena Biesialska, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Matthias Huck, Eric Joanis, Tom Kocmi, Philipp Koehn, Chi-kiu Lo, Nikola Ljubešić, Christof Monz, Makoto Morishita, Masaaki Nagata, Toshiaki Nakazawa, Santanu Pal, Matt Post, and Marcos Zampieri. 2020. Findings of the 2020 conference on machine translation (WMT20). In Proceedings of the Fifth Conference on Machine Translation, pages 1–55, Online. Association for Computational Linguistics.

Christine Basta, Marta R. Costa-jussà, and José A. R. Fonollosa. 2020. Towards mitigating gender bias in a decoder-based neural machine translation model by adding contextual information. In Proceedings of the Fourth Widening Natural Language Processing Workshop, pages 99–102, Seattle, USA. Association for Computational Linguistics.

Tolga Bolukbasi, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam T. Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Advances in Neural Information Processing Systems, pages 4349–4357.

Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A simple, fast, and effective reparameterization of IBM Model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 644–648, Atlanta, Georgia. Association for Computational Linguistics.

Joel Escudé Font and Marta R. Costa-jussà. 2019. Equalizing gender bias in neural machine translation with word embeddings techniques. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pages 147–154, Florence, Italy. Association for Computational Linguistics.

Kevin Gimpel, Dhruv Batra, Chris Dyer, and Gregory Shakhnarovich. 2013. A systematic exploration of diversity in machine translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1100–1111, Seattle, Washington, USA. Association for Computational Linguistics.

Nizar Habash, Houda Bouamor, and Christine Chung. 2019. Automatic gender identification and reinflection in Arabic. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pages 155–165, Florence, Italy. Association for Computational Linguistics.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 110–119, San Diego, California. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Lesly Miculicich Werlen and Andrei Popescu-Belis. 2017. Using coreference links to improve Spanish-to-English machine translation. In Proceedings of the 2nd Workshop on Coreference Resolution Beyond OntoNotes (CORBON 2017), pages 30–40, Valencia, Spain. Association for Computational Linguistics.

Amit Moryossef, Roee Aharoni, and Yoav Goldberg. 2019. Filling gender & number gaps in neural machine translation with black-box context injection. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pages 49–54, Florence, Italy. Association for Computational Linguistics.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.

Altaf Rahman and Vincent Ng. 2012. Resolving complex cases of definite pronouns: The Winograd Schema Challenge. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 777–789, Jeju Island, Korea. Association for Computational Linguistics.

Nicholas Roberts, Davis Liang, Graham Neubig, and Zachary C. Lipton. 2020. Decoding and diversity in machine translation. arXiv preprint arXiv:2011.13477.

Rachel Rudinger, Jason Naradowsky, Brian Leonard, and Benjamin Van Durme. 2018. Gender bias in coreference resolution. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 8–14, New Orleans, Louisiana. Association for Computational Linguistics.

Danielle Saunders and Bill Byrne. 2020. Reducing gender bias in neural machine translation as a domain adaptation problem. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7724–7736, Online. Association for Computational Linguistics.

Danielle Saunders, Rosie Sallis, and Bill Byrne. 2020. Neural machine translation doesn't translate gender coreference right unless you make it. In Proceedings of the Second Workshop on Gender Bias in Natural Language Processing, pages 35–43, Barcelona, Spain (Online). Association for Computational Linguistics.

Deven Santosh Shah, H. Andrew Schwartz, and Dirk Hovy. 2020. Predictive biases in natural language processing models: A conceptual framework and overview. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5248–5264, Online. Association for Computational Linguistics.

Artūrs Stafanovičs, Toms Bergmanis, and Mārcis Pinnis. 2020. Mitigating gender bias in machine translation with target gender annotations. In Proceedings of the Fifth Conference on Machine Translation (WMT).

Felix Stahlberg and Bill Byrne. 2019. On NMT search errors and model errors: Cat got your tongue? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3356–3362, Hong Kong, China. Association for Computational Linguistics.

Gabriel Stanovsky, Noah A. Smith, and Luke Zettlemoyer. 2019. Evaluating gender bias in machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1679–1684, Florence, Italy. Association for Computational Linguistics.

Md Arafat Sultan, Shubham Chandel, Ramón Fernandez Astudillo, and Vittorio Castelli. 2020. On the importance of diversity in question generation for QA. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5651–5656, Online. Association for Computational Linguistics.

Tony Sun, Andrew Gaut, Shirlyn Tang, Yuxin Huang, Mai ElSherief, Jieyu Zhao, Diba Mirza, Elizabeth Belding, Kai-Wei Chang, and William Yang Wang. 2019. Mitigating gender bias in natural language processing: Literature review. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1630–1640, Florence, Italy. Association for Computational Linguistics.

Tony Sun, Kellie Webster, Apu Shah, William Yang Wang, and Melvin Johnson. 2021. They, them, theirs: Rewriting with gender-neutral English. arXiv preprint arXiv:2102.06788.

Eva Vanmassenhove, Christian Hardmeier, and Andy Way. 2018. Getting gender right in neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3003–3008, Brussels, Belgium. Association for Computational Linguistics.

Eva Vanmassenhove, Dimitar Shterionov, and Andy Way. 2019. Lost in translation: Loss and decay of linguistic richness in machine translation. In Proceedings of Machine Translation Summit XVII Volume 1: Research Track, pages 222–232, Dublin, Ireland. European Association for Machine Translation.

Ashish Vaswani, Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan Gomez, Stephan Gouws, Llion Jones, Łukasz Kaiser, Nal Kalchbrenner, Niki Parmar, Ryan Sepassi, Noam Shazeer, and Jakob Uszkoreit. 2018. Tensor2Tensor for neural machine translation. In Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Papers), pages 193–199, Boston, MA. Association for Machine Translation in the Americas.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010.

Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2017. Men also like shopping: Reducing gender bias amplification using corpus-level constraints. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2979–2989, Copenhagen, Denmark. Association for Computational Linguistics.

Michał Ziemski, Marcin Junczys-Dowmunt, and Bruno Pouliquen. 2016. The United Nations Parallel Corpus v1.0. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pages 3530–3534, Portorož, Slovenia. European Language Resources Association (ELRA).

A Constrained vs unconstrained beams
The difference in gender diversity between constrained and unconstrained beam search can be observed by examining the beams. Table 4 (below) gives examples of 4-best lists for two WinoMT sentences translated into German. The lists are produced with and without gender constraints, and are ordered by model likelihood in both cases.

In both cases the best-scoring hypothesis fails to resolve gender coreference correctly, motivating the reranking scheme we describe in this paper. However, the effects of gender-constrained beam search become evident if we compare the 4-best lists. With regular unconstrained beam search, the model simply switches out more common lexical alternatives, such as different forms of the past tense. However, the grammatical genders of both entities in both examples are unchanged throughout the beam.

By contrast, using gender-constrained beam search constrains changes in the nbest list to gendered inflections of the original best-scoring translation, as we expect. The changes made under beam search vary the grammatical genders of articles and entities, resulting in more gender-diverse beams. We note that in the second example, the definite articles vary in terms of gender but the translated profession terms only vary in one case out of eight. This behaviour is mitigated with larger beam sizes.
English source: The cook prepared a dish for the teacher because she just learned a new dish.
Unconstrained:
  Der Koch bereitete ein Gericht für die Lehrerin, weil sie gerade ein neues Gericht gelernt.
  Der Koch bereitete ein Gericht für die Lehrerin vor, weil sie gerade ein neues Gericht gelernt hatte.
  Der Koch bereitete ein Gericht für die Lehrerin vor, weil sie gerade ein neues Gericht gelernt hat.
  Der Koch bereitete ein Gericht für die Lehrerin vor, weil sie gerade ein neues Gericht erlernte.
Constrained:
  Der Koch bereitete ein Gericht für die Lehrerin, weil sie gerade ein neues Gericht gelernt.
  Der Koch bereitete ein Gericht für den Lehrer, weil sie gerade ein neues Gericht gelernt.
  Die Köchin bereitete ein Gericht für die Lehrerin, weil sie gerade ein neues Gericht gelernt.
  Die Köchin bereitete ein Gericht für die Lehrerin, weil sie gerade eine neues Gericht gelernt.

English source: The cook prepared a dish for the teacher because he is hungry.
Unconstrained:
  Der Koch bereitete ein Gericht für den Lehrer, weil er hungrig ist.
  Der Koch bereitete dem Lehrer ein Gericht vor, weil er hungrig ist.
  Der Koch bereitete ein Gericht für den Lehrer vor, weil er hungrig ist.
  Der Koch bereitete ein Gericht für den Lehrer, weil er hungrig.
Constrained:
  Der Koch bereitete ein Gericht für den Lehrer, weil er hungrig ist.
  Die Koch bereitete ein Gericht für den Lehrer, weil er hungrig ist.
  Der Koch bereitete ein Gericht für die Lehrer, weil er hungrig ist.
  Die Köchin bereitete ein Gericht für den Lehrer, weil er hungrig ist.

Table 4: Results of beam-4 search for English-German translation for two WinoMT sentences, with and without gender constraints. 4-best lists are ordered by likelihood under the NMT model.