Gender Bias in Machine Translation
Beatrice Savoldi 1,2, Marco Gaido 1,2, Luisa Bentivogli 2, Matteo Negri 2, Marco Turchi 2
1 University of Trento
2 Fondazione Bruno Kessler
{bsavoldi,mgaido,bentivo,negri,turchi}@fbk.eu

Abstract

Machine translation (MT) technology has facilitated our daily tasks by providing accessible shortcuts for gathering, processing and communicating information. However, it can suffer from biases that harm users and society at large. As a relatively new field of inquiry, studies of gender bias in MT still lack cohesion, which advocates for a unified framework to ease future research. To this end, we: i) critically review current conceptualizations of bias in light of theoretical insights from related disciplines, ii) summarize previous analyses aimed at assessing gender bias in MT, iii) discuss the mitigating strategies proposed so far, and iv) point toward potential directions for future work.

1 Introduction

Interest in understanding, assessing, and mitigating gender bias is steadily growing within the natural language processing (NLP) community, with recent studies showing how gender disparities affect language technologies. Sometimes, for example, coreference resolution systems fail to recognize women doctors (Zhao et al., 2017; Rudinger et al., 2018), image captioning models do not detect women sitting next to a computer (Hendricks et al., 2018), and automatic speech recognition works better with male voices (Tatman, 2017). Despite a prior disregard for such a phenomenon within research agendas (Cislak et al., 2018), it is now widely recognized that NLP tools encode and reflect controversial social asymmetries for many seemingly neutral tasks, machine translation (MT) included. Admittedly, the problem is not completely new (Frank et al., 2004). A few years ago, Schiebinger (2014) denounced the phenomenon of "masculine default" in MT after running one of her interviews through a commercial translation system. In spite of several feminine mentions in the text, she was repeatedly referred to by masculine pronouns. Gender-related concerns have also been voiced by online MT users, noting how commercial systems further entrench social gender expectations, e.g. they tend to translate engineers as masculine and nurses as feminine (Olson, 2018).

With language technologies entering widespread use and being deployed at a massive scale, their societal impact has raised concern both within (Hovy and Spruit, 2016; Bender et al., 2021) and outside (Dastin, 2018) the scientific community. To take stock of the situation, Sun et al. (2019) reviewed NLP studies on the topic. However, their survey is based on monolingual applications, whose underlying assumptions and solutions may not be directly applicable to languages other than English (Zhou et al., 2019; Zhao et al., 2020; Takeshita et al., 2020) and cross-lingual settings. Moreover, MT is a multifaceted task, which requires resolving multiple gender-related subtasks at the same time (e.g. coreference resolution, named entity recognition). Hence, depending on the languages involved and the factors accounted for, gender bias has been conceptualized differently across studies. To date, gender bias in MT has been tackled by means of a narrow, problem-solving oriented approach. While technical countermeasures are needed, failing to adopt a wider perspective and engage with related literature outside of NLP can be detrimental to the advancement of the field (Blodgett et al., 2020).

In this paper, we intend to put such literature to use for the study of gender bias in MT. We go beyond surveys restricted to monolingual NLP (Sun et al., 2019) or more limited in scope (Costa-jussà, 2019; Monti, 2020), and present the first comprehensive review of gender bias in MT. In particular, we 1) offer a unified framework that introduces the concepts, sources, and effects of bias in MT, clarified in light of relevant notions on the relation between gender and different languages; 2) critically discuss the state of the research, identifying blind spots as well as present and future key challenges.
2 Bias statement

In cognitive science, bias refers to the result of psychological heuristics, i.e. mental shortcuts that can be critical to support prompt reactions (Tversky and Kahneman, 1973, 1974). AI research borrowed from such tradition (Rich and Gureckis, 2019; Rahwan et al., 2019) and conceived bias as the divergence from an ideal or expected value (Glymour and Herington, 2019; Shah et al., 2020), which can occur if models rely on spurious cues and unintended shortcut strategies to predict outputs (Schuster et al., 2019; McCoy et al., 2019; Geirhos et al., 2020). Since this can lead to systematic errors or adverse social effects, bias investigation is not only a scientific and technical endeavour, but also an ethical one, given the growing societal role of NLP applications (Bender and Friedman, 2018).

As Blodgett et al. (2020) recently called out, and as has been endorsed in other venues (Hardmeier et al., 2021), analysing bias is an inherently normative process, which requires identifying what is deemed a harmful behavior, how, and to whom. Hereby, drawing on the human-centered approach of Value Sensitive Design (Friedman and Hendry, 2019), we consider as biased an MT model that systematically and unfairly discriminates against certain individuals or groups in favour of others (Friedman and Nissenbaum, 1996). We identify bias with respect to specific model behaviors, which are assessed by envisaging their potential risks when the model is deployed (Bender et al., 2021) and the harms that could ensue (Crawford, 2017), with people in focus (Bender, 2019). Along this line, with MT daily deployed for a range of use cases by millions of individuals, there are several contexts and people that could be impacted. As a guide, we rely on Crawford (2017), who defines two main categories of harms produced by a biased system: i) Representational (R) – i.e. detraction from the representation of social groups and their identity, which, in turn, affects attitudes and beliefs; ii) Allocational (A) – i.e. a system allocates or withholds opportunities or resources to certain groups.

In the MT literature reviewed in this paper, (R) can be distinguished into under-representation and stereotyping. The former refers to the reduction of the visibility of a social group through language, e.g. by i) producing a disproportionately low representation of women, ii) not recognizing the existence of non-binary individuals, and iii) failing to reflect their identity and communicative repertoires. Considering the latter case, by fostering the visibility of the way of speaking of the dominant group, users can presume that such language represents the most appropriate or prestigious variant.1 Stereotyping regards the propagation of negative generalizations of a social group, e.g. belittling feminine representation to less prestigious occupations (teacher (F) vs. lecturer (M)), or in association with attractiveness judgments (pretty lecturer (F)). Such behaviors are harmful as they can directly affect the self-esteem of members of the target group (Bourguignon et al., 2015).

1 For an analogy on how technology shaped the perception of feminine voices as shrill and immature see (Tallon, 2019).

The ubiquitous embedding of MT in web applications provides us with paradigmatic examples of how the two types of (R) can interplay. If a woman or non-binary2 scientist is the subject of a query, automatically translated pages run the risk of referring to them via masculine-inflected job qualifications. In this case, the subject is misrepresented, leading to potential feelings of identity invalidation (Zimman et al., 2017). Also, users may not be aware of being exposed to MT mistakes due to the deceptively fluent output of a system (Martindale and Carpuat, 2018). In the long run, stereotypical assumptions and prejudices (e.g. only men are qualified for high-level positions) might be reinforced (Levesque, 2011; Régner et al., 2019).

2 Throughout the paper, we use non-binary as an umbrella term for referring to all gender identities between or outside the masculine/feminine binary categories.

As regards (A), MT services are consumed by the general public and can thus be regarded as resources in their own right. Hence, (R) can directly imply (A) as a performance disparity across users in the quality of service, i.e. the overall efficiency of the service. Accordingly, a woman attempting to translate her biography by relying on an MT system requires additional energy and time to revise wrong masculine references. If such disparities are not accounted for, the MT field runs the risk of producing systems that prevent certain groups from fully benefiting from such technological resources.

In the following, we operationalize such categories to map current studies on gender bias to their motivations and societal implications.
3 Understanding Bias

To confront bias in MT, it is vital to reach out to other disciplines that foregrounded how the socio-cultural notions of gender interact with language(s), translation, and implicit biases. Afterward, we can discuss the multiple factors that concur to encode and amplify gender inequalities in language technology. Note that, except for (Saunders et al., 2020), current studies on gender bias in MT have assumed an (often implicit) binary vision of gender. As such, our discussion is largely forced into such a classification. Although we reiterate on bimodal feminine/masculine linguistic forms and social categories, we emphasize that gender encompasses multiple biosocial elements not to be conflated with sex (Risman, 2018; Fausto-Sterling, 2019), and that some individuals do not experience gender at all, or in binary terms (Glen and Hurrell, 2012).

3.1 Gender and Language

The relation between language and gender is not straightforward. First, the linguistic structures used to refer to the extra-linguistic reality of gender vary across languages (§3.1.1). Moreover, how gender is assigned and perceived in our verbal practices depends on contextual factors as well as on assumptions about social roles, traits, and attributes (§3.1.2). Finally, language is conceived as a tool for articulating and constructing personal identities (§3.1.3).

3.1.1 Linguistic Encoding of Gender

Drawing on (Corbett, 1991; Craig, 1994; Comrie, 1999; Hellinger and Bußman, 2001, 2002, 2003; Corbett, 2013; Gygax et al., 2019), we hereby describe the linguistic forms (lexical, pronominal, grammatical) that bear a relation with the extra-linguistic reality of gender. Following Stahlberg et al. (2007), we identify three language groups:

Genderless languages (e.g. Finnish, Turkish). In such languages, the gender-specific repertoire is at its minimum, only expressed for basic lexical pairs, usually kinship or address terms (e.g. in Finnish sisko/sister vs. veli/brother).

Notional gender languages3 (e.g. Danish, English). On top of lexical gender (mom/dad), such languages display a system of pronominal gender (she/he, her/him). English also hosts some marked derivative nouns (actor/actress) and compounds (chairman/chairwoman).

3 Also referred to as natural gender languages. Following (McConnell-Ginet, 2013), we prefer notional to avoid terminological overlapping with "natural", i.e. biological/anatomical sexual categories. For a wider discussion on the topic see (Nevalainen and Raumolin-Brunberg, 1993; Curzan, 2003).

Grammatical gender languages (e.g. Arabic, Spanish). In these languages, each noun pertains to a class such as masculine, feminine, or neuter (if present). Although for most inanimate objects gender assignment is just formal,4 for human referents masculine/feminine markings are assigned on a semantic basis. Grammatical gender is defined by a system of morphosyntactic agreement, where several parts of speech beside the noun (e.g. verbs, determiners, adjectives) carry gender inflections.

4 E.g. "moon" is masculine in German, feminine in French.

In light of the above, the English sentence "she/he is a good friend" has no overt expression of gender in a genderless language like Turkish ("o iyi bir arkadaş"), whereas Spanish spreads several overt feminine ("Ella es una buena amiga") or masculine ("Él es un buen amigo") markings. Although general, these macro-categories allow us to highlight typological differences across languages, crucial to frame gender issues in both human and machine translation. They also show to what extent speakers of each group are led to think and communicate via binary distinctions,5 and underline the relative complexity of carving out a space for lexical innovations which encode non-binary gender (Hord, 2016; Conrod, 2020). In this sense, while English is attempting to bring the singular they into common use and to develop neo-pronouns (Bradley et al., 2019), grammatical gender languages like Spanish require the development of neo-morphemes to achieve neutrality ("Elle es une buene amigue").

5 Outside of the Western paradigm, there are cultures whose languages traditionally encode gender outside of the binary (Epple, 1998; Murray, 2003; Hall and O'Donovan, 2014).

3.1.2 Social Gender Connotations

To understand gender bias, we have to grasp not only the structure of different languages, but also how linguistic expressions are connoted, deployed, and perceived (Hellinger and Motschenbacher, 2015). In grammatical gender languages, feminine forms are often subject to a so-called semantic derogation (Schulz, 1975), e.g. in French, couturier (fashion designer) vs. couturière (seamstress). English is no exception (e.g. major/majorette).

Bias can also creep in covertly, as in the case of epicene (i.e. gender-neutral) nouns, where gender is not grammatically marked. Here, gender assignment is linked to (typically binary) social gender, i.e. "the socially imposed dichotomy of masculine and feminine role and character traits" (Kramarae and Treichler, 1985).
As an illustration, Danish speakers tend to pronominalize dommer (judge) with han (he) when referring to the whole occupational category (Gomard, 1995; Nissen, 2002). Social gender assignment varies across time and space (Lyons, 1977; Romaine, 1999; Cameron, 2003) and regards stereotypical assumptions about what is typical or appropriate for men and women. Such assumptions impact our perceptions (Hamilton, 1988; Gygax et al., 2008; Kreiner et al., 2008) and influence our behavior – e.g. leading individuals to identify with and fulfill stereotypical expectations (Wolter and Hannover, 2016; Sczesny et al., 2018) – and verbal communication, e.g. women are often misquoted in the academic community (Krawczyk, 2017).

Translation studies highlight how social gender assignment influences translation choices (Jakobson, 1959; Chamberlain, 1988; Comrie, 1999; Di Sabato and Perri, 2020). Primarily, the problem arises from typological differences across languages and their gender systems; nonetheless, socio-cultural factors influence how translators deal with such differences. Consider the character of the cook in Daphne du Maurier's "Rebecca", whose gender is never explicitly stated in the whole book. In the absence of any available information, translators into five grammatical gender languages differently represented the character as a man or a woman (Wandruszka, 1969; Nissen, 2002). Although extreme, this case represents to a certain extent the situation of uncertainty faced by MT: the mapping of one-to-many forms in gender prediction. But, as discussed in §4.1, mistranslations occur when contextual gender information is available, too.

3.1.3 Gender and Language Use

Language use varies between demographic groups and reflects their backgrounds, personalities, and social identities (Labov, 1972; Trudgill, 2000; Pennebaker and Stone, 2003). In this light, the study of gender and language variation has received much attention in socio- and corpus linguistics (Holmes and Meyerhoff, 2003; Eckert and McConnell-Ginet, 2013). Research conducted in speech and text analysis highlighted several gender differences, which are exhibited at the phonological and lexical-syntactic level. For example, women rely more on hedging strategies ("it seems that"), purpose clauses ("in order to"), first-person pronouns, and prosodic exclamations (Mulac et al., 2001; Mondorf, 2002; Brownlow et al., 2003). Although some correspondences between gender and linguistic features hold across cultures and languages (Smith, 2003; Johannsen et al., 2015), it should be kept in mind that they are far from universal6 and should not be intended in a stereotypical and oversimplistic manner (Bergvall et al., 1996; Nguyen et al., 2016; Koolen and van Cranenburgh, 2017).

6 It has been largely debated whether gender-related differences are inherently biological or cultural and social products (Mulac et al., 2001). Currently, the idea that they depend on biological reasons is largely rejected (Hyde, 2005) in favour of a socio-cultural or performative perspective (Butler, 1990).

Drawing on gender-related features proved useful to build demographically informed NLP tools (Garimella et al., 2019) and personalized MT models (Mirkin et al., 2015; Bawden et al., 2016; Rabinovich et al., 2017). However, using personal gender as a variable requires a prior understanding of which categories may be salient, and a critical reflection on how gender is intended and ascribed (Larson, 2017). Otherwise, if we assume that the only relevant (sexual) categories are "male" and "female", our models will inevitably fulfill such a reductionist expectation (Bamman et al., 2014).

3.2 Gender Bias in MT

To date, an overview of how several factors may contribute to gender bias in MT does not exist. We identify and clarify concurring problematic causes, accounting for the context in which systems are developed and used (§2). To this aim, we rely on the three overarching categories of bias described by Friedman and Nissenbaum (1996), which foreground different sources that can lead to the manifestation of machine bias. These are: pre-existing bias – rooted in our institutions, practices, and attitudes (§3.2.1); technical bias – due to technical constraints and decisions (§3.2.2); and emergent bias – arising in the context of interaction with users (§3.2.3). Rather than as discrete, we consider such categories as placed on a continuum.

3.2.1 Pre-existing Bias

MT models are known to reflect gender disparities present in the data. However, reflections on such generally invoked disparities are often overlooked. Treating data as an abstract, monolithic entity (Gitelman, 2013) – or relying on "overly broad/overloaded terms like training data bias"7 (Suresh and Guttag, 2019) – does not encourage reasoning on the many factors of which data are the product, first and foremost the historical, socio-cultural context in which they are generated.

7 See (Johnson, 2020a; Samar, 2020) for a discussion on how such a narrative can be counterproductive for tackling bias.
A starting point to tackle these issues is the Europarl corpus (Koehn, 2005), where only 30% of sentences are uttered by women (Vanmassenhove et al., 2018). Such an imbalance is a direct window into the glass ceiling that has hampered women's access to parliamentary positions. This case exemplifies how data might be "tainted with historical bias", mirroring an "unequal ground truth" (Hacker, 2018). However, other gender variables are harder to spot and quantify.

Empirical research in linguistics pointed out that subtle gender asymmetries are rooted in languages' use and structure. For instance, an important aspect regards how women are referred to. Femaleness is often explicitly invoked when there is no textual need to do so, even in languages that do not require overt gender marking. A case in point is Turkish, which differentiates cocuk (child) and kiz cocugu (female child) (Braun, 2000). Similarly, in a corpus search, Romaine (2001) found 155 explicit feminine markings for doctor (female, woman or lady doctor), compared to only 14 occurrences of male doctor. Feminist language critique provided extensive analysis of such a phenomenon by highlighting how referents in discourse are considered men by default unless explicitly stated otherwise (Silveira, 1980; Hamilton, 1991). Finally, prescriptive top-down guidelines limit the linguistic visibility of gender diversity, e.g. the Real Academia de la Lengua Española recently discarded the official use of non-binary innovations and claimed the functionality of masculine generics (Mundo, 2018; López et al., 2020).

By stressing such issues, we are not condoning the reproduction of pre-existing bias in MT. Rather, the above-mentioned concerns are the starting point to account for when dealing with gender bias.

3.2.2 Technical Bias

Technical bias comprises aspects related to data creation, model design, and training and testing procedures. If present in training and testing samples, asymmetries in the semantics of language use and in gender distribution are respectively learnt by MT systems and rewarded in their evaluation. However, as just discussed, biased representations are not merely quantitative, but also qualitative. Accordingly, straightforward procedures – e.g. balancing the number of speakers in existing datasets – do not ensure a fairer representation of gender in MT outputs. As datasets are a crucial source of bias, this advocates for careful data curation (Mehrabi et al., 2019; Paullada et al., 2020; Hanna et al., 2021; Bender et al., 2021), guided by pragmatically- and socially-informed analyses (Hitti et al., 2019; Sap et al., 2020; Devinney et al., 2020) and annotation practices (Gaido et al., 2020).

Overall, while data can mirror gender inequalities and offer adverse shortcut learning opportunities, it is "quite clear that data alone rarely constrain a model sufficiently" (Geirhos et al., 2020), nor do data explain the fact that models overamplify (Shah et al., 2020) such inequalities in their outputs. Focusing on models' components, Costa-jussà et al. (2020b) demonstrate that architectural choices in multilingual MT impact systems' behavior: shared encoder-decoders retain less gender information in the source embeddings and less diversion in the attention than language-specific encoder-decoders (Escolano et al., 2021), thus disfavouring the generation of feminine forms. While discussing the loss and decay of certain words in translation, Vanmassenhove et al. (2019, 2021) attest the existence of an algorithmic bias that leads forms underrepresented in the training data – as may be the case for feminine references – to further decrease in the MT output. Specifically, Roberts et al. (2020) prove that beam search – unlike sampling – is skewed toward the generation of more frequent (masculine) pronouns, as it leads models to an extreme operating point that exhibits zero variability.

Thus, efforts towards understanding and mitigating gender bias should also account for the model front. To date, this remains largely unexplored.
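To give concrete intuition for the beam search finding above, consider the following toy sketch (ours, not Roberts et al.'s actual setup): if a model assigns probability 0.7 to he and 0.3 to she at a gender-ambiguous position, greedy/1-best decoding always returns the mode, while sampling preserves the learned ratio in expectation.

    import random

    probs = {"he": 0.7, "she": 0.3}  # toy distribution at an ambiguous position

    def greedy(distribution):
        # 1-best (beam-like) decoding collapses to the most frequent form:
        # zero variability, hence an all-masculine output
        return max(distribution, key=distribution.get)

    def sample(distribution):
        # ancestral sampling reproduces the 70/30 ratio in expectation
        words = list(distribution)
        return random.choices(words, weights=[distribution[w] for w in words])[0]

    print([greedy(probs) for _ in range(5)])  # ['he', 'he', 'he', 'he', 'he']
    print([sample(probs) for _ in range(5)])  # e.g. ['he', 'she', 'he', 'he', 'she']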
3.2.3 Emergent Bias

Emergent bias may arise when a system is used in a different context than the one it was designed for, e.g. when it is applied to another demographic group. From car crash dummies to clinical trials, we have evidence of how not accounting for gender differences leads to the creation of male-grounded products with dire consequences (Liu and Dipietro Mager, 2016; Criado-Perez, 2019), such as higher death and injury risks in vehicle crashes and less effective medical treatments for women. Similarly, unbeknownst to their creators, MT systems that are not envisioned for a diverse range of users will not generalize to the feminine segment of the population. Hence, in the interaction with an MT system, a woman will likely be misgendered or not have her linguistic style preserved (Hovy et al., 2020).
Other conditions of user/system mismatch may be the result of changing societal knowledge and values. A case in point regards Google Translate's historical decision to adjust its system for instances of gender ambiguity. Since its launch twenty years ago, Google had provided only one translation for single-word gender-ambiguous queries (e.g. the English professor translated into Italian with the masculine professore). In a community increasingly conscious of the power of language to hardwire stereotypical beliefs and women's invisibility (Lindqvist et al., 2019; Beukeboom and Burgers, 2019), the bias exhibited by the system was confronted with a new sensitivity. The service's announcement (Kuczmarski, 2018) that it would provide a double feminine/masculine output (professor→professoressa|professore) stems from current demands for gender-inclusive resolutions. For the recognition of non-binary groups (Richards et al., 2016), we invite the study of how such modeling could be integrated with neutral strategies (§6).

4 Assessing Bias

First accounts of gender bias in MT date back to Frank et al. (2004). Their manual analysis pointed out how English-German MT suffers from linguistic shortcomings, observing severe difficulties in recovering the syntactic and semantic information needed to correctly produce gender agreement.

Similar investigations were conducted on other target grammatical gender languages for several commercial MT systems (Abu-Ayyash, 2017; Monti, 2017; Rescigno et al., 2020). While these studies focused on contrastive phenomena, Schiebinger (2014)8 went beyond linguistic insights, calling for a deeper understanding of gender bias. Her article on Google Translate's "masculine default" behavior emphasized how such a phenomenon is related to a larger discussion on gender inequalities, also perpetuated by socio-technical artifacts (Selbst et al., 2019). All in all, these qualitative analyses demonstrated that gender problems encompass all three MT paradigms (neural, statistical, and rule-based), preparing the ground for quantitative work.

8 See also Schiebinger's project Gendered Innovations.

To attest the existence and scale of gender bias across several languages, dedicated benchmarks, evaluations, and experiments have been designed. We first discuss large-scale analyses aimed at assessing gender bias in MT, grouped according to two main conceptualizations: i) works focusing on the weight of prejudices and stereotypes in MT (§4.1); ii) studies assessing whether gender is properly preserved in translation (§4.2). To keep the connection with the human-centered approach embraced in this survey, in Table 1 we map each work to the harms (see §2) ensuing from the biased behaviors they assess. Finally, we review existing benchmarks for comparing MT performance across genders (§4.3).

4.1 MT and Gender Stereotypes

In MT, we record prior studies concerned with pronoun translation and coreference resolution across typologically different languages, accounting for both animate and inanimate referents (Hardmeier and Federico, 2010; Le Nagard and Koehn, 2010; Guillou, 2012). For the specific analysis of gender bias, instead, such tasks are exclusively studied in relation to human entities.

Prates et al. (2018) and Cho et al. (2019) design a similar setting to assess gender bias. Prates et al. (2018) investigate pronoun translation from 12 gender-neutral languages into English. Retrieving ∼1,000 job positions from the U.S. Bureau of Labor Statistics, they build simple constructions like the Hungarian "ő egy mérnök" ("he/she is an engineer"). Following the same template, Cho et al. (2019) extend the analysis to Korean-English, including both occupations and sentiment words (e.g. kind). As their samples are ambiguous by design, the observed predictions of he/she pronouns should be basically a random guess, yet they show a general strong masculine skew.9

9 Cho et al. (2019) also recognize that a potentially higher frequency of feminine references in the MT output would not necessarily imply that the problem of bias is alleviated. Rather, it may reflect gender stereotypes, as for hairdresser, which is skewed toward feminine.

To further analyze the under-representation of she pronouns, Prates et al. (2018) focus on 22 macro-categories of occupation areas (e.g. STEM, communication, administration) and compare the proportion of pronoun predictions against the real-world proportion of men and women employed in such sectors. In this way, they see that MT not only yields a masculine default, but also underestimates feminine frequency at a greater rate than occupation data alone suggest. Such an analysis attests the existence of machine bias, and defines it as the exacerbation of actual gender disparities.
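To make this protocol concrete, the following minimal sketch (ours; the translate callable stands for any MT system and is a hypothetical placeholder, not an API from the cited works) builds Hungarian templates from an occupation list and counts the pronouns chosen by the system:

    import re

    def masculine_default_probe(occupations, translate):
        """occupations: Hungarian nouns, e.g. ["mérnök", "ápoló"];
        translate: callable mapping a Hungarian sentence to English."""
        counts = {"he": 0, "she": 0, "other": 0}
        for occupation in occupations:
            source = f"ő egy {occupation}"  # gender-neutral: "he/she is a(n) ..."
            target = translate(source).lower()
            if re.search(r"\bhe\b", target):
                counts["he"] += 1
            elif re.search(r"\bshe\b", target):
                counts["she"] += 1
            else:
                counts["other"] += 1
        return counts

Since every source sentence is gender-ambiguous, pronoun choices far from parity, and a fortiori further from parity than real-world employment statistics, signal the masculine-default behavior discussed above.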
Going beyond word lists and simple synthetic constructions, Gonen and Webster (2020) inspect the translation into Russian, Spanish, German, and French of natural, but still ambiguous, English sentences. Their analysis of the ratio and type of generated masculine/feminine job titles consistently exhibits social asymmetries for target grammatical gender languages (e.g. lecturer masculine vs. teacher feminine). Finally, Stanovsky et al. (2019) assess that MT is skewed to the point of actually ignoring explicit feminine gender information in source English sentences. For instance, MT systems yield a wrong masculine translation of the job title baker even though it is referred to by the pronoun she. Besides overlooking overt gender mentions, the models' reliance on unintended (and irrelevant) cues for gender assignment is further confirmed by the fact that adding a socially connoted – but formally epicene – adjective (the pretty baker) pushes models toward feminine inflections in translation.
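Such reliance on irrelevant cues can be made testable with a simple perturbation probe, sketched below (ours; translate and the target-language check for feminine inflection of the job title are hypothetical placeholders):

    def adjective_flip(job_title, adjective, translate, is_feminine):
        """Compare translations of the same sentence with and without a
        socially connoted but formally epicene adjective."""
        plain = translate(f"The {job_title} finished her work.")
        perturbed = translate(f"The {adjective} {job_title} finished her work.")
        return is_feminine(plain, job_title), is_feminine(perturbed, job_title)

    # an unbiased system should inflect the job title femininely in BOTH
    # cases, since the pronoun "her" licenses it; an outcome like
    # (False, True) for ("baker", "pretty") reproduces the behavior above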
4.2 MT and Gender Preservation

Instead of analysing the weight of prejudices and stereotypes, Vanmassenhove et al. (2018) and Hovy et al. (2020) investigate whether the speaker's gender10 is properly reflected in MT. This line of research is preceded by findings on gender personalization of statistical MT (Mirkin et al., 2015; Bawden et al., 2016; Rabinovich et al., 2017), claiming that gender "signals" are weakened in translation.

10 Note that these studies distinguish speakers into female/male. As discussed in §3.1.3, we invite a reflection on the appropriateness and use of such categories.

Hovy et al. (2020) conjecture the existence of an age and gender stylistic bias due to models' underexposure to the writings of women and younger segments of the population. To test this hypothesis, they automatically translate a corpus of online reviews with available metadata about users (Hovy et al., 2015). Then, they compare such demographic information with the predictions of age and gender classifiers run on the MT output. Results indicate that different commercial MT models systematically make authors sound older and male. However, the authors do not inspect which stylistic features and linguistic choices MT overproduces.

In a similar vein, Vanmassenhove et al. (2018) probe MT's ability to preserve the speaker's gender when translating from English into ten languages. To this aim, they develop gender-informed MT models (see §5.1), whose outputs are compared with those obtained by their baseline counterparts. Tested on a set for spoken language translation (Koehn, 2005), their enhanced models show consistent gains in terms of overall quality when translating into grammatical gender languages, where the speaker's references are often marked. For instance, the French translation of "I'm happy" is either "Je suis heureuse" or "Je suis heureux" for a female/male speaker respectively. With a more focused cross-gender analysis – carried out by splitting their English-French test set into 1st person male vs. female data – they assess that the largest margin of improvement for their gender-informed approach concerns sentences uttered by women, as the results of their baseline disclose a disparity in favour of men speakers. Note that the authors rely on manual analysis to ascribe performance differences to gender-related features. In fact, global evaluations on generic test sets alone are inadequate to pointedly measure gender bias.

4.3 Existing Benchmarks

MT outputs are typically evaluated against reference translations by means of standard metrics such as BLEU (Papineni et al., 2002) or TER (Snover et al., 2006). This procedure poses two challenges. First, these metrics provide coarse-grained scores for translation quality, treating all errors equally and being rather insensitive to specific linguistic phenomena (Sennrich, 2017). Second, generic test sets containing the same gender imbalance present in the training data can actually reward biased predictions. Hereby, we describe the publicly available MT Gender Bias Evaluation Testsets (GBETs) (Sun et al., 2019), i.e. benchmarks designed to probe gender bias by isolating the impact of gender from other factors that may affect systems' performance. Note that different benchmarks and metrics respond to different conceptualizations of bias (Barocas et al., 2019). Common to them all in MT, however, is that biased behaviors are formalized using some variant of averaged performance11 disparities across gender groups, comparing the accuracy of gender predictions on an equal number of masculine, feminine, and neutral references.

11 This is a value-laden option (Birhane et al., 2020), and not the only possible one (Mitchell et al., 2020). For a broader discussion on measurement and bias we refer the reader also to (Jacobs, 2021; Jacobs et al., 2020).
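In its simplest form, such a disparity measure can be computed as follows (a generic sketch of the averaged-accuracy scheme, not the exact formulation of any single benchmark):

    from collections import defaultdict

    def accuracy_by_gender(examples):
        """examples: iterable of (expected_gender, predicted_gender) pairs,
        e.g. with genders in {"masculine", "feminine", "neutral"}."""
        hits = defaultdict(int)
        totals = defaultdict(int)
        for expected, predicted in examples:
            totals[expected] += 1
            hits[expected] += int(expected == predicted)
        return {gender: hits[gender] / totals[gender] for gender in totals}

    # bias then surfaces as a gap between the per-gender scores, e.g.
    # {"masculine": 0.92, "feminine": 0.61} on an equal number of references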
probe MT’s ability to preserve speaker’s gender
                                                                Escudé Font and Costa-jussà (2019) developed
translating from English into ten languages. To this
                                                             the bilingual English-Spanish Occupations test
aim, they develop gender-informed MT models
                                                             set, consisting of 1,000 sentences equally dis-
(see § 5.1), whose outputs are compared with those
                                                             tributed across genders. The phrasal structure
obtained by their baseline counterparts. Tested
                                                             envisioned for their sentences is “I’ve known
on a set for spoken language translation (Koehn,
                                                             {her|him|} for a long time, my
2005), their enhanced models show consistent
                                                               11
                                                                  This is a value-laden option (Birhane et al., 2020), and
  10
     Note that these studies distinguish speakers into fe-   not the only possible one (Mitchell et al., 2020). For a broader
male/male. As discussed in §3.1.3, we invite a reflection    discussion on measurement and bias we refer the reader also
on the appropriateness and use of such categories.           to (Jacobs, 2021; Jacobs et al., 2020).
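A sketch of how the template expands into test items (ours, following the phrasal structure quoted above; only the pronoun variants are shown):

    def occupations_testset(occupations):
        """Yield (source_sentence, gender) pairs for the Occupations template."""
        for occupation in occupations:
            article = "an" if occupation[0].lower() in "aeiou" else "a"
            for pronoun, gender in (("her", "feminine"), ("him", "masculine")):
                yield (f"I've known {pronoun} for a long time, "
                       f"my friend works as {article} {occupation}.", gender)

    # evaluation then checks whether "friend" is rendered as amiga/amigo in
    # the Spanish output at the same rate for feminine and masculine variants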
Study                          Benchmark                                       Gender   Harms
(Prates et al., 2018)          Synthetic, U.S. Bureau of Labor Statistics      b        R: under-rep, stereotyping
(Cho et al., 2019)             Synthetic equity evaluation corpus (EEC)        b        R: under-rep, stereotyping
(Gonen and Webster, 2020)      BERT-based perturbations on natural sentences   b        R: under-rep, stereotyping
(Stanovsky et al., 2019)       WinoMT                                          b        R: under-rep, stereotyping
(Vanmassenhove et al., 2018)   Europarl (generic)                              b        A: quality
(Hovy et al., 2020)            Trustpilot (reviews with gender and age)        b        R: under-rep, A: quality

Table 1: For each Study, the Table shows on which Benchmark gender bias is assessed and how Gender is intended (here only in binary (b) terms). Finally, we indicate which (R)epresentational Harm – under-representation and stereotyping – or (A)llocational Harm – as reduced quality of service – is addressed in the study.

Stanovsky et al. (2019) created WinoMT by concatenating two existing English GBETs for coreference resolution (Rudinger et al., 2018; Zhao et al., 2018a). The corpus consists of 3,888 Winogradesque sentences presenting two human entities defined by their role and a subsequent pronoun that needs to be correctly resolved to one of the entities (e.g. "The lawyer yelled at the hairdresser because he did a bad job"). For each sentence, there are two variants with either he or she pronouns, so as to cast the referred annotated entity (hairdresser) into a proto- or antistereotypical gender role. By translating WinoMT into grammatical gender languages, one can thus measure a system's ability to resolve the anaphoric relation and pick the correct feminine/masculine inflection for the occupational noun. It also allows verifying whether MT predictions correlate with stereotyping.
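Operationally, this evaluation can be sketched as follows (ours; translate, align, and morph_gender stand for an MT system, a source-target word aligner, and a target-language morphological analyzer, all hypothetical placeholders):

    def winomt_accuracy(examples, translate, align, morph_gender):
        """examples: dicts with a source sentence, the annotated entity
        (e.g. "hairdresser"), and the gender licensed by the pronoun."""
        correct = 0
        for example in examples:
            hypothesis = translate(example["source"])
            target_entity = align(example["source"], hypothesis, example["entity"])
            correct += int(morph_gender(target_entity) == example["gender"])
        return correct / len(examples)

Splitting the score between proto- and antistereotypical subsets then reveals whether errors correlate with stereotyping.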
Finally, Saunders et al. (2020) enriched the original version of WinoMT in two different ways. First, by including a third gender-neutral case based on the singular they pronoun, which paves the way for accounting also for non-binary referents. Second, by labeling the entity in the sentence which is not coreferent with the pronoun (lawyer). The latter annotation is used to verify the shortcomings of some mitigating approaches, as discussed in §5.

The above-mentioned corpora are said to be challenge sets, consisting of sentences created ad hoc for diagnostic purposes. In this way, they can be used to quantify bias in the context of stereotyping and under-representation in a sound environment. However, consisting of a limited variety of synthetic gender-related phenomena, they do not represent the actual challenges posed by real-world language and are relatively easy to overfit. In other words, as recognized by Rudinger et al. (2018), "they may demonstrate the presence of gender bias in a system, but not prove its absence".

The Arabic Parallel Gender Corpus (Habash et al., 2019) includes an English-Arabic test set retrieved from OpenSubtitles natural language data (Lison and Tiedemann, 2016). Each of the 2,448 sentences in the set exhibits a first person singular reference to the speaker (e.g. "I'm rich"). Among them, ∼200 English sentences require gender agreement to be assigned in translation. These were translated into Arabic in both gender forms, obtaining a quantitatively and qualitatively equal amount of sentence pairs with annotated masculine/feminine references. This natural corpus thus allows for cross-gender evaluations of MT's production of correct speaker gender agreement.

MuST-SHE (Bentivogli et al., 2020) is a natural benchmark for three language pairs (English-French/Italian/Spanish). Built on TED talks data (Cattoni et al., 2021), for each language pair it comprises ∼1,000 (audio, transcript, translation) triplets, thus allowing evaluation for both MT and speech translation (ST). Its samples are balanced between masculine and feminine phenomena, and incorporate two types of constructions: i) sentences referring to the speaker (e.g. "I was born in Mumbai"), and ii) sentences that present contextual information to disambiguate gender (e.g. "My mum was born in Mumbai"). Since every gender-marked word is annotated in the corpus, MuST-SHE grants the advantage of complementing BLEU- and accuracy-based evaluations of gender translation for a great variety of phenomena.

Unlike challenge sets, these natural corpora quantify whether MT yields reduced feminine representation in authentic conditions and whether the quality of service varies across speakers of different genders. However, as they treat all gender-marked words equally, it is not possible to identify whether the model is propagating stereotypical representations.
All in all, we stress that each test set and metric is only a proxy for framing a phenomenon or an ability (e.g. anaphora resolution), and an approximation of what we truly intend to gauge. Thus, as we discuss in §6, advances in MT should account for the observation of gender bias in real-world conditions, so as to avoid that high scores on a mathematically formalized estimate create a false sense of security. Still, benchmarks remain valuable tools to monitor a model's behavior. As such, we remark that evaluation procedures ought to cover both models' general performance and gender-related issues. This is crucial to establish the capabilities and limits of mitigating strategies.

5 Mitigating Bias

To attenuate gender bias in MT, different strategies dealing with input data, learning algorithms, and model outputs have been proposed. As attested by Birhane et al. (2020), since advancements are oftentimes exclusively reported in terms of values internal to the machine learning field (e.g. efficiency, performance), it is not clear how such strategies are meeting societal needs by reducing MT-related harms. In order to reconcile technical perspectives with the intended social purpose, in Table 2 we map each mitigating approach to the harms (see §2) it is meant to alleviate, as well as to the benchmark on which its effectiveness is evaluated. Complementarily, we hereby describe each approach by means of two categories: model debiasing (§5.1) and debiasing through external components (§5.2).

5.1 Model Debiasing

This line of work focuses on mitigating gender bias through architectural changes to general-purpose MT models or via dedicated training procedures.

Gender tagging. To improve the generation of the speaker's referential markings, Vanmassenhove et al. (2018) prepend a gender tag (M or F) to each source sentence, both at training and inference time. As their model is able to leverage this additional information, the approach proves useful to handle morphological agreement when translating from English into French. However, this solution requires additional metadata regarding the speakers' gender that might not always be feasible to acquire. Automatic annotation of speakers' gender (e.g. based on first names) is not advisable, as it runs the risk of introducing additional bias by making unlicensed assumptions about one's identity.
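Concretely, the preprocessing amounts to prepending the metadata to the source side (a minimal sketch; the tag format is illustrative, not necessarily the authors' exact one):

    def tag_speaker_gender(source_sentence, speaker_gender):
        """speaker_gender: "M" or "F", supplied as metadata at both
        training and inference time."""
        assert speaker_gender in {"M", "F"}
        return f"{speaker_gender} {source_sentence}"

    tag_speaker_gender("I'm happy", "F")  # -> "F I'm happy"
    # biasing the decoder toward "Je suis heureuse" rather than the
    # masculine default "Je suis heureux"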
Elaraby et al. (2018) bypass this risk by defining a comprehensive set of cross-lingual gender agreement rules based on POS tagging. In this way, they identify the speakers' and listeners' gender references in an English-Arabic parallel corpus, which is consequently labelled and used for training. However, such an approach is not directly scalable to other languages, as it would require creating new dedicated rules. Moreover, in realistic deployment conditions where reference translations are not available, this information still has to be externally supplied as metadata at inference time.

Stafanovičs et al. (2020) and Saunders et al. (2020) explore the use of word-level gender tags. While Stafanovičs et al. (2020) just report a gender translation improvement, Saunders et al. (2020) rely on the expanded version of WinoMT to identify a problem concerning gender tagging: it introduces noise if applied to sentences with references to multiple participants, as it pushes their translation toward the same gender. Saunders et al. (2020) also include a first non-binary exploration of neutral translation by exploiting an artificial dataset, where neutral tags are added and gendered inflections are replaced by placeholders. The results are however inconclusive, most likely due to the small size and synthetic nature of their dataset.

Adding context. Without further information needed for training or inference, Basta et al. (2020) adopt a generic approach and concatenate each sentence with its preceding one. By providing more context, they attest a slight improvement for gender translations requiring anaphoric coreference to be solved in English-Spanish. This finding motivates exploration at the document level, but it should be validated with manual (Castilho et al., 2020) and interpretability analyses, as the added context can be beneficial for gender-unrelated reasons, such as acting as a regularization factor (Kim et al., 2019).
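The transformation itself is a simple preprocessing step (our sketch; the separator token is an assumption):

    def add_preceding_context(sentences, sep=" <CONTEXT> "):
        """Concatenate each sentence with its preceding one (document order)."""
        return [sentences[0]] + [prev + sep + curr
                                 for prev, curr in zip(sentences, sentences[1:])]

    add_preceding_context(["She is our doctor.", "The doctor arrived late."])
    # the pronoun in the first sentence can now license the feminine
    # inflection (e.g. Spanish "la doctora") when translating the second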
Debiased word embeddings. The two above-mentioned mitigations converge on the same intent: supplying the model with additional gender knowledge. Instead, Escudé Font and Costa-jussà (2019) leverage pre-trained word embeddings, which are debiased using the hard-debiasing method proposed by Bolukbasi et al. (2016) or the GN-GLOVE algorithm (Zhao et al., 2018b). With these methods, gender associations are respectively removed or isolated from the representations of English gender-neutral words.
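For intuition, the core step of hard-debiasing removes the projection of a word vector onto an estimated gender direction; a minimal sketch:

    import numpy as np

    def neutralize(vector, gender_direction):
        """Zero out the component of `vector` along the gender direction,
        e.g. a direction estimated from definitional pairs such as he-she."""
        g = gender_direction / np.linalg.norm(gender_direction)
        return vector - np.dot(vector, g) * g

    # after neutralization, a gender-neutral word such as "engineer" has a
    # zero projection on the gender direction, i.e. it sits equidistant
    # from "he" and "she" along that axis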
Approach              Authors                         Benchmark                         Gender   Harms
Gender tagging        Vanmassenhove et al.            Europarl (generic)                b        R: under-rep, A: quality
(sentence-level)      Elaraby et al.                  OpenSubtitles (generic)           b        R: under-rep, A: quality
Gender tagging        Saunders et al.                 expanded WinoMT                   nb       R: under-rep, stereotyping
(word-level)          Stafanovičs et al.              WinoMT                            b        R: under-rep, stereotyping
Adding context        Basta et al.                    WinoMT                            b        R: under-rep, stereotyping
Word embeddings       Escudé Font and Costa-jussà     Occupations test set              b        R: under-rep
Fine-tuning           Costa-jussà and de Jorge        WinoMT                            b        R: under-rep, stereotyping
Black-box injection   Moryossef et al.                OpenSubtitles (selected sample)   b        R: under-rep, A: quality
Lattice re-scoring    Saunders and Byrne              WinoMT                            b        R: under-rep, stereotyping
Re-inflection         Habash et al.; Alhafni et al.   Arabic Parallel Gender Corpus     b        R: under-rep, A: quality

Table 2: For each Approach and related Authors, the Table shows on which Benchmark it is tested, and whether Gender is intended in binary terms (b) or including non-binary (nb) identities. Finally, we indicate which (R)epresentational Harm – under-representation and stereotyping – or (A)llocational Harm – as reduced quality of service – the approach attempts to mitigate.


5.1  Model Debiasing

This line of work focuses on mitigating gender bias through architectural changes to general-purpose MT models or via dedicated training procedures.

Gender tagging. To improve the generation of the speaker's referential markings, Vanmassenhove et al. (2018) prepend a gender tag (M or F) to each source sentence, both at training and inference time. As their model is able to leverage this additional information, the approach proves useful for handling morphological agreement when translating from English into French. However, this solution requires additional metadata regarding the speakers' gender that might not always be feasible to acquire. Automatic annotation of speakers' gender is instead attempted by Elaraby et al. (2018). Moving from sentence-level to word-level tagging, Stafanovičs et al. (2020) and Saunders et al. (2020) annotate source words with the grammatical gender of the corresponding target words. Saunders et al. (2020) additionally identify a problem concerning gender tagging: it introduces noise if applied to sentences with references to multiple participants, as it pushes their translation toward the same gender. Saunders et al. (2020) also include a first non-binary exploration of neutral translation by exploiting an artificial dataset, where neutral tags are added and gendered inflections are replaced by placeholders. The results are, however, inconclusive, most likely due to the small size and synthetic nature of their dataset.
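
As a minimal sketch of the tagging mechanism (not the authors' actual code), the step could look as follows; the tag format and the metadata fields are illustrative assumptions:

    def tag_source(sentence: str, speaker_gender: str) -> str:
        """Prepend a speaker-gender tag (M or F) to a source sentence."""
        assert speaker_gender in {"M", "F"}, "metadata must supply M or F"
        return f"<{speaker_gender}> {sentence}"

    # Speaker gender comes from corpus metadata which, as noted above,
    # is not always available in practice.
    corpus = [("I am glad to be here.", "F"), ("I was asked to speak.", "M")]
    tagged = [tag_source(src, gender) for src, gender in corpus]
    print(tagged)  # ['<F> I am glad to be here.', '<M> I was asked to speak.']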

Adding context. Without further information needed for training or inference, Basta et al. (2020) adopt a generic approach and concatenate each sentence with its preceding one. By providing more context, they attest a slight improvement for gender translations requiring anaphoric coreference to be solved in English-Spanish. This finding motivates exploration at the document level, but it should be validated with manual (Castilho et al., 2020) and interpretability analyses, as the added context can be beneficial for gender-unrelated reasons, such as acting as a regularization factor (Kim et al., 2019).
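
The concatenation step itself can be implemented in a few lines; in the sketch below, the <SEP> separator token is our own assumption rather than a detail reported by the authors:

    def add_context(sentences, sep="<SEP>"):
        """Pair each sentence with its preceding one (the first has none)."""
        out = [sentences[0]] if sentences else []
        for prev, curr in zip(sentences, sentences[1:]):
            out.append(f"{prev} {sep} {curr}")
        return out

    doc = ["The doctor arrived late.", "She apologized to the patients."]
    print(add_context(doc))
    # ['The doctor arrived late.',
    #  'The doctor arrived late. <SEP> She apologized to the patients.']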

Debiased word embeddings. The two above-mentioned mitigations converge on the same intent: supply the model with additional gender knowledge. Instead, Escudé Font and Costa-jussà (2019) leverage pre-trained word embeddings, which are debiased using the hard-debiasing method proposed by Bolukbasi et al. (2016) or the GN-GLOVE algorithm (Zhao et al., 2018b). With these methods, gender associations are respectively removed or isolated from the representations of English gender-neutral words. Escudé Font and Costa-jussà (2019) experiment using such embeddings on the decoder side, the encoder side, and both sides of an English-Spanish model. The best results are obtained by leveraging GN-GLOVE embeddings on both encoder and decoder sides, increasing BLEU scores and gender accuracy. The authors generically apply debiasing methods developed for English also to their target language. However, since Spanish is a grammatical gender language, other language-specific approaches should be considered to preserve the quality of the original embeddings (Zhou et al., 2019; Zhao et al., 2020). We also stress that it is debated whether depriving systems of some knowledge and "blinding" their perceptions is the right path toward fairer language models (Dwork et al., 2012; Caliskan et al., 2017; Gonen and Goldberg, 2019; Nissim and van der Goot, 2020). Also, Goldfarb-Tarrant et al. (2021) find that there is no reliable correlation between intrinsic evaluations of bias in word embeddings and cascaded effects on MT models' biased behavior.
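
To illustrate what such debiasing does geometrically, the following sketch implements the "neutralize" step of hard-debiasing (Bolukbasi et al., 2016) on toy vectors; real systems would apply it to pre-trained embeddings of gender-neutral words:

    import numpy as np

    def neutralize(embedding, gender_direction):
        """Remove the component of an embedding along the gender direction."""
        g = gender_direction / np.linalg.norm(gender_direction)
        return embedding - np.dot(embedding, g) * g

    # Toy 3-d vectors; a gender direction is often estimated as he - she.
    he = np.array([1.0, 0.2, 0.0])
    she = np.array([-1.0, 0.2, 0.0])
    engineer = np.array([0.7, 0.5, 0.3])

    debiased = neutralize(engineer, he - she)
    print(np.dot(debiased, he - she))  # ~0.0: no residual gender component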
                                                                   a balanced amount of masculine/feminine forms.
Balanced fine-tuning. Finally, Costa-jussà and de Jorge (2020) rely on Gebiotoolkit (Costa-jussà et al., 2020c) to build gender-balanced datasets (i.e. featuring an equal amount of masculine/feminine references) based on Wikipedia biographies. By fine-tuning their models on such natural and more representative data, the generation of feminine forms is overall improved. However, the approach is not as effective for gender translation on the anti-stereotypical WinoMT set.
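
A gender-balanced fine-tuning set can be approximated by downsampling the majority class, as in the generic sketch below; this illustrates the balancing idea only and does not reflect Gebiotoolkit's actual interface:

    import random

    def balance(pairs, seed=0):
        """pairs: list of (src, tgt, gender) tuples, with gender in {'M', 'F'}."""
        masc = [p for p in pairs if p[2] == "M"]
        fem = [p for p in pairs if p[2] == "F"]
        n = min(len(masc), len(fem))
        rng = random.Random(seed)  # reproducible subsampling
        return rng.sample(masc, n) + rng.sample(fem, n)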

5.2  Debiasing through External Components

Instead of directly debiasing the MT model, these mitigating strategies intervene in the inference phase with external dedicated components. Such approaches do not imply retraining, but introduce the additional cost of maintaining separate modules and handling their integration with the MT model.

Black-box injection. Moryossef et al. (2019) attempt to control the production of feminine references to the speaker and numeral inflections (plural or singular) for the listener(s) in an English-Hebrew setting. To this aim, they rely on a short construction, such as "she said to them", which is prepended to the source sentence and then removed from the MT output. Their approach is simple: it can handle two types of information (gender and number) for multiple entities (speaker and listener), and it improves systems' ability to generate feminine target forms. However, as in the case of Vanmassenhove et al. (2018) and Elaraby et al. (2018), it requires metadata about speakers and listeners.
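
The injection itself can be sketched as a thin wrapper around any black-box system; the translate function and the translated prefix below are placeholders, not the authors' actual resources:

    def translate_with_injection(sentence, construction, translate, translated_prefix):
        """Prepend a gendered construction, translate, then strip its translation."""
        hypothesis = translate(f"{construction} {sentence}")
        if hypothesis.startswith(translated_prefix):
            hypothesis = hypothesis[len(translated_prefix):].lstrip()
        return hypothesis

    # Dummy "MT engine" that copies its input, standing in for a real system
    # that would translate "she said to them: ..." with feminine target forms.
    dummy_mt = lambda text: text
    print(translate_with_injection("I am tired.", "she said to them:", dummy_mt,
                                   "she said to them:"))
    # -> "I am tired."  (the injected construction is stripped from the output)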

Lattice re-scoring. Saunders and Byrne (2020) propose to post-process the MT output with a lattice re-scoring module. This module exploits a transducer to create a lattice by mapping gender-marked words in the MT output to all their possible inflectional variants. Developed for German, Spanish, and Hebrew, the approach re-scores all the sentences corresponding to the paths in the lattice with another model, which has been gender-debiased, but at the cost of lower generic translation quality. Then, the sentence with the highest probability is picked as the final output. When tested on WinoMT, such an approach notably leads to an increase in the accuracy of gender form selection. Note that the gender-debiased system is created by fine-tuning the model on an ad-hoc built tiny set containing a balanced amount of masculine/feminine forms. Such an approach, also known as counterfactual data augmentation (Lu et al., 2020), requires creating identical pairs of sentences differing only in terms of gender references. In fact, Saunders and Byrne (2020) compile English sentences following this schema: "The <profession> finished <his|her> work". Then, the sentences are automatically translated and manually checked. In this way, they obtain a gender-balanced parallel corpus. Thus, to implement their method for other language pairs, the generation of new data is necessary. Although for their fine-tuning set the effort required is limited, data augmentation can be very costly for complex sentences representing a richer variety of gender agreement phenomena.12

12 Zmigrod et al. (2019) proposed an automatic approach for augmenting data into morphologically-rich languages, but it is only viable for simple constructions with one single entity.
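
Generating counterfactual pairs from the quoted schema is mechanical, as the sketch below shows; the profession list is illustrative, and the resulting sentences would then be translated and manually checked:

    PROFESSIONS = ["engineer", "nurse", "doctor", "cleaner"]

    def counterfactual_pairs(professions):
        """Build sentence pairs identical up to the gendered pronoun."""
        return [(f"The {p} finished his work.", f"The {p} finished her work.")
                for p in professions]

    for masculine, feminine in counterfactual_pairs(PROFESSIONS):
        print(masculine, "|", feminine)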

Gender re-inflection. Habash et al. (2019) and Alhafni et al. (2020) confront the problem of speaker's gender agreement in Arabic with a post-processing component that re-inflects 1st person references into masculine/feminine forms. In Alhafni et al. (2020), the preferred gender of the speaker and the translated Arabic sentence are fed to that component, which re-inflects the sentence in the desired form. In Habash et al. (2019), instead, the component can be: i) a two-step system that first identifies the gender of 1st person references in an MT output, and then re-inflects them in the opposite form; ii) a single-step system that always produces both forms from an MT output. Their method does not necessarily require speakers' gender information: if metadata are supplied, the MT output is re-inflected accordingly; otherwise, both feminine/masculine inflections are offered, leaving to the user the choice of the appropriate one. While beneficial for English-Arabic, their approach is not directly applicable to other languages. In fact, unlike Saunders and Byrne (2020), implementing the re-inflection component demanded the expensive creation of the Arabic Parallel Gender Corpus (§4.3). Along the same line, Google Translate now also delivers two outputs for short gender-ambiguous queries (Johnson, 2020b). However, among languages with grammatical gender, the service is available only for English-Spanish.
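
At the orchestration level, the two usage modes described above can be sketched as follows; the gender identification and re-inflection components are dummy stand-ins for the Arabic morphological models actually used:

    def post_edit(mt_output, identify_gender, reinflect_to, speaker_gender=None):
        """Return the re-inflected output, or both forms when no metadata exist."""
        if speaker_gender is not None:            # metadata supplied
            return [reinflect_to(mt_output, speaker_gender)]
        detected = identify_gender(mt_output)     # two-step mode: detect, then flip
        flipped = "F" if detected == "M" else "M"
        return [mt_output, reinflect_to(mt_output, flipped)]

    # Dummy components for illustration only:
    identify = lambda sentence: "M"
    reinflect = lambda sentence, gender: f"{sentence} [re-inflected: {gender}]"
    print(post_edit("translated sentence", identify, reinflect))       # both forms
    print(post_edit("translated sentence", identify, reinflect, "F"))  # one form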
                                                                  of overall translation quality, without specific anal-
In light of the above, we remark that there is no conclusive state-of-the-art method for mitigating bias. The discussed interventions in MT tend to respond to specific aspects of the problem with modular solutions, but if and how they can be reconciled within the same MT system remains unexplored. Besides, gender bias in MT is a socio-technical problem. We thus highlight that engineering interventions alone are not a panacea (Chang, 2019) and should be integrated with the long-term multidisciplinary commitment and practices (D'Ignazio and Klein, 2020; Gebru, 2020) necessary to address bias in our community, hence in its artifacts, too.

6  Conclusion and Key Challenges

As disparate studies confronting gender bias in MT are rapidly emerging, in this paper we presented them within a unified framework to critically overview current conceptualizations and approaches to the problem. Since gender bias is a multifaceted and interdisciplinary issue, in our discussion we integrated knowledge from related disciplines, which can be instrumental to guide future research and make it thrive. We conclude by suggesting several directions that can help this field move forward.

Model de-biasing. Neural networks rely on easy-to-learn shortcuts or "cheap tricks" (Levesque, 2014), as picking up on spurious correlations offered by training data can be easier for machines than learning to actually solve a specific task. What is "easy to learn" for a model depends on the inductive bias (Sinz et al., 2019; Geirhos et al., 2020) resulting from architectural choices, training data and learning rules. We think that explainability techniques (Belinkov et al., 2020) represent a useful tool to identify spurious cues (features) exploited by the model during inference. Discerning them can provide the research community with guidance on how to improve models' generalisation by working on the data, architectures, loss functions and optimizations. For instance, data responsible for spurious features (e.g. stereotypical correlations) might be recognized and their weight at training time lowered (Karimi Mahabadi et al., 2020). Besides, state-of-the-art architectural choices and algorithms in MT have mostly been studied in terms of overall translation quality, without specific analyses regarding gender translation. For instance, current systems segment text into subword units with statistical methods that can break the morphological structure of words, losing relevant semantic and syntactic information in morphologically-rich languages (Niehues et al., 2016; Ataman et al., 2017). Several languages show complex feminine forms, typically derivative and created by adding a suffix to the masculine form, like Lehrer/Lehrerin (de) and studente/studentessa (it). It would be relevant to investigate whether, compared to other segmentation techniques, statistical approaches disadvantage (rarer and more complex) feminine forms. The MT community should not overlook focused hypotheses of such kind, as they expand our comprehension of the gender bias conundrum.
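
One way to test this hypothesis is a simple probe comparing how many subword pieces masculine and feminine forms receive; the toy vocabulary below merely illustrates the asymmetry, whereas a real study would query a trained BPE or unigram tokenizer over a full lexicon:

    TOY_VOCAB = {"Lehrer", "in", "studente", "ssa", "stud", "ente"}

    def segment(word, vocab):
        """Greedy longest-match segmentation against a subword vocabulary."""
        pieces, i = [], 0
        while i < len(word):
            for j in range(len(word), i, -1):
                if word[i:j] in vocab or j == i + 1:  # fall back to characters
                    pieces.append(word[i:j])
                    i = j
                    break
        return pieces

    for masc, fem in [("Lehrer", "Lehrerin"), ("studente", "studentessa")]:
        print(masc, segment(masc, TOY_VOCAB), "|", fem, segment(fem, TOY_VOCAB))
    # Here the feminine forms split into more pieces than their masculine
    # counterparts, the pattern such a probe would look for at scale.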

Non-textual modalities. Gender bias in non-textual automatic translation (e.g. audiovisual) has been largely neglected. In this sense, ST represents a small niche (Costa-jussà et al., 2020a). For the translation of speaker-related gender phenomena, Bentivogli et al. (2020) prove that direct ST systems exploit the speaker's vocal characteristics as a gender cue to improve feminine translation. However, as addressed by Gaido et al. (2020), relying on physical gender cues (e.g. pitch) for such a task implies reductionist gender classifications (Zimman, 2020) within systems, making them potentially harmful for a diverse range of users. Similarly, although image-guided translation has been claimed useful for gender translation as it relies on visual inputs for disambiguation (Frank et al., 2018; Ive et al., 2019), it could bend toward stereotypical assumptions about appearance. Further research should explore such directions to identify potential challenges and risks, drawing on work on bias in image captioning (van Miltenburg, 2019) and on consolidated studies from the fields of automatic gender recognition and human-computer interaction (HCI) (Hamidi et al., 2018; Keyes, 2018; May, 2019).

Beyond Dichotomies. Apart from a few notable exceptions for English NLP tasks (Manzini et al., 2019; Cao and Daumé III, 2020; Sun et al., 2021), and one in MT (Saunders et al., 2020), the discussion around gender bias has been reduced to the binary masculine/feminine dichotomy. Although research in this direction is currently hampered by the absence of data, we invite considering inclusive solutions and exploring nuanced dimensions of gender. Starting from language practices, Indirect Non-binary Language (INL) overcomes gender specifications (e.g. using service or humankind rather than waiter/waitress or mankind).13 Whilst more challenging, INL can be achieved also for grammatical gender languages (Motschenbacher, 2014; Lindqvist et al., 2019), and it is endorsed for official EU documents.14 Accordingly, MT models could be brought to avoid binary forms and move toward gender-unspecified solutions; for instance, adversarial networks including a discriminator that classifies the speaker's linguistic expression of gender (masculine or feminine) could be employed to "neutralize" speaker-related forms (Li et al., 2018; Delobelle et al., 2020). On the other side, Direct Non-binary Language (DNL) aims at increasing the visibility of non-binary individuals via neologisms and neomorphemes (Bradley et al., 2019; Papadopoulos, 2019). With DNL starting to circulate (Shroy, 2016; Santiago, 2018; López, 2019), the community is presented with the opportunity to engage with the creation of more inclusive data.

Finally, as already highlighted in law and social science theory, discrimination can arise from the intersection of multiple identity categories (e.g. race and gender) (Crenshaw, 1989), which are not additive and cannot always be detected in isolation (Schlesinger et al., 2017). Following the MT work by Hovy et al. (2020), as well as other intersectional analyses from NLP (Herbelot et al., 2012; Jiang and Fellbaum, 2020) and AI-related fields (Buolamwini and Gebru, 2018), future studies may account for the interaction of gender attributes with other sociodemographic classes.

13 INL suggestions have also been recently implemented within Microsoft text editors (Langston, 2020).
14 See the Europarl guidelines.

Human-in-the-loop. Research on gender bias in MT is still restricted to lab tests. As such, unlike other studies relying on participatory design (Turner et al., 2015; Cercas Curry et al., 2020; Liebling et al., 2020), the advancement of the field is not measured against the experiences that people report in specific deployment contexts. However, these are fundamental considerations to guide the field forward and, as HCI studies show (Vorvoreanu et al., 2019), to concretely support the creation of gender-inclusive technology. Also, we invite the whole development process to be paired with bias-aware research methodology (Havens et al., 2020) and HCI approaches (Stumpf et al., 2020), which help operationalize sensitive attributes like gender (Keyes et al., 2021). Finally, MT is not only built for people, but also by people. Thus, it is vital to reflect on the implicit biases and backgrounds of the people involved in MT pipelines at all stages, and on how these could be reflected in the model. This means starting from bottom-level countermeasures: engaging with translators (De Marco and Toto, 2019; Lessinger, 2020) and annotators (Waseem, 2016; Geva et al., 2019), considering everyone's subjective positionality and, crucially, also the lack of diversity within technology teams (Schluter, 2018; Waseem et al., 2020).