Societal Biases in Language Generation: Progress and Challenges

Emily Sheng¹, Kai-Wei Chang², Premkumar Natarajan¹, Nanyun Peng¹,²
¹Information Sciences Institute, University of Southern California
²Computer Science Department, University of California, Los Angeles
{ewsheng,pnataraj}@isi.edu, {kwchang,violetpeng}@cs.ucla.edu

Abstract

Technology for language generation has advanced rapidly, spurred by advancements in pre-training large models on massive amounts of data and the need for intelligent agents to communicate in a natural manner. While techniques can effectively generate fluent text, they can also produce undesirable societal biases that can have a disproportionately negative impact on marginalized populations. Language generation presents unique challenges for biases in terms of direct user interaction and the structure of decoding techniques. To better understand these challenges, we present a survey on societal biases in language generation, focusing on how data and techniques contribute to biases and progress towards reducing biases. Motivated by a lack of studies on biases from decoding techniques, we also conduct experiments to quantify the effects of these techniques. By further discussing general trends and open challenges, we call to attention promising directions for research and the importance of fairness and inclusivity considerations for language generation applications.

1 Introduction

Natural language generation (NLG) is a suite of techniques that enables the generation of human-readable language for different goals. These techniques are the core components of applications such as virtual assistants, chat bots, automatic translators, summarizers, and creative language composers. Recent advances in techniques for language generation (e.g., GPT (Radford et al., 2018), GPT-2 (Radford et al., 2019), GPT-3 (Brown et al., 2020), TransformerXL (Dai et al., 2019), XLNet (Yang et al., 2019)) powered by Transformers (Vaswani et al., 2017) and an increasing repository of available data have created more capable applications. This has, in turn, channeled more interest and effort into developing NLG techniques.

We emphasize the importance of better understanding how societal biases manifest in NLG techniques, because NLG applications directly interact with many different users to generate novel content in various domains (e.g., chat bots for health, education, and customer support). However, when techniques are less effective or detrimental for marginalized populations, these techniques can inadvertently become gatekeepers of those populations for generation and associated language technologies. For example, an educational chat bot that produces more negative responses for topics about a specific ethnicity will discourage users of that ethnicity from interacting with the chat bot. While it is generally important to study the societal impact of NLP and AI techniques, we argue that the direct user impact of NLG techniques makes it especially important to carefully quantify the impact.

Motivated by the importance of fairness in language generation, we present the first comprehensive survey on societal biases in language generation. By enumerating how NLG techniques contribute to biases and examining progress towards bias analysis and mitigation, we contextualize the discussion of broader trends and challenges. Specifically, we focus on techniques for NLG tasks, i.e., tasks that generate a sequence of text.¹ Finding a lack of studies on biases from decoding techniques, we additionally present an experimental study to quantify the effects of various decoding techniques.

Before we delve into the details of biases in language generation, we first position our survey in the context of other relevant surveys and position papers.

¹ Although bi-directional language models like BERT (Devlin et al., 2019) can also be used for auto-regressive generation (Wang and Cho, 2019; Chen et al., 2020), traditional auto-regressive models are still typically of better quality and more widely used for generation (Shwartz et al., 2020). Thus, we limit the scope of this survey to the latter models.

Demo. Dim.   NLG Task        Works
    Gender       Autocomplete Bordia and Bowman (2019); Qian et al. (2019); Solaiman et al. (2019); Sheng et al. (2019,
                              2020); Vig et al. (2020); Yeo and Chen (2020); Brown et al. (2020); Dhamala et al. (2021);
                              Schick et al. (2021); Nozza et al. (2021); Kirk et al. (2021)
                 Dialogue     Henderson et al. (2018); Dinan et al. (2020a); Liu et al. (2020a,b); Cercas Curry et al. (2020);
                              Sheng et al. (2021a,b)
                 MT           Vanmassenhove et al. (2018); Elaraby et al. (2018); Prates et al. (2019); Stanovsky et al.
                              (2019); Escudé Font and Costa-jussà (2019); Cho et al. (2019); Moryossef et al. (2019);
                              Saunders and Byrne (2020); Saunders et al. (2020); Kocmi et al. (2020); Costa-jussà and
                              de Jorge (2020); Costa-jussà et al. (2020); Basta et al. (2020); Farkas and Németh (2020);
                              Stafanovičs et al. (2020); Gonen and Webster (2020); Hovy et al. (2020); Roberts et al.
                              (2020); Cho et al. (2021); Savoldi et al. (2021); Renduchintala and Williams (2021); Choubey
                              et al. (2021); Saunders et al. (2021); Tomalin et al. (2021)
                 Re-writing   Habash et al. (2019); Zmigrod et al. (2019); Alhafni et al. (2020); Sun et al. (2021)
    Profession   Autocomplete Huang et al. (2020); Dhamala et al. (2021)
    Race         Autocomplete Solaiman et al. (2019); Sheng et al. (2019, 2020); Groenwold et al. (2020); Brown et al.
                              (2020); Dhamala et al. (2021); Schick et al. (2021); Kirk et al. (2021)
                 Dialogue     Sheng et al. (2021a,b)
    Religion     Autocomplete Solaiman et al. (2019); Brown et al. (2020); Dhamala et al. (2021); Kirk et al. (2021); Abid
                              et al. (2021)
    Sexuality    Autocomplete Sheng et al. (2019, 2020); Kirk et al. (2021)
                 Dialogue     Sheng et al. (2021a)
    Other        Autocomplete Shwartz et al. (2020); Peng et al. (2020); Huang et al. (2020); Dhamala et al. (2021); Kirk
                              et al. (2021)
                 Dialogue     Sheng et al. (2021a)
                 Re-writing   Pryzant et al. (2020); Ma et al. (2020)

Table 1: Existing bias studies on different demographic dimensions in various NLG tasks: autocomplete genera-
tion, dialogue generation, machine translation (MT), and text re-writing.
Sun et al. (2019) present a focused survey on mitigating gender biases and Shah et al. (2020) categorize sources of biases—both largely focus on natural language understanding (NLU) tasks, while we examine biases in NLG tasks. Additionally, Blodgett et al. (2020) urge for more explicitly tying “biases” in NLP to societal normative definitions of biases and social hierarchies; with their recommendations in mind, we discuss the negative impacts of biases in NLG techniques.

Our contributions are a comprehensive survey on societal biases in language generation and an experimental study on biases from decoding techniques. To start, we describe classes of NLG tasks (Sec. 2) and subsequently examine examples of biases and harms in NLG (Sec. 3). We then discuss NLG techniques that facilitate biases, including a study of decoding techniques (Sec. 4). Sec. 5 highlights progress and challenges, and Sec. 6 presents open problems and proposals. We hope this survey brings more visibility to the importance of carefully considering different components of NLG pipelines for potential biases and mitigation methods.

2 Language Generation Tasks

To begin, we categorize generation tasks and introduce existing bias studies relevant to each task. NLG tasks broadly fall into two categories: those that generate text continuations conditioned on some prompt and those that transform text from one form to another. Table 1 organizes various bias-related works for NLG tasks.

2.1 Continuation Generation Tasks

The continuation class includes autocomplete and dialogue generation, where the goal is to generate text that is coherent and relevant to a prompt.

Autocomplete Generation We use the term autocomplete generation to refer to conditional generation directly from language models. Language models are the core components for many NLG and NLU tasks, and this task enables directly quantifying biases in large, pre-trained language models (Bordia and Bowman, 2019; Sheng et al., 2019; Solaiman et al., 2019; Brown et al., 2020). Existing works analyzing biases in autocomplete generation have mostly examined Transformer-based models, including GPT (Shwartz et al., 2020), GPT-2 (Solaiman et al., 2019; Sheng et al., 2019, 2020; Shwartz et al., 2020; Vig et al., 2020; Yeo and Chen, 2020; Huang et al., 2020; Dhamala et al., 2021; Schick et al., 2021), GPT-3 (Brown et al., 2020), CTRL (Dhamala et al., 2021), TransformerXL (Shwartz et al., 2020; Vig et al., 2020; Huang et al., 2020), and XLNet (Shwartz et al., 2020; Vig et al., 2020; Yeo and Chen, 2020), though Bordia and Bowman (2019); Qian et al. (2019) also look at LSTM-based models.
Dialogue Generation Dialogue generation is conditioned on user inputs and can be for specific domains (e.g., health, customer service) and tasks (e.g., behavior intervention, booking flights) or general chit-chat. These dialogue applications directly interact with users, and any propagated biases directly affect user behavior and actions. In terms of recurrent dialogue models, Henderson et al. (2018) analyze biases in hierarchical recurrent encoder-decoder architectures and Liu et al. (2020a,b) analyze LSTM-based encoder-decoder models. Other works on dialogue biases (Dinan et al., 2020a; Sheng et al., 2020, 2021b) focus on Transformer-based models such as DialoGPT (Zhang et al., 2020) and other custom architectures.

2.2 Transformation Generation Tasks

The transformation class includes machine translation and various formulations of text re-writing. The general goal of these tasks is to transform text into a form with targeted properties.

Machine Translation Translation is the task of transforming text between languages while preserving the meaning. Existing works on biases in machine translation have almost exclusively focused on issues of gender biases² in a variety of academic and commercial systems. The use of grammatical gender in some languages and not in others can expose unwanted gender associations (e.g., for different occupations) through translation (Prates et al., 2019). Earlier works by Vanmassenhove et al. (2018) and Elaraby et al. (2018) study LSTM-based encoder-decoder translation systems, and more recent works examine Transformer-based architectures (Escudé Font and Costa-jussà, 2019; Stanovsky et al., 2019; Saunders and Byrne, 2020; Saunders et al., 2020; Costa-jussà and de Jorge, 2020; Basta et al., 2020; Stafanovičs et al., 2020; Renduchintala and Williams, 2021; Choubey et al., 2021; Saunders et al., 2021; Tomalin et al., 2021). While Google Translate³ has been the most popular commercial system to analyze for gender biases (Prates et al., 2019; Moryossef et al., 2019; Stanovsky et al., 2019; Cho et al., 2019; Farkas and Németh, 2020), Stanovsky et al. (2019) also study Microsoft Translator,⁴ Amazon Translate,⁵ and SYSTRAN;⁶ Cho et al. (2019) additionally look at Naver Papago⁷ and Kakao Translator,⁸ and Cho et al. (2021) also examine Yandex.⁹

Re-writing We use the term re-writing to refer to tasks of revising specific words and phrases in the original text to be more aligned with a targeted attribute. Specifically, there have been studies on re-inflection (Habash et al., 2019; Zmigrod et al., 2019; Alhafni et al., 2020) and re-writing text to use neutral viewpoints (Pryzant et al., 2020), gender-neutral English (Sun et al., 2021), or more agency (Ma et al., 2020). These tasks typically rely on custom encoder-decoder models.

2.3 Other Tasks

There are other NLG tasks, such as the continuation tasks of story and poetry generation, and the transformation tasks of abstractive summarization and paraphrase generation. However, these other NLG tasks are not yet well-studied in the context of societal biases.¹⁰

3 Biases and their Negative Impacts

In this section, we introduce how existing studies of biases in NLG tasks commonly quantify biases and their negative impacts.

3.1 Bias Definitions and Metrics

In the context of AI fairness, the term “bias” commonly refers to skews that result in undesirable impacts (Crawford, 2017) and is quantifiable with some metric. There are relatively more existing studies on biases in NLU tasks, where it is arguably simpler to define bias metrics, since we can intuitively compare the accuracy of the task (e.g., coreference resolution, hate speech detection) for different demographics. Language generation tasks often involve stochastic generation of open-ended and lengthy texts, traits that are not directly compatible with traditional algorithmic bias definitions (e.g., equalized odds, equal opportunity, demographic parity (Dwork et al., 2012; Hardt et al., 2016)).

² For a detailed survey of gender bias in machine translation, we refer readers to Savoldi et al. (2021).
³ https://translate.google.com
⁴ https://www.bing.com/translator
⁵ https://aws.amazon.com/translate
⁶ https://www.systransoft.com
⁷ https://papago.naver.com
⁸ https://translate.kakao.com
⁹ https://translate.yandex.com
¹⁰ Lucy and Bamman (2021) is an exception that analyzes gender in generated stories. While there are studies of biases in poetry generation and summarization, they focus on non-NLG biases: Sheng and Uthus (2020) investigate biases in a poetry composition system, but in the context of information retrieval; Celis and Keswani (2020) analyze biases in extractive summarization.
Because of the difficulty in defining metrics, existing works define bias loosely as demographic inequality and use intermediate proxy metrics to comparatively measure bias. Examples include:
• Regard Ratio: negative-neutral-positive regard score ratios of text generated from bias-inducing prompts (Sheng et al., 2019)
• Sentiment Ratio: negative-neutral-positive sentiment score ratios of text generated from African American English (AAE) versus White-Aligned English (WAE) prompts (Groenwold et al., 2020)
• Individual and Group Fairness through Sentiment: comparisons of the sentiment distributions of generated text across demographics and prompts (Huang et al., 2020)
• Gendered Word Co-occurrence Score: mean and standard deviations of the absolute log ratio of probabilities P(word|female terms) to P(word|male terms) across all words in generated text (Bordia and Bowman, 2019)
There are also metrics for other bias evaluation setups in continuation generation tasks involving sentiment (Shwartz et al., 2020), the ratio of gendered words (Solaiman et al., 2019; Vig et al., 2020; Dinan et al., 2020a), and other novel metrics (Peng et al., 2020; Yeo and Chen, 2020). Studies of biases in transformation generation tasks favor metrics of accuracy in terms of successfully transforming text to have a desired property. We present a more thorough comparison of metrics in Section 5.4.

Bias metrics can also be categorized by how they define associations between demographic group attributes and text. Biases can be towards people described in text, people who produce the text, or people to whom the text is addressed (Dinan et al., 2020b). Most existing works define bias metrics through the first association—these biases are relatively easier to analyze, since both the demographic and the textual signals of bias are encapsulated within the text. There are also works that define biases towards people who produce the text (Groenwold et al., 2020) or people to whom the text is addressed (Sheng et al., 2021b), though there are relatively fewer works that study these latter associations.

3.2 Negative Impacts

Biases in NLG techniques are important to study because they can result in harmful, negative impacts. We survey detrimental representational¹¹ and allocational¹² impacts (Crawford, 2017; Barocas et al., 2017; Blodgett et al., 2020) used to motivate existing studies of bias in NLG tasks, finding limited examples. While representational impacts are sometimes cited, it is difficult to measure the extent of the impacts. Additionally, techniques for effective NLG are relatively new, and existing studies have limited knowledge of potential allocational impacts. Finally, biases in NLG tasks give rise to a third type of negative impacts, which we call vulnerability impacts.

Representational Impacts The works in Table 1 motivate (to varying degrees) studying biases in NLG through potential negative representational impacts, in the form of propagating stereotypes, misrepresentations, or denigrations of social groups. For example, Sheng et al. (2019) enumerate how generated text can propagate varying social perceptions of different demographics, and Prates et al. (2019) discuss how occupation-related gender biases could propagate stereotypes in translation. However, it is difficult to quantify the effects of representational impacts;¹³ while such impacts may be measured indirectly (e.g., by analyzing allocational impacts), we suggest long-term, interdisciplinary collaborations to explore the direct effects of these representational impacts.

Allocational Impacts Harmful allocational impacts result from an unequal allocation of resources across groups. Since effective NLG techniques based on large Transformer models (Vaswani et al., 2017) are relatively new, most of the existing works on biases in NLG that list possible impacts only analyze direct representational consequences. A real example of a negative allocational impact is when machine translation errors lead to arrests (Ong, 2017). In general, technologies that are less effective or detrimental for certain populations become barriers that actively prevent those populations from using the technology, leading to diminished opportunities in jobs, education, health, etc. We discuss more details in Section 4.5. With continuous technological advances, more organizations will turn to effective NLG techniques, making it imperative to start setting norms to reduce harmful allocational impacts (Tamkin et al., 2021).

¹¹ Unfair representations of different groups.
¹² Unfair allocation of resources.
¹³ Kay et al. (2015) is a rare example that explicitly studies the effect of representational impacts in image search.
Vulnerability Impacts Open-domain generation tasks can amplify a group’s vulnerability to manipulation and harm, which is an intermediate impact that makes a group more susceptible to representational and allocational impacts. For example, privacy-related issues (Carlini et al., 2020), misinformation (Levy et al., 2021), or radicalizing views in generated text could make a group more likely to be attributed to specific stereotypes (e.g., through action guided by misinformation) or end up with diminished opportunities (e.g., by having personal data exposed and misused). Separately identifying vulnerability impacts could help facilitate recognition of other negative impacts.

4 Contributors to NLG Biases

In a pipeline from data collection to evaluation for an NLG task, each component could propagate biases.¹⁴ We emphasize the ways in which data, model architecture, decoding, evaluation, and deployment uniquely exacerbate biases in generation tasks. Additionally, we present an empirical study to show how measured biases in generated text can vary based on decoding technique.

4.1 Biases from Data

Modern NLP models often rely on large pre-trained language models, which in turn rely on a large collection of data to learn explicit and implicit associations. Several recent pre-trained language models used for NLG tasks, e.g., T5 (Raffel et al., 2020) and GPT-3 (Brown et al., 2020), are trained on the largest datasets used for any models. These large models for generation are commonly trained on web data, which is known to contain biased language (e.g., Ferrer et al. (2021) discover gender, religion, and ethnic biases in Reddit communities). While preprocessing is often included to filter out malformatted data and explicitly negative content (e.g., bad words and offensive phrases), those are generally the only efforts to reduce biases and associated impacts. Furthermore, by filtering out all words deemed “bad”, Bender et al. (2021) warns that we remove the discourse of marginalized populations. Paullada et al. (2020), Bender and Friedman (2018), and Gebru et al. (2018) provide more comprehensive surveys and frameworks that focus on aspects of data creation and management that could lead to biases, and we refer readers to their works for more discussion. In the context of translation, Cho et al. (2021) find that more data can increase translation fluency but may also make the system more biased.

4.2 Biases from Model Architecture

There are relatively few studies that examine model architectural properties that could lead to biases. We discuss the few efforts towards understanding model biases in NLG tasks and emphasize the need for more to generalize. For autocomplete generation, Vig et al. (2020) analyze GPT-2 variants through a causal mediation analysis, finding that larger models contain more gender bias, and bias tends to be concentrated in a small number of neurons and attention heads. Silva et al. (2021) observe amplified biases in distilled versus original models. For machine translation, Costa-jussà et al. (2020) note that language-specific architectures are less biased because they encode more gender information than shared language encoder-decoder architectures. Studies like the aforementioned are useful for designing targeted bias mitigation methods (e.g., controlled generation to target specific attention heads or regularization to retain gender information). However, more evidence would be needed to generalize findings across models.¹⁵

4.3 Biases from Decoding

While NLU and NLG models have structural similarities, NLG tasks uniquely use search or sampling techniques at inference time to generate text. Popular techniques include:
• Greedy Search: at each time step, choose the word with the highest probability.
• Beam Search: at each time step, keep the top b hypotheses with highest probabilities; eventually pick the hypothesis with the highest probability.
• Top-k sampling (Fan et al., 2018): at each time step, re-distribute the probability mass of the top k words with highest probabilities and sample.
• Nucleus sampling (Holtzman et al., 2019): at each time step, re-distribute the probability mass of the smallest set of words with a cumulative probability exceeding p and sample.
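As a minimal sketch of how the two sampling strategies above reshape a next-token distribution (operating on a plain probability vector, independent of any particular model or toolkit):

```python
import numpy as np

def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero out the rest, renormalize."""
    probs = np.asarray(probs, dtype=float)
    top = np.argsort(probs)[::-1][:k]            # indices of the top-k tokens
    filtered = np.zeros_like(probs)
    filtered[top] = probs[top]
    return filtered / filtered.sum()

def nucleus_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability
    exceeds p (Holtzman et al., 2019), zero out the rest, renormalize."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # smallest prefix exceeding p
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

rng = np.random.default_rng(0)
next_token_probs = np.array([0.5, 0.2, 0.15, 0.1, 0.05])
print(rng.choice(5, p=top_k_filter(next_token_probs, k=2)))
print(rng.choice(5, p=nucleus_filter(next_token_probs, p=0.8)))
```

Greedy search corresponds to taking the argmax of the same vector at each step, and beam search extends the comparison from single tokens to whole partial hypotheses.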
More constrained forms of generation such as machine translation generally use variations of beam search; however, preferred decoding techniques are more varied for open-domain generation. Despite variations in fluency and diversity between deterministic versus stochastic, search versus sampling procedures, there are limited studies (Roberts et al., 2020) on how different decoding properties affect biases in generation.

¹⁴ Task formulation and application deployment are also part of NLG task pipelines (Kiritchenko et al., 2020), though we do not focus on biases in these areas.
¹⁵ We also refer the reader to the work of Park et al. (2018) that discusses biases in NLU tasks from model components that “attend” to specific words (e.g., through attention or pooling), which could be applicable to NLG tasks as well.
A Study on Biases from Decoding To study how decoding techniques affect biases in generation, we use existing NLG bias metrics to evaluate text generated from different decoding methods.¹⁶ We examine autocomplete generations from GPT, GPT-2, and XLNet, using the decoding techniques from Section 4.3. We evaluate with the following bias metrics: regard ratios (Sheng et al., 2019), sentiment ratios (Groenwold et al., 2020), individual and group fairness through sentiment scores (Huang et al., 2020), and gendered word co-occurrence scores (Bordia and Bowman, 2019) (as introduced in Section 3). More experimental details can be found in the Appendix.

In Section 5.4, we distinguish between relative and absolute score metrics to examine evaluation differences between NLG tasks. Here, we organize our results into these categories to generalize trends about decoding techniques. The ratio-based metrics are relative score metrics, since evaluation relies on comparing ratios between demographics. The latter three metrics are absolute score metrics that have target values of zero indicating no bias.

For the relative score metrics, search and sampling techniques generate similar outcomes. An interesting result between sampling techniques for the regard metric is that nucleus sampling is less biased yet more negative than top-k sampling. For the absolute score metrics, we find that beam search is the most unbiased technique, closely followed by greedy search and then top-k and nucleus sampling. Through our study, we discover that text diversity is not accounted for in any of the bias metrics, yet diversity can be a confounding factor. Specifically, beam search is the least diverse,¹⁷ followed by greedy search, top-k sampling, then nucleus sampling. Results indicate that the less diverse search techniques lead to better scores for individual fairness, group fairness, and gendered word co-occurrence ratios.

We hope these experimental results will encourage researchers to document sampling techniques, consider how metrics can be formulated to evaluate both bias and other factors of generation quality, and inspire more comprehensive studies.¹⁸

4.4 Biases from Evaluation

Biases can arise from both general evaluations and bias evaluations for NLG tasks.

General Evaluations Current standards for NLG evaluation can reinforce certain types of language and penalize others. For example, using perplexity as measured by models pre-trained on datasets largely containing non-AAE text leads to an unfair evaluation of AAE text. Additionally, the subjectivity of generation tasks means that much of NLG evaluation depends on human labels. Since humans from different backgrounds are accustomed to different societal norms and linguistic variations, the choice of human annotators could drastically influence the evaluation standards for generated text.
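As a small illustration of the perplexity concern, the sketch below scores two inputs with a publicly available GPT-2 checkpoint via the Hugging Face transformers library; the paired sentences are hypothetical stand-ins for the kinds of AAE/WAE pairs curated by Groenwold et al. (2020), and the exact numbers depend on the checkpoint used.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text):
    # Cross-entropy of the text under the language model, exponentiated.
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

# Hypothetical paired prompts; real evaluations use curated dialect pairs.
print(perplexity("We finna go to the store."))
print(perplexity("We are about to go to the store."))
```

If the model assigns a noticeably higher perplexity to the dialect sentence even though both are fluent for their respective speakers, that gap is precisely the evaluation skew described above.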
Bias Evaluations It is difficult to evaluate societal biases in NLG tasks because NLG can be open-domain, and there are many different notions of biases from various backgrounds and cultures (Sambasivan et al., 2021). These factors lead to the use of a variety of metrics to evaluate biases (Section 3). To avoid experimental bias in evaluation, we recommend using multiple metrics to cover many types of biases at various granularities. We identify three points to emphasize the need for more comprehensive evaluations. First, most existing works on biases in generation center around one demographic dimension (often gender and from a Western perspective, e.g., using standard Western occupations). While there has been no comprehensive study on whether mitigating biases for one demographic dimension (e.g., gender) may exacerbate biases for others (e.g., race, intersectional identities), this is a possibility we must consider. Second, most works only evaluate bias through a single intermediate proxy; however, different metrics are defined at different granularities (e.g., sentiment is sentence-level, gendered word ratio is word-level). Finally, different evaluation datasets test for specific types of biases and are influenced by the backgrounds of the curators. Collectively evaluating biases across demographic dimensions and granularities can thus help reduce experimentally-biased evaluations.

¹⁶ Code at https://github.com/ewsheng/decoding-biases.
¹⁷ We report average generated text length and vocabulary sizes to estimate diversity in Appendix Table 4.
¹⁸ Results are summarized in Appendix Tables 2, 3, and 5.

4.5 Biases from Deploying Systems

In terms of deploying NLG systems, there is a feedback loop that benefits some communities and further disadvantages others. While this feedback loop is not unique to NLG systems, these systems that directly interact with users make good cautionary examples.

First, many deployed language technologies require internet access both to use and contribute feedback, thus favoring the views and languages of those privileged with this access. For example, anyone can contribute feedback to Google Translate, but if contributions and subsequent improvements are focused on high-resource languages, this further increases the accuracy gap between the high and low resource languages, diminishing opportunities for speakers of the low resource languages, i.e., representation disparity (Hashimoto et al., 2018).

Second, those who are unable to achieve their goals from using these language technologies (e.g., unsuccessful translation, unhelpful or offensive chat bot) are less likely to continue using the technology. This means that there is less feedback and data to improve the technologies, reinforcing the decreased effectiveness for certain populations, i.e., disparity amplification (Hashimoto et al., 2018).

One way we might intervene is to follow a more targeted approach for data and feedback collection, e.g., from excluded populations. However, we acknowledge that this remains a difficult task and that it is also necessary to be aware of “community goals” and other factors in order to co-design language technologies without inflicting additional harm on marginalized populations (Bird, 2020).

5 Progress, Trends, and Challenges

Following the discussion of contributors to biases, we survey trends and challenges for reducing biases in NLG.

5.1 Data Methods

Data-based methods for both bias analysis and mitigation use the general idea of counterfactual data augmentation (CDA) (Lu et al., 2020) to curate sets of counterfactual prompts. A common method for analysis is using targeted prompts to induce NLG models to reveal biases. For data-based mitigation, existing works focus on fine-tuning large models or training smaller models with datasets that are balanced with respect to targeted demographics.
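To sketch the word-swap version of the CDA idea (a simplification; Lu et al. (2020) and follow-up work handle morphology, names, and grammatical agreement more carefully), assuming a small hand-written dictionary of paired gendered terms:

```python
# Hypothetical swap list; real CDA implementations use larger curated pairs
# and handle capitalization, morphology, and proper names.
SWAPS = {"he": "she", "she": "he", "his": "her", "her": "his",
         "man": "woman", "woman": "man", "father": "mother", "mother": "father"}

def counterfactual(text):
    """Return a copy of `text` with gendered terms swapped."""
    return " ".join(SWAPS.get(tok, tok) for tok in text.split())

def augment(corpus):
    """Pair every original example with its gender-swapped counterfactual,
    producing a corpus balanced with respect to the swapped terms."""
    augmented = []
    for sentence in corpus:
        augmented.append(sentence)
        augmented.append(counterfactual(sentence))
    return augmented

print(augment(["the father said he is a doctor"]))
# ['the father said he is a doctor', 'the mother said she is a doctor']
```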
Curated Datasets Existing datasets to study biases in translation include parallel sentences tagged with speaker or subject gender information (Vanmassenhove et al., 2018; Habash et al., 2019) and datasets to study gender biases when translating from neutral references of a person (e.g., nurse in English, gender-neutral pronouns) to gendered instances (e.g., enfermera or enfermero in Spanish, gendered pronouns) (Cho et al., 2019; Stanovsky et al., 2019; Gonen and Webster, 2020; Kocmi et al., 2020). Renduchintala and Williams (2021) additionally provide a dataset to study translation of neutral references in unambiguous contexts. Other works present parallel corpora of biased versus unbiased framings and presuppositions (Pryzant et al., 2020) and AAE versus WAE equivalents (Groenwold et al., 2020). Sheng et al. (2019); Huang et al. (2020); Dhamala et al. (2021) additionally curate sets of prompts that can be used to evaluate biases in autocomplete generation.

Bias Analysis Most bias analyses of NLG tasks use prompts to probe for different biases in generated text, e.g., regarding social perception (Sheng et al., 2019), gender in translation (Prates et al., 2019), names (Shwartz et al., 2020), sentiment distribution (Huang et al., 2020), dialects (Groenwold et al., 2020), dialogue personas (Sheng et al., 2021a), or other notions of similarity across demographics (Yeo and Chen, 2020; Henderson et al., 2018). Vig et al. (2020) also use prompts to investigate gender biases, though they do so in the context of a causal mediation analysis. Furthermore, Prates et al. (2019) and Farkas and Németh (2020) compare pronoun gender biases in translations (induced with prompts) to real-world statistics.

Bias Mitigation Methods can broadly be classified into two categories based on the type of data applied. The first category encompasses methods that fine-tune or train on a balanced dataset to lessen the effects of the model relying on spurious correlations between imbalanced data and task performance. CDA has been applied to datasets used for continued or fresh training in dialogue generation (Dinan et al., 2020a; Liu et al., 2020a) as well as machine translation (Saunders and Byrne, 2020; Costa-jussà and de Jorge, 2020; Stafanovičs et al., 2020). The second category is methods that attach a short prefix at training time (Vanmassenhove et al., 2018; Basta et al., 2020; Alhafni et al., 2020) or inference time (Moryossef et al., 2019).

Challenges The size of state-of-the-art pre-trained models and varying definitions of biases in generation present difficulties for creating standardized datasets that are generally effective across biases and demographics. Moreover, it remains to be seen whether data-based mitigation is as effective for open-domain NLG tasks as it is for more constrained settings.
5.2 Training Methods

In addition to data-based mitigation, training-based mitigation is another popular class of methods to reduce biases in generation.

Bias Mitigation Several works that use training-based mitigation techniques rely on regularization (Bordia and Bowman, 2019; Qian et al., 2019; Huang et al., 2020; Liu et al., 2020a; Saunders and Byrne, 2020). There are also works that induce control by incorporating a bias control code through conditional training (Dinan et al., 2020a), by appending a target value to inputs during training (Ma et al., 2020), by using a normative classifier to produce reward values for backpropagation (Peng et al., 2020), or through adversarial training (Liu et al., 2020b). Other techniques include using debiased word embeddings (Escudé Font and Costa-jussà, 2019), identifying and editing out subjective words (Pryzant et al., 2020), and using Markov random fields to preserve morpho-syntactic agreement during reinflection (Zmigrod et al., 2019).
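To give a flavor of the control-code idea, the following schematic (in the spirit of the conditional-training setup of Dinan et al. (2020a), whose actual binning and training procedure differ) tags each training response with a token describing its genderedness so that a desired bin can be requested at generation time; the word lists and token format are illustrative assumptions.

```python
# Hypothetical word lists; curated lexicons are used in practice.
FEMALE_WORDS = {"she", "her", "woman", "girl"}
MALE_WORDS = {"he", "him", "his", "man", "boy"}

def gender_bin(response):
    """Map a response to a coarse control token, e.g. '<F1M0>' if it
    contains female-associated words but no male-associated words."""
    tokens = set(response.lower().split())
    f = int(bool(tokens & FEMALE_WORDS))
    m = int(bool(tokens & MALE_WORDS))
    return f"<F{f}M{m}>"

def add_control_codes(pairs):
    """Prepend the control token to each dialogue context, so the model
    learns p(response | context, control) and the code can be chosen
    freely (e.g., forced to '<F0M0>') at generation time."""
    return [(f"{gender_bin(resp)} {ctx}", resp) for ctx, resp in pairs]

print(add_control_codes([("how was your day", "she had a great day")]))
# [('<F1M0> how was your day', 'she had a great day')]
```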
Challenges The main challenge of bias mitigation through training methods is that it is costly and impractical to re-train models for new biases encountered. In fact, most of the techniques that rely on training from scratch use smaller architectures (exceptions are from larger institutions).

5.3 Inference Methods

While the existing literature on inference time methods for bias mitigation is sparse, decoding-based methods are a promising alternative to data- and training-based methods. Specifically, these methods are compatible with any pre-trained language model for generation without additional training. Given recent development of inference-time methods for control that can reduce toxicity (e.g., PPLM (Dathathri et al., 2019), GeDi (Krause et al., 2020), DExperts (Liu et al., 2021)), there is potential for extending these methods to bias mitigation.

Bias Mitigation For autocomplete and dialogue generation, Sheng et al. (2020) formulate bias triggers using gradient-based methods of Wallace et al. (2019). These triggers are appended to prompts during inference time to control text generation to be more equalized towards different demographics. For translation, Saunders and Byrne (2020) present a lattice rescoring procedure that creates gender-inflected search spaces to rescore text for more accurate translations, and Saunders et al. (2021) subsequently use this lattice structure to present more gendered options during beam search and rerank translation hypotheses according to gender criteria. For dialogue generation, Sheng et al. (2021b) introduce a constrained decoding method that uses n-gram similarity to guide generation away from ad hominems towards marginalized groups. For autocomplete generation, Schick et al. (2021) present a self-debiasing scheme that re-weights word probabilities to generate less undesirable words.
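The following is a minimal sketch of the general re-weighting idea behind such inference-time methods, not the actual self-debiasing algorithm of Schick et al. (2021) (which derives the penalty from the model's own estimate of how undesirable a continuation is); here, tokens flagged by some external word list or classifier are simply down-weighted before sampling.

```python
import numpy as np

def reweight(probs, undesirable_ids, penalty=0.1):
    """Scale down the probability of flagged tokens and renormalize.
    penalty=1.0 leaves the distribution unchanged; 0.0 bans the tokens."""
    probs = np.asarray(probs, dtype=float).copy()
    probs[list(undesirable_ids)] *= penalty
    return probs / probs.sum()

# Toy next-token distribution over 5 tokens, where token 2 is flagged
# (e.g., by a word list or an auxiliary classifier -- an assumption here).
next_token_probs = [0.4, 0.3, 0.2, 0.05, 0.05]
print(reweight(next_token_probs, undesirable_ids={2}))
# token 2 drops from 0.20 to roughly 0.024 of the renormalized mass
```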
Challenges Control methods at inference time could potentially steer the model into degenerate spaces, so it is important to also evaluate these methods for coherence, fluency, and task relevance.

5.4 Evaluation Methods

There are two types of evaluations: those that rely on absolute scores and those that rely on relative scores. Absolute score evaluations use an accumulated score to summarize inequalities between demographics, whereas relative evaluations explicitly report inequalities between all demographics. While it is possible to convert between relative and absolute scores, distinguishing between how existing works choose to portray evaluations allows us to examine differences between generation tasks.

Absolute Evaluations We find that the transformation class of generation tasks favors bias evaluation through absolute metrics, which is possible because these tasks involve relatively more constrained forms of generation. Examples of evaluation objectives through absolute scores include Peng et al. (2020) reducing non-normative generations, Ma et al. (2020) increasing the accuracy of the change in agency, Zmigrod et al. (2019) increasing the number of correct inflections, Huang et al. (2020) reducing individual and group fairness scores, and Sheng et al. (2021b) reducing the amount of ad hominems towards marginalized groups. Studies of gender bias in machine translation are well-suited to evaluations using absolute scores: many use BLEU and its variants to evaluate correct gender inflections and translations (Moryossef et al., 2019; Escudé Font and Costa-jussà, 2019; Elaraby et al., 2018; Habash et al., 2019; Alhafni et al., 2020) or accuracy on WinoMT
(Saunders and Byrne, 2020; Saunders et al., 2020; Kocmi et al., 2020; Costa-jussà and de Jorge, 2020; Costa-jussà et al., 2020; Basta et al., 2020; Choubey et al., 2021; Saunders et al., 2021).
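To make the distinction concrete, an absolute, distribution-level score in the spirit of the sentiment-based group fairness metric of Huang et al. (2020) can be sketched as follows; this is our simplified reading (the distance between each group's sentiment distribution and the pooled distribution, averaged over groups), and the original formulation differs in its details.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def group_fairness_score(sentiments_by_group):
    """Average Wasserstein-1 distance between each demographic group's
    sentiment scores and the pooled distribution; 0 indicates no gap."""
    pooled = np.concatenate(list(sentiments_by_group.values()))
    distances = [wasserstein_distance(scores, pooled)
                 for scores in sentiments_by_group.values()]
    return float(np.mean(distances))

# Hypothetical sentiment scores (in [0, 1]) of generations per group.
scores = {
    "group_a": np.array([0.9, 0.8, 0.7, 0.85]),
    "group_b": np.array([0.4, 0.3, 0.5, 0.35]),
}
print(group_fairness_score(scores))  # large gap -> score well above 0
```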
Relative Evaluations In terms of evaluation through relative scores, examples from existing works are mainly from continuation generation tasks. We infer that the less constrained, open-domain nature of continuation generation tasks makes it more preferable to evaluate mitigation through more flexible comparisons rather than absolute scores. For autocomplete generation, Sheng et al. (2019, 2020) and Groenwold et al. (2020) compare regard or sentiment scores across demographics, Shwartz et al. (2020) compare names across various intermediate metrics, Vig et al. (2020) measure proportional differences between the amount of bias under a gendered versus ambiguous reading, and Yeo and Chen (2020) compare occupations generated for different genders. Bias studies in dialogue generation use relative scores by comparing sentiment and offensive language discrepancies (Henderson et al., 2018; Liu et al., 2020a,b) and the percentage of gendered words (Dinan et al., 2020a).

Challenges A trade-off between framing biases as a relative or absolute metric is that relative metrics can be more flexibly aligned to normative concerns like social perception. Absolute metrics that look for ratios of gendered words or other indicator words assume that there is a set of words that captures all the differences between demographic groups, regardless of whether these differences are related to normative definitions of harm. There are also absolute metrics such as those of Huang et al. (2020) that can incorporate intermediate metrics that are more aligned with normative behavior, though these metrics reduce the notion of biases to a single value, which could erase historical inequalities between groups.

6 Open Problems and Proposals

As a fairly nascent area of exploration, the study of biases in language generation still poses many challenges. Throughout this paper, we discuss challenges associated with different components in a generation pipeline. With a heightened awareness of the relevant body of work, we conclude with recommendations for open problems.

Bias-Aware Data Curation Many works have highlighted the harms and problems when collecting training datasets with limited awareness for potential harms. Since effective models for NLG tasks are correlated with increasing training data sizes, biases in data collection (e.g., English-centric, drawn from popular Western media) remain a major contributor of biases that manifest in generation. Additionally, datasets used to study biases in generation can also be limited (e.g., only for binary gender classes). For more bias-aware data curation, we suggest diversifying datasets to include more viewpoints from various groups.

Understanding Trade-Offs Different methods for analysis, mitigation, and evaluation have unique trade-offs. Existing works have been relatively small-scale and limited to a small number of biases for specific tasks. Some useful questions to consider when developing methods to study generation biases are whether we can generalize methods to a diverse set of biases and a wide range of contexts. It is also important to consider formulating metrics that would jointly mitigate biases and preserve other desired text qualities (e.g., diversity, fluency).

Interactive and Continuous Learning The difficulties of measuring and mitigating biases in generation can be reduced with a general framework for interactive and continuous learning. Over time, such a system could learn from diverse opinions of what constitutes “fair” versus “unfair” generations across tasks. A unified framework would centralize and highlight the importance of studying biases in generation, as well as fuel the development of a more comprehensive set of evaluations that may be useful for large-scale studies of impact.

Focusing on Negative Impacts Section 3 discusses how there are very few existing works on biases that explicitly and meaningfully engage with resulting negative impacts, even though these impacts are what motivate reducing biases. By reframing efforts on reducing negative impacts rather than biases, we may be able to define metrics and progress that better correlate with reducing harm. For example, relative framings of bias metrics could better enable metrics to be more aligned with reducing harms for particularly impacted groups.

Acknowledgments

We would like to thank Seraphina Goldfarb-Tarrant, Sunipa Dev, Jason Teoh, members of the Plus Lab, and our anonymous reviewers for the many helpful suggestions that went into this paper.
Ethics and Broader Implications

In this work, we present a survey and commentary on the progress and challenges for studying societal biases in language generation.

Data  We do not check the quality of the datasets used to train popular language generation models (due to limited availability and size), though we do briefly mention problems that other works have found regarding large datasets that have been minimally filtered. Some of the surveyed datasets and metrics used for evaluating biases approximate binary genders using names typical of specific genders, and may be better re-formulated to avoid harms and to curate a more accurate representation of different genders. On the subject of genders, the majority of bias evaluation data also only cover binary genders; we point out this issue in our survey as well.

Techniques  Most of the techniques surveyed in this work are trained or bias-tested with data drawn from Western sources or cultures, since that is largely the focus of the existing body of work. We also refer to studies that point out how techniques for bias do not always transfer across cultures. Our decoding experiments could potentially fuel misuse by giving those with adversarial interests a better understanding of how decoding algorithms could thwart bias metrics, though we believe that transparency around these results outweighs the potential for misuse.

References

Abubakar Abid, Maheen Farooqi, and James Zou. 2021. Persistent anti-Muslim bias in large language models. arXiv preprint arXiv:2101.05783.

Bashar Alhafni, Nizar Habash, and Houda Bouamor. 2020. Gender-aware reinflection using linguistically enhanced neural models. In Proceedings of the Second Workshop on Gender Bias in Natural Language Processing, pages 139–150, Barcelona, Spain (Online). Association for Computational Linguistics.

Solon Barocas, Kate Crawford, Aaron Shapiro, and Hanna Wallach. 2017. The problem with bias: Allocative versus representational harms in machine learning. In 9th Annual Conference of the Special Interest Group for Computing, Information and Society.

Christine Basta, Marta R. Costa-jussà, and José A. R. Fonollosa. 2020. Towards mitigating gender bias in a decoder-based neural machine translation model by adding contextual information. In Proceedings of the Fourth Widening Natural Language Processing Workshop, pages 99–102, Seattle, USA. Association for Computational Linguistics.

Emily M. Bender and Batya Friedman. 2018. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6:587–604.

Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the dangers of stochastic parrots: Can language models be too big? Proceedings of FAccT.

Steven Bird. 2020. Decolonising speech and language technology. In Proceedings of the 28th International Conference on Computational Linguistics, pages 3504–3519, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach. 2020. Language (technology) is power: A critical survey of “bias” in NLP. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5454–5476, Online. Association for Computational Linguistics.

Shikha Bordia and Samuel R. Bowman. 2019. Identifying and reducing gender bias in word-level language models. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, pages 7–15, Minneapolis, Minnesota. Association for Computational Linguistics.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.

Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. 2020. Extracting training data from large language models. arXiv preprint arXiv:2012.07805.

L. Elisa Celis and Vijay Keswani. 2020. Dialect diversity in text summarization on Twitter. arXiv preprint arXiv:2007.07860.

Amanda Cercas Curry, Judy Robertson, and Verena Rieser. 2020. Conversational assistants and gender stereotypes: Public perceptions and desiderata for voice personas. In Proceedings of the Second Workshop on Gender Bias in Natural Language Processing, pages 72–78, Barcelona, Spain (Online). Association for Computational Linguistics.

Yen-Chun Chen, Zhe Gan, Yu Cheng, Jingzhou Liu, and Jingjing Liu. 2020. Distilling knowledge learned in BERT for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7893–7905, Online. Association for Computational Linguistics.

Won Ik Cho, Ji Won Kim, Seok Min Kim, and Nam Soo Kim. 2019. On measuring gender bias in translation of gender-neutral pronouns. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pages 173–181, Florence, Italy. Association for Computational Linguistics.

Won Ik Cho, Jiwon Kim, Jaeyeong Yang, and Nam Soo Kim. 2021. Towards cross-lingual generalization of translation gender bias. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 449–457.

Prafulla Kumar Choubey, Anna Currey, Prashant Mathur, and Georgiana Dinu. 2021. Improving gender translation accuracy with filtered self-training. arXiv preprint arXiv:2104.07695.

Marta R. Costa-jussà, Carlos Escolano, Christine Basta, Javier Ferrando, Roser Batlle, and Ksenia Kharitonova. 2020. Gender bias in multilingual neural machine translation: The architecture matters. arXiv preprint arXiv:2012.13176.

Marta R. Costa-jussà and Adrià de Jorge. 2020. Fine-tuning neural machine translation on gender-balanced datasets. In Proceedings of the Second Workshop on Gender Bias in Natural Language Processing, pages 26–34, Barcelona, Spain (Online). Association for Computational Linguistics.

Kate Crawford. 2017. The trouble with bias. Keynote at NeurIPS.

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988, Florence, Italy. Association for Computational Linguistics.

Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. 2019. Plug and play language models: A simple approach to controlled text generation. In International Conference on Learning Representations.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Jwala Dhamala, Tony Sun, Varun Kumar, Satyapriya Krishna, Yada Pruksachatkun, Kai-Wei Chang, and Rahul Gupta. 2021. BOLD: Dataset and metrics for measuring biases in open-ended language generation. Proceedings of FAccT.

Emily Dinan, Angela Fan, Adina Williams, Jack Urbanek, Douwe Kiela, and Jason Weston. 2020a. Queens are powerful too: Mitigating gender bias in dialogue generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8173–8188, Online. Association for Computational Linguistics.

Emily Dinan, Angela Fan, Ledell Wu, Jason Weston, Douwe Kiela, and Adina Williams. 2020b. Multi-dimensional gender bias classification. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 314–331, Online. Association for Computational Linguistics.

Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. 2012. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pages 214–226.

Mostafa Elaraby, Ahmed Y. Tawfik, Mahmoud Khaled, Hany Hassan, and Aly Osama. 2018. Gender aware spoken language translation applied to English-Arabic. In 2018 2nd International Conference on Natural Language and Speech Processing (ICNLSP), pages 1–6. IEEE.

Joel Escudé Font and Marta R. Costa-jussà. 2019. Equalizing gender bias in neural machine translation with word embeddings techniques. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pages 147–154, Florence, Italy. Association for Computational Linguistics.

Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898.

Anna Farkas and Renáta Németh. 2020. How to measure gender bias in machine translation: Optimal translators, multiple reference points. arXiv preprint arXiv:2011.06445.

Xavier Ferrer, Tom van Nuenen, Jose M. Such, and Natalia Criado. 2021. Discovering and categorising language biases in Reddit. In Proceedings of the International AAAI Conference on Web and Social Media, volume 15.

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2018. Datasheets for datasets. arXiv preprint arXiv:1803.09010.

Hila Gonen and Kellie Webster. 2020. Automatically identifying gender issues in machine translation using perturbations. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1991–1995, Online. Association for Computational Linguistics.

Sophie Groenwold, Lily Ou, Aesha Parekh, Samhita Honnavalli, Sharon Levy, Diba Mirza, and William Yang Wang. 2020. Investigating African-American Vernacular English in transformer-based text generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5877–5883, Online. Association for Computational Linguistics.

Nizar Habash, Houda Bouamor, and Christine Chung. 2019. Automatic gender identification and reinflection in Arabic. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pages 155–165, Florence, Italy. Association for Computational Linguistics.

Moritz Hardt, Eric Price, and Nati Srebro. 2016. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems, pages 3315–3323.

Tatsunori Hashimoto, Megha Srivastava, Hongseok Namkoong, and Percy Liang. 2018. Fairness without demographics in repeated loss minimization. In International Conference on Machine Learning, pages 1929–1938. PMLR.

Peter Henderson, Koustuv Sinha, Nicolas Angelard-Gontier, Nan Rosemary Ke, Genevieve Fried, Ryan Lowe, and Joelle Pineau. 2018. Ethical challenges in data-driven dialogue systems. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pages 123–129.

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration. In International Conference on Learning Representations.

Dirk Hovy, Federico Bianchi, and Tommaso Fornaciari. 2020. “You sound just like your father” commercial machine translation systems include stylistic biases. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1686–1690, Online. Association for Computational Linguistics.

Po-Sen Huang, Huan Zhang, Ray Jiang, Robert Stanforth, Johannes Welbl, Jack Rae, Vishal Maini, Dani Yogatama, and Pushmeet Kohli. 2020. Reducing sentiment bias in language models via counterfactual evaluation. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 65–83, Online. Association for Computational Linguistics.

Clayton Hutto and Eric Gilbert. 2014. VADER: A parsimonious rule-based model for sentiment analysis of social media text. In Proceedings of the International AAAI Conference on Web and Social Media, volume 8.

Matthew Kay, Cynthia Matuszek, and Sean A. Munson. 2015. Unequal representation and gender stereotypes in image search results for occupations. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pages 3819–3828.

Svetlana Kiritchenko and Saif Mohammad. 2018. Examining gender and race bias in two hundred sentiment analysis systems. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pages 43–53.

Svetlana Kiritchenko, Isar Nejadgholi, and Kathleen C. Fraser. 2020. Confronting abusive language online: A survey from the ethical and human rights perspective. arXiv preprint arXiv:2012.12305.

Hannah Kirk, Yennie Jun, Haider Iqbal, Elias Benussi, Filippo Volpin, Frederic A. Dreyer, Aleksandar Shtedritski, and Yuki M. Asano. 2021. How true is GPT-2? An empirical analysis of intersectional occupational biases. arXiv preprint arXiv:2102.04130.

Tom Kocmi, Tomasz Limisiewicz, and Gabriel Stanovsky. 2020. Gender coreference and bias evaluation at WMT 2020. In Proceedings of the Fifth Conference on Machine Translation, pages 357–364, Online. Association for Computational Linguistics.

Ben Krause, Akhilesh Deepak Gotmare, Bryan McCann, Nitish Shirish Keskar, Shafiq Joty, Richard Socher, and Nazneen Fatema Rajani. 2020. GeDi: Generative discriminator guided sequence generation. arXiv preprint arXiv:2009.06367.

Sharon Levy, Michael Saxon, and William Yang Wang. 2021. The truth is out there: Investigating conspiracy theories in text generation. In Findings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing.

Alisa Liu, Maarten Sap, Ximing Lu, Swabha Swayamdipta, Chandra Bhagavatula, Noah A. Smith, and Yejin Choi. 2021. On-the-fly controlled text generation with experts and anti-experts. In The Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing.

Haochen Liu, Jamell Dacon, Wenqi Fan, Hui Liu, Zitao Liu, and Jiliang Tang. 2020a. Does gender matter? Towards fairness in dialogue systems. In Proceedings of the 28th International Conference on Computational Linguistics, pages 4403–4416, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Haochen Liu, Wentao Wang, Yiqi Wang, Hui Liu, Zitao Liu, and Jiliang Tang. 2020b. Mitigating gender bias for neural dialogue generation with adversarial learning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 893–903, Online. Association for Computational Linguistics.
