A Survey of Race, Racism, and Anti-Racism in NLP

Anjalie Field                              Su Lin Blodgett
Carnegie Mellon University                 Microsoft Research
anjalief@cs.cmu.edu                        sulin.blodgett@microsoft.com

Zeerak Waseem                              Yulia Tsvetkov
University of Sheffield                    University of Washington
z.w.butt@sheffield.ac.uk                   yuliats@cs.washington.edu

                                     Abstract

Despite inextricable ties between race and language, little work has considered race in NLP research and development. In this work, we survey 79 papers from the ACL Anthology that mention race. These papers reveal various types of race-related bias in all stages of NLP model development, highlighting the need for proactive consideration of how NLP systems can uphold racial hierarchies. However, persistent gaps in research on race and NLP remain: race has been siloed as a niche topic and remains ignored in many NLP tasks; most work operationalizes race as a fixed single-dimensional variable with a ground-truth label, which risks reinforcing differences produced by historical racism; and the voices of historically marginalized people are nearly absent in NLP literature. By identifying where and how NLP literature has and has not considered race, especially in comparison to related fields, our work calls for inclusion and racial justice in NLP research practices.

1   Introduction

Race and language are tied in complicated ways. Raciolinguistics scholars have studied how they are mutually constructed: historically, colonial powers construct linguistic and racial hierarchies to justify violence, and currently, beliefs about the inferiority of racialized people's language practices continue to justify social and economic exclusion (Rosa and Flores, 2017).[1] Furthermore, language is the primary means through which stereotypes and prejudices are communicated and perpetuated (Hamilton and Trolier, 1986; Bar-Tal et al., 2013).

[1] We use racialization to refer to the process of "ascribing and prescribing a racial category or classification to an individual or group of people . . . based on racial attributes including but not limited to cultural and social history, physical features, and skin color" (Charity Hudley, 2017).

However, questions of race and racial bias have been minimally explored in NLP literature. While researchers and activists have increasingly drawn attention to racism in computer science and academia, frequently-cited examples of racial bias in AI are often drawn from disciplines other than NLP, such as computer vision (facial recognition) (Buolamwini and Gebru, 2018) or machine learning (recidivism risk prediction) (Angwin et al., 2016). Even the presence of racial biases in search engines like Google (Sweeney, 2013; Noble, 2018) has prompted little investigation in the ACL community. Work on NLP and race remains sparse, particularly in contrast to concerns about gender bias, which have led to surveys, workshops, and shared tasks (Sun et al., 2019; Webster et al., 2019).

In this work, we conduct a comprehensive survey of how NLP literature and research practices engage with race. We first examine 79 papers from the ACL Anthology that mention the words 'race', 'racial', or 'racism' and highlight examples of how racial biases manifest at all stages of NLP model pipelines (§3). We then describe some of the limitations of current work (§4), specifically showing that NLP research has only examined race in a narrow range of tasks with limited or no social context. Finally, in §5, we revisit the NLP pipeline with a focus on how people generate data, build models, and are affected by deployed systems, and we highlight current failures to engage with people traditionally underrepresented in STEM and academia.

While little work has examined the role of race in NLP specifically, prior work has discussed race in related fields, including human-computer interaction (HCI) (Ogbonnaya-Ogburu et al., 2020; Rankin and Thomas, 2019; Schlesinger et al., 2017), fairness in machine learning (Hanna et al., 2020), and linguistics (Charity Hudley et al., 2020; Motha, 2020). We draw comparisons and guidance from this work and show its relevance to NLP research. Our work differs from NLP-focused related work on gender bias (Sun et al., 2019), 'bias' generally (Blodgett et al., 2020), and the adverse impacts of language models (Bender et al., 2021) in its explicit focus on race and racism.
In surveying research in NLP and related fields, we ultimately find that NLP systems and research practices produce differences along racialized lines. Our work calls for NLP researchers to consider the social hierarchies upheld and exacerbated by NLP research and to shift the field toward "greater inclusion and racial justice" (Charity Hudley et al., 2020).

2   What is race?

It has been widely accepted by social scientists that race is a social construct, meaning it "was brought into existence or shaped by historical events, social forces, political power, and/or colonial conquest" rather than reflecting biological or 'natural' differences (Hanna et al., 2020). More recent work has criticized the "social construction" theory as circular and rooted in academic discourse, and instead referred to race as "colonial constituted practices", including "an inherited western, modern-colonial practice of violence, assemblage, superordination, exploitation and segregation" (Saucier et al., 2016).

The term race is also multi-dimensional and can refer to a variety of different perspectives, including racial identity (how you self-identify), observed race (the race others perceive you to be), and reflected race (the race you believe others perceive you to be) (Roth, 2016; Hanna et al., 2020; Ogbonnaya-Ogburu et al., 2020). Racial categorizations often differ across dimensions and depend on the defined categorization schema. For example, the United States census considers Hispanic an ethnicity, not a race, but surveys suggest that 2/3 of people who identify as Hispanic consider it a part of their racial background.[2] Similarly, the census does not consider 'Jewish' a race, but some NLP work considers anti-Semitism a form of racism (Hasanuzzaman et al., 2017). Race depends on historical and social context—there are no 'ground truth' labels or categories (Roth, 2016).

[2] https://www.census.gov/mso/www/training/pdf/race-ethnicity-onepager.pdf/, https://www.census.gov/topics/population/race/about.html, https://www.pewresearch.org/fact-tank/2015/06/15/is-being-hispanic-a-matter-of-race-ethnicity-or-both/

As the work we survey primarily focuses on the United States, our analysis similarly focuses on the U.S. However, as race and racism are global constructs, some aspects of our analysis are applicable to other contexts. We suggest that future studies on racialization in NLP ground their analysis in the appropriate geo-cultural context, which may result in findings or analyses that differ from our work.

3   Survey of NLP literature on race

3.1   ACL Anthology papers about race

In this section, we introduce our primary survey data—papers from the ACL Anthology[3]—and we describe some of their major findings to emphasize that NLP systems encode racial biases. We searched the anthology for papers containing the terms 'racial', 'racism', or 'race', discarding ones that only mentioned race in the references section or in data examples and adding related papers cited by the initial set if they were also in the ACL Anthology. In using keyword searches, we focus on papers that explicitly mention race and consider papers that use euphemistic terms to not have substantial engagement on this topic. As our focus is on NLP and the ACL community, we do not include NLP-related papers published in other venues in the reported metrics (e.g. Table 1), but we do draw from them throughout our analysis.

[3] The ACL Anthology includes papers from all official ACL venues and some non-ACL events listed in Appendix A; as of December 2020 it included 6,200 papers.
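To make the keyword-screening step above concrete, the sketch below shows one way such a filter could be implemented. It is a minimal illustration under stated assumptions, not the pipeline actually used for this survey: the paper records, the body/references split, and the regular expression are all hypothetical.

    import re

    # Hypothetical paper records: each paper's text split into a body and a
    # references section (how that split is obtained is outside this sketch).
    RACE_TERMS = re.compile(r"\b(race|racial|racism)\b", re.IGNORECASE)

    def mentions_race(paper: dict) -> bool:
        """Keep a paper only if a race keyword appears in its body text,
        not merely in its references section."""
        return bool(RACE_TERMS.search(paper["body"]))

    papers = [
        {"id": "P1", "body": "We study racial bias in toxicity classifiers.",
         "references": "..."},
        {"id": "P2", "body": "A faster transition-based dependency parser.",
         "references": "... racism ..."},
    ]
    candidates = [p["id"] for p in papers if mentions_race(p)]
    print(candidates)  # ['P1']

A keyword screen like this only produces candidates; the manual review described next is what narrows the initial 165 hits to the 79 papers analyzed.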
Our initial search identified 165 papers. However, reviewing all of them revealed that many do not deeply engage on the topic. For example, 37 papers mention 'racism' as a form of abusive language or use 'racist' as an offensive/hate speech label without further engagement. 30 papers only mention race as future work, related work, or motivation, e.g. in a survey about gender bias, "Non-binary genders as well as racial biases have largely been ignored in NLP" (Sun et al., 2019). After discarding these types of papers, our final analysis set consists of 79 papers.[4]

[4] We do not discard all papers about abusive language, only ones that exclusively use racism/racist as a classification label. We retain papers with further engagement, e.g. discussions of how to define racism or identification of racial bias in hate speech classifiers.

Table 1 provides an overview of the 79 papers, manually coded for each paper's primary NLP task and its focal goal or contribution. We determined task/application labels through an iterative process: listing the main focus of each paper and then collapsing similar categories. In cases where papers could rightfully be included in multiple categories, we assign them to the best-matching one based on stated contributions and the percentage of the paper devoted to each possible category. In the Appendix we provide additional categorizations of the papers according to publication year, venue, and racial categories used, as well as the full list of 79 papers.
                                                           Collect  Analyze  Develop  Detect           Survey/
                                                           Corpus   Corpus   Model    Bias     Debias  Position  Total
  Abusive Language                                            6        4        2       5        2        2        21
  Social Science/Social Media                                 2       10        6       1        -        1        20
  Text Representations (LMs, embeddings)                      -        2        -       9        2        -        13
  Text Generation (dialogue, image captions, story gen.)      -        -        1       5        1        1         8
  Sector-specific NLP applications (edu., law, health)        1        2        -       -        1        3         7
  Ethics/Task-independent Bias                                1        -        1       1        1        2         6
  Core NLP Applications (parsing, NLI, IE)                    1        -        1       1        1        -         4
  Total                                                      11       18       11      22        8        9        79

  Table 1: 79 papers on race or racism from the ACL Anthology, categorized by NLP application and focal task.

3.2   NLP systems encode racial bias

Next, we present examples that identify racial bias in NLP models, focusing on 5 parts of a standard NLP pipeline: data, data labels, models, model outputs, and social analyses of outputs. We include papers described in Table 1 and also relevant literature beyond the ACL Anthology (e.g. NeurIPS, PNAS, Science). These examples are not intended to be exhaustive, and in §4 we describe some of the ways that NLP literature has failed to engage with race, but nevertheless, we present them as evidence that NLP systems perpetuate harmful biases along racialized lines.

Data   A substantial amount of prior work has already shown how NLP systems, especially word embeddings and language models, can absorb and amplify social biases in data sets (Bolukbasi et al., 2016; Zhao et al., 2017). While most work focuses on gender bias, some work has made similar observations about racial bias (Rudinger et al., 2017; Garg et al., 2018; Kurita et al., 2019). These studies focus on how training data might describe racial minorities in biased ways, for example, by examining words associated with terms like 'black' or traditionally European/African American names (Caliskan et al., 2017; Manzini et al., 2019). Some studies additionally capture who is described, revealing under-representation in training data, sometimes tangentially to primary research questions: Rudinger et al. (2017) suggest that gender bias may be easier to identify than racial or ethnic bias in Natural Language Inference data sets because of data sparsity, and Caliskan et al. (2017) alter the Implicit Association Test stimuli that they use to measure biases in word embeddings because some African American names were not frequent enough in their corpora.

An equally important consideration, in addition to whom the data describes, is who authored the data. For example, Blodgett et al. (2018) show that parsing systems trained on White Mainstream American English perform poorly on African American English (AAE).[5] In a more general example, Wikipedia has become a popular data source for many NLP tasks. However, surveys suggest that Wikipedia editors are primarily from white-majority countries,[6] and several initiatives have pointed out systemic racial biases in Wikipedia coverage (Adams et al., 2019; Field et al., 2021).[7] Models trained on these data only learn to process the type of text generated by these users, and further, only learn information about the topics these users are interested in. The representativeness of data sets is a well-discussed issue in social-oriented tasks, like inferring public opinion (Olteanu et al., 2019), but this issue is also an important consideration in 'neutral' tasks like parsing (Waseem et al., 2021). The type of data that researchers choose to train their models on does not just affect what data the models perform well for, it affects what people the models work for. NLP researchers cannot assume models will be useful or function for marginalized people unless they are trained on data generated by them.

[5] We note that conceptualizations of AAE and the accompanying terminology for the variety have shifted considerably in the last half century; see King (2020) for an overview.
[6] https://meta.wikimedia.org/wiki/Research:Wikipedia_Editors_Survey_2011_April
[7] https://en.wikipedia.org/wiki/Racial_bias_on_Wikipedia

Data Labels   Although model biases are often blamed on raw data, several of the papers we survey identify biases in the way researchers categorize or obtain data annotations. For example:

   • Annotation schema: Returning to Blodgett et al. (2018), this work defines new parsing standards for formalisms common in AAE, demonstrating how parsing labels themselves were not designed for racialized language varieties.
   • Annotation instructions: Sap et al. (2019) show that annotators are less likely to label tweets using AAE as offensive if they are told the likely language varieties of the tweets. Thus, how annotation schemes are designed (e.g. what contextual information is provided) can impact annotators' decisions, and failing to provide sufficient context can result in racial biases.
   • Annotator selection: Waseem (2016) shows that feminist/anti-racist activists assign different offensive language labels to tweets than Figure Eight workers, demonstrating that annotators' lived experiences affect data annotations.

Models   Some papers have found evidence that model instances or architectures can change the racial biases of outputs produced by the model. Sommerauer and Fokkens (2019) find that the word embedding associations around words like 'race' and 'racial' change not only depending on the model architecture used to train embeddings, but also on the specific model instance used to extract them, perhaps because of differing random seeds. Kiritchenko and Mohammad (2018) examine gender and race biases in 200 sentiment analysis systems submitted to a shared task and find different levels of bias in different systems. As the training data for the shared task was standardized, all models were trained on the same data. However, participants could have used external training data or pre-trained embeddings, so a more detailed investigation of results is needed to ascertain which factors most contribute to disparate performance.

Model Outputs   Several papers focus on model outcomes, and how NLP systems could perpetuate and amplify bias if they are deployed:

   • Classifiers trained on common abusive language data sets are more likely to label tweets containing characteristics of AAE as offensive (Davidson et al., 2019; Sap et al., 2019).
   • Classifiers for abusive language are more likely to label text containing identity terms like 'black' as offensive (Dixon et al., 2018).
   • GPT outputs text with more negative sentiment when prompted with AAE-like inputs (Groenwold et al., 2020).
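Identity-term findings like the second bullet above (Dixon et al., 2018) are typically demonstrated by scoring minimally different sentences that vary only in an identity term. The following is a minimal, hypothetical sketch of such a probe; the templates, term list, and stand-in dummy_score classifier are illustrative assumptions, not artifacts from the surveyed papers.

    from statistics import mean

    TEMPLATES = ["I am a {} person.", "My friend is {}.", "Being {} is part of who I am."]
    IDENTITY_TERMS = ["black", "white", "asian", "latina"]

    def dummy_score(sentence: str) -> float:
        # Toy stand-in for the classifier under audit. It mimics the failure
        # mode reported above: the mere presence of one identity term is
        # treated as evidence of offensiveness.
        return 0.9 if "black" in sentence.lower() else 0.1

    def probe(score_fn=dummy_score) -> dict:
        # Average score per identity term over otherwise-identical templates;
        # large gaps on these neutral sentences signal identity-term bias.
        return {term: mean(score_fn(t.format(term)) for t in TEMPLATES)
                for term in IDENTITY_TERMS}

    print(probe())  # per-term average scores for the audited classifier

In practice, dummy_score would be replaced with the deployed classifier being audited, and the gap between per-term averages (or false positive rates on known-benign text) is the quantity of interest.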
Social Analyses of Outputs   While the examples in this section primarily focus on racial biases in trained NLP systems, other work (e.g. included in 'Social Science/Social Media' in Table 1) uses NLP tools to analyze race in society. Examples include examining how commentators describe football players of different races (Merullo et al., 2019) or how words like 'prejudice' have changed meaning over time (Vylomova et al., 2019).

While differing in goals, this work is often susceptible to the same pitfalls as other NLP tasks. One area requiring particular caution is in the interpretation of results produced by analysis models. For example, while word embeddings have become a common way to measure semantic change or estimate word meanings (Garg et al., 2018), Joseph and Morgan (2020) show that embedding associations do not always correlate with human opinions; in particular, correlations are stronger for beliefs about gender than race. Relatedly, in HCI, the recognition that authors' own biases can affect their interpretations of results has caused some authors to provide self-disclosures (Schlesinger et al., 2017), but this practice is uncommon in NLP.

We conclude this section by observing that when researchers have looked for racial biases in NLP systems, they have usually found them. This literature calls for proactive approaches in considering how data is collected, annotated, used, and interpreted to prevent NLP systems from exacerbating historical racial hierarchies.
4   Limitations in where and how NLP operationalizes race

While §3 demonstrates ways that NLP systems encode racial biases, we next identify gaps and limitations in how these works have examined racism, focusing on how and in what tasks researchers have considered race. We ultimately conclude that prior NLP literature has marginalized research on race and encourage deeper engagement with other fields, critical views of simplified classification schema, and broader application scope in future work (Blodgett et al., 2020; Hanna et al., 2020).

4.1   Common data sets are narrow in scope

The papers we surveyed suggest that research on race in NLP has used a very limited range of data sets, which fails to account for the multi-dimensionality of race and simplifications inherent in classification. We identified 3 common data sources:[8]

   • 9 papers use a set of tweets with inferred probabilistic topic labels based on alignment with U.S. census race/ethnicity groups (or the provided inference model) (Blodgett et al., 2016).
   • 11 papers use lists of names drawn from Sweeney (2013), Caliskan et al. (2017), or Garg et al. (2018). Most commonly, 6 papers use African/European American names from the Word Embedding Association Test (WEAT) (Caliskan et al., 2017), which in turn draws data from Greenwald et al. (1998) and Bertrand and Mullainathan (2004); a minimal sketch of this kind of association test follows this list.
   • 10 papers use explicit keywords like 'Black woman', often placed in templates like "I am a ____" to test if model performance remains the same for different identity terms.

[8] We provide further counts of what racial categories papers use and how they operationalize them in Appendix B.
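For readers unfamiliar with the name-based association tests referenced in the second bullet, the sketch below gives a minimal WEAT-style effect size computation in the spirit of Caliskan et al. (2017). The random vectors and the short name/attribute lists are illustrative placeholders, not the actual WEAT stimuli or trained embeddings.

    import numpy as np

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    def association(w, A, B, emb):
        # Differential association of word w with attribute sets A vs. B.
        return (np.mean([cosine(emb[w], emb[a]) for a in A])
                - np.mean([cosine(emb[w], emb[b]) for b in B]))

    def weat_effect_size(X, Y, A, B, emb):
        # Standardized difference between the mean associations of the two
        # target name sets (a Caliskan et al., 2017-style effect size).
        sx = [association(x, A, B, emb) for x in X]
        sy = [association(y, A, B, emb) for y in Y]
        return (np.mean(sx) - np.mean(sy)) / np.std(sx + sy, ddof=1)

    # Toy illustration: random vectors stand in for trained word embeddings.
    rng = np.random.default_rng(0)
    vocab = ["emily", "greg", "lakisha", "jamal", "joy", "love", "agony", "terrible"]
    emb = {w: rng.normal(size=50) for w in vocab}
    X, Y = ["emily", "greg"], ["lakisha", "jamal"]   # target name sets
    A, B = ["joy", "love"], ["agony", "terrible"]    # pleasant / unpleasant attributes
    print(weat_effect_size(X, Y, A, B, emb))

Because the resulting effect size depends entirely on which names and attribute words are chosen, the provenance concerns about these lists raised in §4.2 apply directly to any conclusions drawn from such a test.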
While these commonly-used data sets can identify performance disparities, they only capture a narrow subset of the multiple dimensions of race (§2). For example, none of them capture self-identified race. While observed race is often appropriate for examining discrimination and some types of disparities, it is impossible to assess potential harms and benefits of NLP systems without assessing their performance over text generated by and directed to people of different races. The corpus from Blodgett et al. (2016) does serve as a starting point and forms the basis of most current work assessing performance gaps in NLP models (Sap et al., 2019; Blodgett et al., 2018; Xia et al., 2020; Xu et al., 2019; Groenwold et al., 2020), but even this corpus is explicitly not intended to infer race.

Furthermore, names and hand-selected identity terms are not sufficient for uncovering model bias. De-Arteaga et al. (2019) show this in examining gender bias in occupation classification: when overt indicators like names and pronouns are scrubbed from the data, performance gaps and potential allocational harms still remain. Names also generalize poorly. While identity terms can be examined across languages (van Miltenburg et al., 2017), differences in naming conventions often do not translate, leading some studies to omit examining racial bias in non-English languages (Lauscher and Glavaš, 2019). Even within English, names often fail to generalize across domains, geographies, and time. For example, names drawn from the U.S. census generalize poorly to Twitter (Wood-Doughty et al., 2018), and names common among Black and white children were not distinctly different prior to the 1970s (Fryer Jr and Levitt, 2004; Sweeney, 2013).

We focus on these 3 data sets as they were most common in the papers we surveyed, but we note that others exist. Preoţiuc-Pietro and Ungar (2018) provide a data set of tweets with self-identified race of their authors, though it is little used in subsequent work and focused on demographic prediction, rather than evaluating model performance gaps. Two recently-released data sets (Nadeem et al., 2020; Nangia et al., 2020) provide crowd-sourced pairs of more- and less-stereotypical text. More work is needed to understand any privacy concerns and the strengths and limitations of these data (Blodgett et al., 2021). Additionally, some papers collect domain-specific data, such as self-reported race in an online community (Loveys et al., 2018), or crowd-sourced annotations of perceived race of football players (Merullo et al., 2019). While these works offer clear contextualization, it is difficult to use these data sets to address other research questions.

4.2   Classification schemes operationalize race as a fixed, single-dimensional U.S.-census label
Work that uses the same few data sets inevitably also uses the same few classification schemes, often without justification. The most common explicitly stated source of racial categories is the U.S. census, which reflects the general trend of U.S.-centrism in NLP research (the vast majority of work we surveyed also focused on English). While census categories are sometimes appropriate, repeated use of classification schemes and accompanying data sets without considering who defined these schemes and whether or not they are appropriate for the current context risks perpetuating the misconception that race is 'natural' across geo-cultural contexts. We refer to Hanna et al. (2020) for a more thorough overview of the harms of "widespread uncritical adoption of racial categories," which "can in turn re-entrench systems of racial stratification which give rise to real health and social inequalities." At best, the way race has been operationalized in NLP research is only capable of examining a narrow subset of potential harms. At worst, it risks reinforcing racism by presenting racial divisions as natural, rather than the product of social and historical context (Bowker and Star, 2000).

As an example of questioning who devised racial categories and for what purpose, we consider the pattern of re-using names from Greenwald et al. (1998), who describe their data as sets of names "judged by introductory psychology students to be more likely to belong to White Americans than to Black Americans" or vice versa. When incorporating this data into WEAT, Caliskan et al. (2017) discard some judged African American names as too infrequent in their embedding data. Work subsequently drawing from WEAT makes no mention of the discarded names nor contains much discussion of how the data was generated and whether or not names judged to be white or Black by introductory psychology students in 1998 are an appropriate benchmark for the studied task. While gathering data to examine race in NLP is challenging, and in this work we ourselves draw from examples that use Greenwald et al. (1998), it is difficult to interpret what implications arise when models exhibit disparities over this data and to what extent models without disparities can be considered 'debiased'.

Finally, almost all of the work we examined conducts single-dimensional analyses, e.g. focus on race or gender but not both simultaneously. This focus contrasts with the concept of intersectionality, which has shown that examining discrimination along a single axis fails to capture the experiences of people who face marginalization along multiple axes. For example, consideration of race often emphasizes the experience of gender-privileged people (e.g. Black men), while consideration of gender emphasizes the experience of race-privileged people (e.g. white women). Neither reflects the experience of people who face discrimination along both axes (e.g. Black women) (Crenshaw, 1989). A small selection of papers have examined intersectional biases in embeddings or word co-occurrences (Herbelot et al., 2012; May et al., 2019; Tan and Celis, 2019; Lepori, 2020), but we did not identify mentions of intersectionality in any other NLP research areas. Further, several of these papers use NLP technology to examine or validate theories on intersectionality; they do not draw from theory on intersectionality to critically examine NLP models. These omissions can mask harms: Jiang and Fellbaum (2020) provide an example using word embeddings of how failing to consider intersectionality can render invisible people marginalized in multiple ways. Numerous directions remain for exploration, such as how 'debiasing' models along one social dimension affects other dimensions. Surveys in HCI offer further frameworks on how to incorporate identity and intersectionality into computational research (Schlesinger et al., 2017; Rankin and Thomas, 2019).

4.3   NLP research on race is restricted to specific tasks and applications

Finally, Table 1 reveals many common NLP applications where race has not been examined, such as machine translation, summarization, or question answering.[9] While some tasks seem inherently more relevant to social context than others (a claim we dispute in this work, particularly in §5), research on race is compartmentalized to limited areas of NLP even in comparison with work on 'bias'. For example, Blodgett et al. (2020) identify 20 papers that examine bias in co-reference resolution systems and 8 in machine translation, whereas we identify 0 papers in either that consider race. Instead, race is most often mentioned in NLP papers in the context of abusive language, and work on detecting or removing bias in NLP models has focused on word embeddings.

[9] We identified only 8 relevant papers on Text Generation, which focus on other areas including chat bots, GPT-2/3, humor generation, and story generation.

Overall, our survey identifies a need for the examination of race in a broader range of NLP tasks, the development of multi-dimensional data sets, and careful consideration of context and appropriateness of racial categories. In general, race is difficult to operationalize, but NLP researchers do not need to start from scratch, and can instead draw from relevant work in other fields.
5   NLP propagates marginalization of racialized people

While in §4 we primarily discuss race as a topic or a construct, in this section, we consider the role, or more pointedly, the absence, of traditionally underrepresented people in NLP research.

5.1   People create data

As discussed in §3.2, data and annotations are generated by people, and failure to consider who created data can lead to harms. In §3.2 we identify a need for diverse training data in order to ensure models work for a diverse set of people, and in §4 we describe a similar need for diversity in data that is used to assess algorithmic fairness. However, gathering this type of data without consideration of the people who generated it can introduce privacy violations and risks of demographic profiling.

As an example, in 2019, partially in response to research showing that facial recognition algorithms perform worse on darker-skinned than lighter-skinned people (Buolamwini and Gebru, 2018; Raji and Buolamwini, 2019), researchers at IBM created the "Diversity in Faces" data set, which consists of 1 million photos sampled from the publicly available YFCC-100M data set and annotated with "craniofacial distances, areas and ratios, facial symmetry and contrast, skin color, age and gender predictions" (Merler et al., 2019). While this data set aimed to improve the fairness of facial recognition technology, it included photos collected from Flickr, a photo-sharing website whose users did not explicitly consent for this use of their photos. Some of these users filed a lawsuit against IBM, in part for "subjecting them to increased surveillance, stalking, identity theft, and other invasions of privacy and fraud."[10] NLP researchers could easily repeat this incident, for example, by using demographic profiling of social media users to create more diverse data sets. While obtaining diverse, representative, real-world data sets is important for building models, data must be collected with consideration for the people who generated it, such as obtaining informed consent, setting limits of uses, and preserving privacy, as well as recognizing that some communities may not want their data used for NLP at all (Paullada, 2020).

[10] https://www.classaction.org/news/class-action-accuses-ibm-of-flagrant-violations-of-illinois-biometric-privacy-law-to-develop-facial-recognition-tech#embedded-document, https://www.nbcnews.com/tech/internet/facial-recognition-s-dirty-little-secret-millions-online-photos-scraped-n981921. IBM has since removed the "Diversity in Faces" data set as well as their "Detect Faces" public API and stopped their use of and research on facial recognition: https://qz.com/1866848/why-ibm-abandoned-its-facial-recognition-program/

5.2   People build models

Research is additionally carried out by people who determine what projects to pursue and how to approach them. While statistics on ACL conferences and publications have focused on geographic representation rather than race, they do highlight under-representation. Out of 2,695 author affiliations associated with papers in the ACL Anthology for 5 major conferences held in 2018, only 5 (0.2%) were from Africa, compared with 1,114 from North America (41.3%).[11] Statistics published for 2017 conference attendees and ACL fellows similarly reveal a much higher percentage of people from "North, Central and South America" (55% attendees / 74% fellows) than from "Europe, Middle East and Africa" (19%/13%) or "Asia-Pacific" (23%/13%).[12] These broad regional categories likely mask further under-representation, e.g. percentage of attendees and fellows from Africa as compared to Europe. According to an NSF report that includes racial statistics rather than nationality, 14% of doctorate degrees in Computer Science awarded by U.S. institutions to U.S. citizens and permanent residents were awarded to Asian students, <4% to Black or African American students, and 0% to American Indian or Alaska Native students (National Center for Science and Engineering Statistics, 2019).[13]

[11] http://www.marekrei.com/blog/geographic-diversity-of-nlp-conferences/
[12] https://www.aclweb.org/portal/content/acl-diversity-statistics
[13] Results exclude respondents who did not report race or ethnicity or were Native Hawaiian or Other Pacific Islander.

It is difficult to envision reducing or eliminating racial differences in NLP systems without changes in the researchers building these systems. One theory that exemplifies this challenge is interest convergence, which suggests that people in positions of power only take action against systematic problems like racism when it also advances their own interests (Bell Jr, 1980). Ogbonnaya-Ogburu et al. (2020) identify instances of interest convergence in the HCI community, primarily in diversity initiatives that benefit institutions' images rather than underrepresented people. In a research setting, interest convergence can encourage studies of incremental and surface-level biases while discouraging research that might be perceived as controversial and force fundamental changes in the field.
Demographic statistics are not sufficient for avoiding pitfalls like interest convergence, as they fail to capture the lived experiences of researchers. Ogbonnaya-Ogburu et al. (2020) provide several examples of challenges that non-white HCI researchers have faced, including the invisible labor of representing 'diversity', everyday microaggressions, and altering their research directions in accordance with their advisors' interests. Rankin and Thomas (2019) further discuss how research conducted by people of different races is perceived differently: "Black women in academia who conduct research about the intersections of race, gender, class, and so on are perceived as 'doing service,' whereas white colleagues who conduct the same research are perceived as doing cutting-edge research that demands attention and recognition." While we draw examples about race from HCI in the absence of published work on these topics in NLP, the lack of linguistic diversity in NLP research similarly demonstrates how representation does not necessarily imply inclusion. Although researchers from various parts of the world (Asia, in particular) do have some numerical representation among ACL authors, attendees, and fellows, NLP research overwhelmingly favors a small set of languages, with a heavy skew towards European languages (Joshi et al., 2020) and 'standard' language varieties (Kumar et al., 2021).

5.3   People use models

Finally, NLP research produces technology that is used by people, and even work without direct applications is typically intended for incorporation into application-based systems. With the recognition that technology ultimately affects people, researchers on ethics in NLP have increasingly called for considerations of whom technology might harm and suggested that there are some NLP technologies that should not be built at all. In the context of perpetuating racism, examples include criticism of tools for predicting demographic information (Tatman, 2020) and automatic prison term prediction (Leins et al., 2020), motivated by the history of using technology to police racial minorities and related criticism in other fields (Browne, 2015; Buolamwini and Gebru, 2018; McIlwain, 2019). In cases where potential harms are less direct, they are often unaddressed entirely. For example, while low-resource NLP is a large area of research, a paper on machine translation of white American and European languages is unlikely to discuss how continual model improvements in these settings increase technological inequality. Little work on low-resource NLP has focused on the realities of structural racism or differences in lived experience and how they might affect the way technology should be designed.

Detection of abusive language offers an informative case study on the danger of failing to consider people affected by technology. Work on abusive language often aims to detect racism for content moderation (Waseem and Hovy, 2016). However, more recent work has shown that existing hate speech classifiers are likely to falsely label text containing identity terms like 'black' or text containing linguistic markers of AAE as toxic (Dixon et al., 2018; Sap et al., 2019; Davidson et al., 2019; Xia et al., 2020). Deploying these models could censor the posts of the very people they purport to help.

In other areas of statistics and machine learning, focus on participatory design has sought to amplify the voices of people affected by technology and its development. An ICML 2020 workshop titled "Participatory Approaches to Machine Learning" highlights a number of papers in this area (Kulynych et al., 2020; Brown et al., 2019). A few related examples exist in NLP, e.g. Gupta et al. (2020) gather data for an interactive dialogue agent intended to provide more accessible information about heart failure to Hispanic/Latinx and African American patients. The authors engage with healthcare providers and doctors, though they leave focus groups with patients for future work. While NLP researchers may not be best situated to examine how people interact with deployed technology, they could instead draw motivation from fields that have stronger histories of participatory design, such as HCI. However, we did not find that citing participatory design studies conducted by others is common practice in the work we surveyed. As in the case of researcher demographics, participatory design is not an end-all solution. Sloane et al. (2020) provide a discussion of how participatory design can collapse to 'participation-washing' and how such work must be context-specific, long-term, and genuine.
6   Discussion

We conclude by synthesizing some of the observations made in the preceding sections into more actionable items. First, NLP research needs to explicitly incorporate race. We quote Benjamin (2019): "[technical systems and social codes] operate within powerful systems of meaning that render some things visible, others invisible, and create a vast array of distortions and dangers."

In the context of NLP research, this philosophy implies that all technology we build works in service of some ideas or relations, either by upholding them or dismantling them. Any research that is not actively combating prevalent social systems like racism risks perpetuating or exacerbating them. Our work identifies several ways in which NLP research upholds racism:

   • Systems contain representational harms and performance gaps throughout NLP pipelines
   • Research on race is restricted to a narrow subset of tasks and definitions of race, which can mask harms and falsely reify race as 'natural'
   • Traditionally underrepresented people are excluded from the research process, both as consumers and producers of technology

Furthermore, while we focus on race, which we note has received substantially less attention than gender, many of the observations in this work hold for social characteristics that have received even less attention in NLP research, such as socioeconomic class, disability, or sexual orientation (Mendelsohn et al., 2020; Hutchinson et al., 2020).

Nevertheless, none of these challenges can be addressed without direct engagement with marginalized communities of color. NLP researchers can draw on precedents for this type of engagement from other fields, such as participatory design and value sensitive design models (Friedman et al., 2013). Additionally, numerous organizations already exist that serve as starting points for partnerships, such as Black in AI, Masakhane, Data for Black Lives, and the Algorithmic Justice League.

Finally, race and language are complicated, and while readers may look for clearer recommendations, no one data set, model, or set of guidelines can 'solve' racism in NLP. For instance, while we draw from linguistics, Charity Hudley et al. (2020) in turn call on linguists to draw models of racial justice from anthropology, sociology, and psychology. Relatedly, there are numerous racialized effects that NLP research can have that we do not address in this work; for example, Bender et al. (2021) and Strubell et al. (2019) discuss the environmental costs of training large language models, and how global warming disproportionately affects marginalized communities. We suggest that readers use our work as one starting point for bringing inclusion and racial justice into NLP.

Acknowledgements

We gratefully thank Hanna Kim, Kartik Goyal, Artidoro Pagnoni, Qinlan Shen, and Michael Miller Yoder for their feedback on this work. Z.W. has been supported in part by the Canada 150 Research Chair program and the UK-Canada Artificial Intelligence Initiative. A.F. has been supported in part by a Google PhD Fellowship and a GRFP under Grant No. DGE1745016. This material is based upon work supported in part by the National Science Foundation under Grants No. IIS2040926 and IIS2007960. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.

7   Ethical Considerations

We, the authors of this work, are situated in the cultural contexts of the United States of America and the United Kingdom/Europe, and some of us identify as people of color. We all identify as NLP researchers, and we acknowledge that we are situated within the traditionally exclusionary practices of academic research. These perspectives have impacted our work, and there are viewpoints outside of our institutions and experiences that our work may not fully represent.

References

Julia Adams, Hannah Brückner, and Cambria Naslund. 2019. Who counts as a notable sociologist on Wikipedia? Gender, race, and the "professor test". Socius, 5.

Silvio Amir, Mark Dredze, and John W. Ayers. 2019. Mental health surveillance over social media with digital cohorts. In Proceedings of the Sixth Workshop on Computational Linguistics and Clinical Psychology, pages 114–120, Minneapolis, Minnesota. Association for Computational Linguistics.
Psychology, pages 114–120, Minneapolis, Minnesota. Association for Computational Linguistics.

Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016. Machine bias: There's software used across the country to predict future criminals and it's biased against blacks. ProPublica.

Stavros Assimakopoulos, Rebecca Vella Muskat, Lonneke van der Plas, and Albert Gatt. 2020. Annotating for hate speech: The MaNeCo corpus and some input from critical discourse analysis. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 5088–5097, Marseille, France. European Language Resources Association.

Daniel Bar-Tal, Carl F Graumann, Arie W Kruglanski, and Wolfgang Stroebe. 2013. Stereotyping and prejudice: Changing conceptions. Springer Science & Business Media.

Francesco Barbieri and Jose Camacho-Collados. 2018. How gender and skin tone modifiers affect emoji semantics in Twitter. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pages 101–106, New Orleans, Louisiana. Association for Computational Linguistics.

Derrick A Bell Jr. 1980. Brown v. Board of Education and the interest-convergence dilemma. Harvard Law Review, pages 518–533.

Emily Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 Conference on Fairness, Accountability, and Transparency, pages 610–623, New York, NY, USA. Association for Computing Machinery.

Ruha Benjamin. 2019. Race After Technology: Abolitionist Tools for the New Jim Code. Wiley.

Shane Bergsma, Mark Dredze, Benjamin Van Durme, Theresa Wilson, and David Yarowsky. 2013. Broadly improving user classification via communication-based name and location clustering on Twitter. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1010–1019, Atlanta, Georgia. Association for Computational Linguistics.

Marianne Bertrand and Sendhil Mullainathan. 2004. Are Emily and Greg more employable than Lakisha and Jamal? A field experiment on labor market discrimination. American Economic Review, 94(4):991–1013.

Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach. 2020. Language (technology) is power: A critical survey of "bias" in NLP. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5454–5476, Online. Association for Computational Linguistics.

Su Lin Blodgett, Lisa Green, and Brendan O'Connor. 2016. Demographic dialectal variation in social media: A case study of African-American English. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1119–1130, Austin, Texas. Association for Computational Linguistics.

Su Lin Blodgett, Gilsinia Lopez, Alexandra Olteanu, Robert Sim, and Hanna Wallach. 2021. Stereotyping Norwegian salmon: An inventory of pitfalls in fairness benchmark datasets. In Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Online. Association for Computational Linguistics.

Su Lin Blodgett, Johnny Wei, and Brendan O'Connor. 2018. Twitter Universal Dependency parsing for African-American and mainstream American English. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1415–1425, Melbourne, Australia. Association for Computational Linguistics.

Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, and Adam Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 4356–4364, Red Hook, NY, USA. Curran Associates Inc.

Rishi Bommasani, Kelly Davis, and Claire Cardie. 2020. Interpreting pretrained contextualized representations via reductions to static embeddings. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4758–4781, Online. Association for Computational Linguistics.

Geoffrey C. Bowker and Susan Leigh Star. 2000. Sorting Things Out: Classification and Its Consequences. Inside Technology. MIT Press.

Anna Brown, Alexandra Chouldechova, Emily Putnam-Hornstein, Andrew Tobin, and Rhema Vaithianathan. 2019. Toward algorithmic accountability in public services: A qualitative study of affected community perspectives on algorithmic decision-making in child welfare services. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI '19, pages 1–12, New York, NY, USA. Association for Computing Machinery.

Simone Browne. 2015. Dark Matters: On the Surveillance of Blackness. Duke University Press.

Joy Buolamwini and Timnit Gebru. 2018. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Proceedings of the 1st Conference on Fairness, Accountability and
Transparency, pages 77–91, New York, NY, USA. PMLR.

Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183–186.

Michael Castelle. 2018. The linguistic ideologies of deep abusive language classification. In Proceedings of the 2nd Workshop on Abusive Language Online (ALW2), pages 160–170, Brussels, Belgium. Association for Computational Linguistics.

Bharathi Raja Chakravarthi. 2020. HopeEDI: A multilingual hope speech detection dataset for equality, diversity, and inclusion. In Proceedings of the Third Workshop on Computational Modeling of People's Opinions, Personality, and Emotion's in Social Media, pages 41–53, Barcelona, Spain (Online). Association for Computational Linguistics.

Anne H. Charity Hudley. 2017. Language and racialization. In Ofelia García, Nelson Flores, and Massimiliano Spotti, editors, The Oxford Handbook of Language and Society, pages 381–402. Oxford University Press.

Anne H. Charity Hudley, Christine Mallinson, and Mary Bucholtz. 2020. Toward racial justice in linguistics: Interdisciplinary insights into theorizing race in the discipline and diversifying the profession. Language, 96(4):e200–e235.

Isobelle Clarke and Jack Grieve. 2017. Dimensions of abusive language on Twitter. In Proceedings of the First Workshop on Abusive Language Online, pages 1–10, Vancouver, BC, Canada. Association for Computational Linguistics.

Kimberlé Crenshaw. 1989. Demarginalizing the intersection of race and sex: A black feminist critique of antidiscrimination doctrine, feminist theory and antiracist politics. University of Chicago Legal Forum, 1989(8).

Thomas Davidson, Debasmita Bhattacharya, and Ingmar Weber. 2019. Racial bias in hate speech and abusive language detection datasets. In Proceedings of the Third Workshop on Abusive Language Online, pages 25–35, Florence, Italy. Association for Computational Linguistics.

Maria De-Arteaga, Alexey Romanov, Hanna Wallach, Jennifer Chayes, Christian Borgs, Alexandra Chouldechova, Sahin Geyik, Krishnaram Kenthapadi, and Adam Tauman Kalai. 2019. Bias in bios: A case study of semantic representation bias in a high-stakes setting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 120–128, New York, NY, USA. Association for Computing Machinery.

Dorottya Demszky, Nikhil Garg, Rob Voigt, James Zou, Jesse Shapiro, Matthew Gentzkow, and Dan Jurafsky. 2019. Analyzing polarization in social media: Method and application to tweets on 21 mass shootings. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2970–3005, Minneapolis, Minnesota. Association for Computational Linguistics.

Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. 2018. Measuring and mitigating unintended bias in text classification. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pages 67–73, New York, NY, USA. Association for Computing Machinery.

Jacob Eisenstein, Noah A. Smith, and Eric P. Xing. 2011. Discovering sociolinguistic associations with structured sparsity. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1365–1374, Portland, Oregon, USA. Association for Computational Linguistics.

Yanai Elazar and Yoav Goldberg. 2018. Adversarial removal of demographic attributes from text data. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 11–21, Brussels, Belgium. Association for Computational Linguistics.

Anjalie Field, Chan Young Park, and Yulia Tsvetkov. 2021. Controlled analyses of social biases in Wikipedia bios. Computing Research Repository, arXiv:2101.00078. Version 1.

Batya Friedman, Peter Kahn, Alan Borning, and Alina Huldtgren. 2013. Value sensitive design and information systems. In Neelke Doorn, Daan Schuurbiers, Ibo van de Poel, and Michael Gorman, editors, Early engagement and new technologies: Opening up the laboratory, volume 16. Springer, Dordrecht.

Roland G Fryer Jr and Steven D Levitt. 2004. The causes and consequences of distinctively black names. The Quarterly Journal of Economics, 119(3):767–805.

Ryan J. Gallagher, Kyle Reing, David Kale, and Greg Ver Steeg. 2017. Anchored correlation explanation: Topic modeling with minimal domain knowledge. Transactions of the Association for Computational Linguistics, 5:529–542.

Nikhil Garg, Londa Schiebinger, Dan Jurafsky, and James Zou. 2018. Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16):E3635–E3644.

Ona de Gibert, Naiara Perez, Aitor García-Pablos, and Montse Cuadros. 2018. Hate speech dataset from a white supremacy forum. In Proceedings of the 2nd Workshop on Abusive Language Online (ALW2), pages 11–20, Brussels, Belgium. Association for Computational Linguistics.
Nabeel Gillani and Roger Levy. 2019. Simple dynamic word embeddings for mapping perceptions in the public sphere. In Proceedings of the Third Workshop on Natural Language Processing and Computational Social Science, pages 94–99, Minneapolis, Minnesota. Association for Computational Linguistics.

Anthony G Greenwald, Debbie E McGhee, and Jordan LK Schwartz. 1998. Measuring individual differences in implicit cognition: The implicit association test. Journal of Personality and Social Psychology, 74(6):1464.

Sophie Groenwold, Lily Ou, Aesha Parekh, Samhita Honnavalli, Sharon Levy, Diba Mirza, and William Yang Wang. 2020. Investigating African-American Vernacular English in transformer-based text generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5877–5883, Online. Association for Computational Linguistics.

Itika Gupta, Barbara Di Eugenio, Devika Salunke, Andrew Boyd, Paula Allen-Meares, Carolyn Dickens, and Olga Garcia. 2020. Heart failure education of African American and Hispanic/Latino patients: Data collection and analysis. In Proceedings of the First Workshop on Natural Language Processing for Medical Conversations, pages 41–46, Online. Association for Computational Linguistics.

David L Hamilton and Tina K Trolier. 1986. Stereotypes and stereotyping: An overview of the cognitive approach. In J. F. Dovidio and S. L. Gaertner, editors, Prejudice, discrimination, and racism, pages 127–163. Academic Press.

Alex Hanna, Emily Denton, Andrew Smart, and Jamila Smith-Loud. 2020. Towards a critical race methodology in algorithmic fairness. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 501–512, New York, NY, USA. Association for Computing Machinery.

Mohammed Hasanuzzaman, Gaël Dias, and Andy Way. 2017. Demographic word embeddings for racism detection on Twitter. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 926–936, Taipei, Taiwan. Asian Federation of Natural Language Processing.

Aurélie Herbelot, Eva von Redecker, and Johanna Müller. 2012. Distributional techniques for philosophical enquiry. In Proceedings of the 6th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pages 45–54, Avignon, France. Association for Computational Linguistics.

Xiaolei Huang, Linzi Xing, Franck Dernoncourt, and Michael J. Paul. 2020. Multilingual Twitter corpus and baselines for evaluating demographic bias in hate speech recognition. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 1440–1448, Marseille, France. European Language Resources Association.

Ben Hutchinson, Vinodkumar Prabhakaran, Emily Denton, Kellie Webster, Yu Zhong, and Stephen Denuyl. 2020. Social biases in NLP models as barriers for persons with disabilities. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5491–5501, Online. Association for Computational Linguistics.

May Jiang and Christiane Fellbaum. 2020. Interdependencies of gender and race in contextualized word embeddings. In Proceedings of the Second Workshop on Gender Bias in Natural Language Processing, pages 17–25, Barcelona, Spain (Online). Association for Computational Linguistics.

Kenneth Joseph and Jonathan Morgan. 2020. When do word embeddings accurately reflect surveys on our beliefs about people? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4392–4415, Online. Association for Computational Linguistics.

Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2020. The state and fate of linguistic diversity and inclusion in the NLP world. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6282–6293, Online. Association for Computational Linguistics.

David Jurgens, Libby Hemphill, and Eshwar Chandrasekharan. 2019. A just and comprehensive strategy for using NLP to address online abuse. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3658–3666, Florence, Italy. Association for Computational Linguistics.

Saket Karve, Lyle Ungar, and João Sedoc. 2019. Conceptor debiasing of word representations evaluated on WEAT. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pages 40–48, Florence, Italy. Association for Computational Linguistics.

Anna Kasunic and Geoff Kaufman. 2018. Learning to listen: Critically considering the role of AI in human storytelling and character creation. In Proceedings of the First Workshop on Storytelling, pages 1–13, New Orleans, Louisiana. Association for Computational Linguistics.

Brendan Kennedy, Xisen Jin, Aida Mostafazadeh Davani, Morteza Dehghani, and Xiang Ren. 2020. Contextualizing hate speech classifiers with post-hoc explanation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5435–5442, Online. Association for Computational Linguistics.