Challenges in Detoxifying Language Models

Johannes Welbl∗  Amelia Glaese∗  Jonathan Uesato∗  Sumanth Dathathri∗
John Mellor∗  Lisa Anne Hendricks  Kirsty Anderson
Pushmeet Kohli  Ben Coppin  Po-Sen Huang∗
DeepMind
{welbl,glamia,juesato,sdathath,johnme,posenhuang}@deepmind.com
∗ Denotes equal contribution.

Abstract

Large language models (LM) generate remarkably fluent text and can be efficiently adapted across NLP tasks. Measuring and guaranteeing the quality of generated text in terms of safety is imperative for deploying LMs in the real world; to this end, prior work often relies on automatic evaluation of LM toxicity. We critically discuss this approach, evaluate several toxicity mitigation strategies with respect to both automatic and human evaluation, and analyze consequences of toxicity mitigation in terms of model bias and LM quality. We demonstrate that while basic intervention strategies can effectively optimize previously established automatic metrics on the RealToxicityPrompts dataset, this comes at the cost of reduced LM coverage for both texts about, and dialects of, marginalized groups. Additionally, we find that human raters often disagree with high automatic toxicity scores after strong toxicity reduction interventions—highlighting further the nuances involved in careful evaluation of LM toxicity.

Figure 1: Unintended side effect of automatic toxicity reduction methods: over-filtering of text about marginalized groups reduces the ability of the LM to generate text about these groups, even in a positive way.

1 Introduction

Contemporary text generation models (Radford et al., 2019; Brown et al., 2020) are capable of generating harmful language, including hate speech, insults, profanities and threats (Gehman et al., 2020). These harms are often grouped under the umbrella term "toxicity".1

To enable safe language model (LM) use and deployment, it is necessary to measure toxic text generation in LMs, understand its origins, and undertake effective steps to mitigate it. Prior work has considered various approaches towards reducing LM toxicity: fine-tuning a pre-trained LM (Gehman et al., 2020; Gururangan et al., 2020), steering a model's generation towards text less likely to be classified as toxic (Dathathri et al., 2020; Krause et al., 2021; Schick et al., 2021), or direct test-time filtering (Xu et al., 2021). Recently, Gehman et al. (2020) introduced automatic metrics for LM toxicity evaluation based on toxicity scores of the widely used and commercially deployed Perspective API model, trained on online comments annotated for toxicity.2

In this paper, we critically discuss both toxicity evaluation and mitigation for contemporary transformer-based English LMs. We conduct studies with both human annotation and classifier-based evaluation to assess the effectiveness of different toxicity mitigation methods, and investigate trade-offs with respect to LM quality and social bias. Our contributions are as follows:

1. We critically discuss LM toxicity evaluation (§3) and conduct evaluation studies for several mitigation methods (§4), relying both on automatic toxicity scores (§5) and on human judgement (§6).

2. We show that combinations of simple methods (§4) are very effective in optimizing (automatic) toxicity metrics (§5), but prone to overfilter texts related to marginalized groups (§8).

1 Although broad, this term typically does not capture less obvious, but no less important harms—such as subtle or distributional biases (Sap et al., 2019b; Sheng et al., 2019; Huang et al., 2020; Abid et al., 2021).
2 Perspective API was developed by Jigsaw (https://perspectiveapi.com).
3. We find increased disagreement of high automatic toxicity scores with human annotators once strong toxicity reduction measures are applied, limiting their usefulness as a metric for further mitigation of toxicity (§6).

4. We show that a reduction in (automatic) toxicity scores comes at a cost. We identify both a trade-off with LM evaluation loss (§7), and further show that this disproportionately affects texts about and by marginalized groups (§8): both topic-related and dialect-related LM biases increase, as illustrated in Figure 1.

2 Related Work

While detecting hate speech and offensive language (Warner and Hirschberg, 2012; Kwok and Wang, 2013; Davidson et al., 2017; Zampieri et al., 2019), mostly in the context of online community moderation, has long been a subject of research, the study of toxic text generated by language models is a more recent direction. Wallace et al. (2019) first demonstrated that synthetic text prompts can cause racist model continuations with GPT-2. Gehman et al. (2020) extended the analysis of LM toxicity to non-synthetic prompts, further investigating the effectiveness of multiple potential mitigation approaches. We build on and extend this work, critically discussing previously introduced metrics to assess LM toxicity, and comparing classifier-based LM toxicity scoring with human evaluation.

Among the most promising approaches for LM toxicity reduction is steering generation towards text less likely to be classified as toxic (Dathathri et al., 2020; Krause et al., 2021). This typically relies on an external toxicity classifier, although Schick et al. (2021) show that even a LM's own toxicity self-diagnosis can be used to this end.

Toxic language detection systems are known to be biased against specific social groups, and similar to Zhou et al. (2021), we distinguish two bias types. First, classification bias can manifest as topic-related biases, where text mentioning particular identities leads to false positives in toxicity classifiers—e.g. LGBTQ+ identity terms ("gay"). This phenomenon has been linked to an increased relative prevalence of identity terms among toxic samples (Waseem and Hovy, 2016; Dixon et al., 2018; Park et al., 2018). A second type of bias concerns disparate performance across dialects, where classifiers on average assign higher toxicity scores e.g. to African-American English (AAE) (Davidson et al., 2019; Sap et al., 2019a). A potential side-effect of applying classifier-based toxicity mitigation methods in an LM context, then, is that such biases might also be inherited by the resulting model.

Our findings are consistent with contemporary work by Xu et al. (2021) demonstrating that LM toxicity mitigations can amplify social biases. Our work expands these results across a broader range of models, demographics, and datasets, and uses Wikipedia metadata (Dhamala et al., 2021) rather than keyword-matching for measuring topic-related biases. We also show that models which perform well under our and their likelihood-based metrics can still exacerbate bias. Finally, by upsampling toxic samples, we can estimate overall LM toxicity, whereas a comparison-based approach can emphasize minor changes to already non-toxic LM completions.

Other work on toxicity in generated text includes Xu et al. (2020), who investigate safety specifically in a dialogue setting, and work translating existing offensive text into non-offensive variants (Nogueira dos Santos et al., 2018; Laugier et al., 2021).

3 Toxic Language and LMs

Toxicity Following the definition developed by Perspective API, we consider an utterance to be toxic if it is rude, disrespectful, or unreasonable language that is likely to make someone leave a discussion. This definition has been adopted by prior work on LM toxicity (Gehman et al., 2020), and allows for direct comparability of quantitative results. However, we note two important caveats.

First, under this definition, toxicity judgements are subjective, and depend on both the raters evaluating toxicity and their cultural background (Thomas, 1983), as well as the inferred context. As an example, historical inequalities could lead to a higher toleration of offensive speech among disadvantaged groups, and measurements of toxicity should consider such potential disparities. Phenomena where subjective toxicity ratings can differ include sarcasm and utterances of political discontent; we show some example utterances in Table 12 in the appendix. While not the focus of this paper, it is important for future work to continue to develop the above definition, and clarify how it can be fairly applied in different contexts.
Second, this notion of toxicity only covers one aspect of possible LM harms (Bender et al., 2021). For example, LMs can perpetuate harmful stereotypes, or display biases which only manifest statistically over many samples (Sheng et al., 2019; Huang et al., 2020; Abid et al., 2021). Though important, we do not address these here.

LM safety criteria are both application- and audience-specific, and in this regard, we recommend caution in over-generalizing results from our work, particularly regarding the absolute and relative efficacy of specific techniques. These caveats are consistent with the limitations our experiments highlight: regarding the relationship between human and automatic toxicity evaluation (Section 6), and the trade-offs between toxicity mitigation and coverage for marginalized groups (Section 8).

Evaluating LM Toxicity In this work, we consider both automatic and human evaluation to measure a LM's tendency to produce toxic language. Automatic evaluation can give a first, low-cost indication of toxicity and is useful for particular types of research, such as narrowly focused steering methods (Dathathri et al., 2020; Krause et al., 2021). However, we ultimately care about the impacts of LMs on people, so the benefits of toxicity reduction must ultimately be defined by human judgement. An important consideration for human evaluation is that the annotation process itself can impose an emotional burden on annotators exposed to toxic content (Dang et al., 2018; Steiger et al., 2021). In Section 10.1 we discuss our strategies to ensure the annotators' well-being.

4 Model and Methods

We next describe the LM we evaluate, as well as three methods we consider for reducing the LM's toxicity, covering data-based, controllable-generation, and direct filtering-based approaches.

Our standard LM is a TransformerXL model (Dai et al., 2019) trained on the C4 dataset (Raffel et al., 2020), with 24 layers, 16 heads, d_model = 2048, and d_ff = 8192. The model contains 1.4B parameters, and achieves a loss-per-token of 2.40 on the C4 validation set. It uses a 32,000 subword vocabulary with a SentencePiece tokenizer (Kudo and Richardson, 2018). We train all LM variants on 128 Google Cloud TPUv3 cores using the Adam optimizer and a batch size of 256, for a total of 3 × 10^5 training steps—about 5 days. For all sampling we use nucleus sampling (Holtzman et al., 2020), with top-p = 0.9.
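As a concrete illustration of this decoding scheme, the following is a minimal NumPy sketch of top-p (nucleus) sampling; it is an illustration rather than the implementation used in this work, and assumes `probs` is the LM's next-token distribution.

    import numpy as np

    def nucleus_sample(probs, top_p=0.9, rng=None):
        """Sample a token id from the smallest set of tokens whose
        cumulative probability reaches top_p (Holtzman et al., 2020)."""
        rng = rng or np.random.default_rng()
        order = np.argsort(probs)[::-1]               # token ids, most probable first
        cumulative = np.cumsum(probs[order])
        cutoff = int(np.searchsorted(cumulative, top_p)) + 1  # minimal prefix reaching top_p
        kept = order[:cutoff]
        kept_probs = probs[kept] / probs[kept].sum()  # renormalize within the nucleus
        return int(rng.choice(kept, p=kept_probs))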
4.1 LM Toxicity Reduction Techniques

Training Set Filtering In this intervention, we train LMs on different versions of the C4 corpus, filtered for toxicity according to Perspective API scores. We denote these subsets as train-filter@X, indicating that documents with toxicity scores above X are removed—lower values of X denote stronger filtering.3 We choose 0.2, 0.1, and 0.05 as thresholds for filtering the training data, after which 311M (85%), 209M (57%), and 78M (22%) of the original C4 training documents remain. We did not see indications of overfitting on these smaller datasets.

3 Using BERT (cf. Decoder Filtering) to filter the training data is another possible setup. We use Perspective API as it most closely matches the target in automatic evaluation.
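The construction of train-filter@X is conceptually a one-pass filter over the corpus. The sketch below illustrates the selection rule, with `toxicity_score` as a hypothetical stand-in for a per-document Perspective API call; this is not the actual data pipeline.

    def filter_training_corpus(documents, toxicity_score, threshold=0.05):
        """Yield documents whose toxicity score does not exceed the threshold
        (train-filter@X); lower thresholds remove more of the corpus, e.g.
        only 22% of C4 documents remain at threshold 0.05."""
        for doc in documents:
            if toxicity_score(doc) <= threshold:
                yield doc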
Decoder / Test-Time Filtering We also consider filtering LM outputs directly at decoding / test time, and denote this baseline as test-filter. To avoid using Perspective API for both filtering and evaluation, we filter with a separate BERT-based toxicity classifier (Devlin et al., 2019; denoted as BERT in this work), which is finetuned for 1 epoch with a learning rate of 2 × 10^-5 on the CivilComments dataset (Borkan et al., 2019), using 16 Google Cloud TPUv3 cores. Following Wulczyn et al. (2017), we use soft labels, based on the fraction of annotators rating each comment as toxic, and a cross-entropy training objective. The classifier achieves an accuracy of 96.8% on the validation set.
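The soft-label objective can be written directly as a binary cross-entropy against annotator fractions. Below is a minimal PyTorch sketch; the finetuned BERT encoder producing `logits` is omitted, and the names are illustrative.

    import torch
    import torch.nn.functional as F

    def soft_label_toxicity_loss(logits, toxic_fraction):
        """Cross-entropy against soft labels: the target for each comment is
        the fraction of annotators who rated it toxic, following
        Wulczyn et al. (2017). Fractional targets are supported directly."""
        return F.binary_cross_entropy_with_logits(logits, toxic_fraction)

    # Example: three comments rated toxic by 0%, 40%, and 100% of annotators.
    loss = soft_label_toxicity_loss(torch.tensor([-2.0, 0.1, 3.0]),
                                    torch.tensor([0.0, 0.4, 1.0]))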
We first generate up to K samples from the LM, stopping generation when a sample with a BERT toxicity score below τ_reject = 0.01 is found.4 If we do not obtain such a continuation with a low BERT toxicity score (lower scores are better), we return the sample with the lowest BERT toxicity score.

4 For computational reasons, we use K = 4 throughout.
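In effect, test-filter is a bounded rejection-sampling loop with a fallback. The sketch below captures the described procedure under hypothetical `generate` and `bert_toxicity` callables; it is not the exact implementation.

    def test_time_filter(generate, bert_toxicity, k=4, tau_reject=0.01):
        """Return the first of up to K samples scoring below tau_reject;
        otherwise fall back to the least toxic sample seen."""
        best_sample, best_score = None, float("inf")
        for _ in range(k):
            sample = generate()                # one continuation from the LM
            score = bert_toxicity(sample)      # lower is better
            if score < tau_reject:
                return sample
            if score < best_score:
                best_sample, best_score = sample, score
        return best_sample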
Plug-and-Play Language Models (PPLM) We also evaluate PPLM (Dathathri et al., 2020), which was the strongest decoding-based method in Gehman et al. (2020). Given the hidden representations from a base LM, PPLM uses an additional linear discriminator trained to predict toxicity. When trained on top of our standard LM, this discriminator achieves a test F1 score of 0.78. PPLM uses the discriminator to steer the LM's hidden representations towards a direction of both low predicted toxicity and low KL-divergence from the original LM prediction. PPLM hyperparameters are tuned similarly to Madotto et al. (2020), and we refer to Appendix A.2 for additional details.
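A simplified sketch of this steering update is given below. For clarity it perturbs a single hidden state, whereas Dathathri et al. (2020) perturb the transformer's key-value history; the module names (`lm_head`, `discriminator`) are illustrative assumptions, not our actual code.

    import torch
    import torch.nn.functional as F

    def pplm_step(hidden, lm_head, discriminator,
                  step_size=0.02, kl_weight=0.01, n_iters=3):
        """Nudge `hidden` so the linear discriminator predicts low toxicity,
        while a KL term keeps the next-token distribution close to the
        unperturbed LM prediction."""
        with torch.no_grad():
            reference = F.softmax(lm_head(hidden), dim=-1)   # original distribution
        delta = torch.zeros_like(hidden, requires_grad=True)
        for _ in range(n_iters):
            perturbed = hidden + delta
            toxicity_logit = discriminator(perturbed)
            # Push the predicted toxicity towards the non-toxic label (0).
            attr_loss = F.binary_cross_entropy_with_logits(
                toxicity_logit, torch.zeros_like(toxicity_logit))
            log_probs = F.log_softmax(lm_head(perturbed), dim=-1)
            kl = F.kl_div(log_probs, reference, reduction="batchmean")
            (grad,) = torch.autograd.grad(attr_loss + kl_weight * kl, delta)
            with torch.no_grad():
                delta = delta - step_size * grad             # gradient step on the perturbation
            delta.requires_grad_(True)
        return (hidden + delta).detach()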
                                                   Expected Maximum Toxicity          Probability of Toxicity
    Category         Model                           Unprompted  Toxic  Non-Toxic     Unprompted  Toxic  Non-Toxic
    Baselines        GPT-2†                              0.44    0.75     0.51            0.33    0.88     0.48
                     GPT-2 + PPLM†                       0.28    0.52     0.32            0.05    0.49     0.17
                     standard (C4)                       0.35    0.72     0.47            0.16    0.87     0.44
    Train filtering  train-filter@0.2                    0.30    0.58     0.40            0.09    0.63     0.28
                     train-filter@0.1                    0.32    0.55     0.36            0.11    0.56     0.20
                     train-filter@0.05                   0.24    0.47     0.33            0.04    0.41     0.17
    Decoder          standard + test-filter              0.21    0.42     0.25            0.01    0.31     0.05
                     train-filter@0.2 + test-filter      0.19    0.35     0.23            0.01    0.16     0.02
                     train-filter@0.1 + test-filter      0.19    0.33     0.22            0.01    0.13     0.02
                     train-filter@0.05 + test-filter     0.17    0.28     0.20            0.01    0.08     0.01
    PPLM +           standard (C4)                       0.26    0.66     0.37            0.05    0.76     0.25
                     standard + test-filter              0.18    0.38     0.22            0.01    0.23     0.03
                     train-filter@0.05                   0.15    0.43     0.27            0.01    0.37     0.09
                     train-filter@0.05 + test-filter     0.11    0.25     0.18            0.00    0.08     0.01

Table 1: Left: Expected Maximum Toxicity over 25 generations. Right: Probability of generating toxic text at least once over 25 generations. The best performing detoxification method, yielding the lowest toxicity per category, is marked in bold in the original typeset table. All models are evaluated on a full dataset of 100K prompts and 100K unprompted sentences, except PPLM, which is evaluated on 10K prompted and 10K unprompted continuations due to computational budget. Results marked with † are taken from Gehman et al. (2020).

5 Classifier-Based Toxicity Evaluation

Although our primary targets are based on human evaluation of LM toxicity, described in Section 6, we first describe our evaluation using automatic toxicity metrics, for consistency with prior work. We note that several limitations of automated toxicity-detection tools have been well documented, both by Jigsaw and by other work (Sap et al., 2019a; Gehman et al., 2020).

For automated, classifier-based toxicity evaluation we rely on the RealToxicityPrompts (RTP) benchmark (Gehman et al., 2020). The aim is to measure LM toxicity within a 20-token continuation, in both the prompt-conditional and unconditional settings. For the conditional case, RTP consists of 100K English web language prompts, with each prompt labelled as either toxic or non-toxic. The RTP metrics are derived from the Perspective API toxicity classifier, which outputs a calibrated TOXICITY score between 0 and 1.5

5 It is worth noting that the TOXICITY scores provided by Perspective API are calibrated and intended to reflect the probability of the given text being toxic. That is, a score of 0.7 does not indicate that the toxicity of a sample is more severe than that of text scored 0.5; rather, the classifier is more certain in its prediction for the former case, and less certain for the latter.

Given these scores, RTP reports two metrics: i) Expected Maximum Toxicity measures the maximum toxicity score over 25 continuations for a given prompt, averaged across prompts; ii) Probability of Toxicity measures how frequently at least one continuation has a toxicity score > 0.5, given 25 LM-generated continuations per prompt.
5.1 Automatic Evaluation Results

Table 1 shows results for the three different toxicity mitigation approaches, and combinations of them, alongside baselines including the strongest prior method as reported by Gehman et al. (2020).

First, we observe slightly reduced toxicity rates in the standard model trained on C4, compared to GPT-2 (e.g. 0.16 vs. 0.33 unprompted Probability of Toxicity). This aligns with the overall higher proportion of toxic documents (score ≥ 0.5) in the GPT-2 training corpus, which Gehman et al. (2020) report at 4.3%, compared to 0.6% for C4.6 Filtering the C4 training set based on classifier toxicity scores leads to further reduced LM toxicity scores, which also tend to be lower with stronger data filters. This confirms that toxic training data directly affects the resulting LM's rate of toxicity.

Decoder filtering and PPLM are both highly effective at reducing the automatic toxicity metrics, across all generation settings.

6 C4 has been filtered based on a keyword list that includes insults, vulgar terms and slurs, but such keyword-based filtering also excludes non-toxic uses of some of these terms, and this can potentially affect the coverage of the resulting LMs.
The different methods yield complementary improvements: e.g. decoder filtering further improves the already reduced scores obtained via train filtering alone; PPLM—when combined with these methods—results in the largest reductions in toxicity overall.

As a central takeaway, the three detoxification methods and their combinations can effectively optimize automatic toxicity evaluation metrics. In relative terms, the reduction over the previously reported state of the art (Gehman et al., 2020) is 6-fold and 17-fold in the toxic-prompt and non-toxic-prompt settings, and Probability of Toxicity drops to 0.00 (from 0.05) in the unprompted setting. Given how low these scores are in absolute terms (e.g. Probability of Toxicity scores of 0.00 and 0.01 in the unprompted and non-toxic prompt settings), the question arises to what extent improvements here are still meaningful, especially since they are derived from an imperfect automatic classification system. We thus turn to a human evaluation study in Section 6.

5.2 Limitations and Recommendations

We next highlight shortcomings in the automated toxicity evaluation protocol used above, and provide suggestions for improvement.

First, we observed that sampling only 20 tokens, as was done in prior work (Gehman et al., 2020), can provide insufficient context to form a toxicity judgement. Second, a hard truncation after a fixed number of word-piece tokens can truncate words at the sequence end (e.g. "ass"), which can erroneously trigger automatic toxicity classifiers. In Table 6 (appendix), we thus provide analogous automated toxicity evaluation results when using longer text samples and truncating incomplete sentences at the end of each sample—with overall similar observations. In our subsequent human evaluation, we use the same setup to avoid the above issues, and observed that with longer text continuations, the agreement between automatic scores and human ratings tends to increase (Figure 6, appendix).

Finally, we point out that toxicity classifiers such as Perspective API, when applied to LM output, operate outside their training domain and intended use case, which consists of annotated forum or discussion comments.

6 Evaluation via Human Annotation

Following the previous section on automated LM toxicity evaluation, we next measure toxicity and LM generation quality using human evaluation.

Figure 2: Average human toxicity scores vs. Perspective API scores for the different methods we evaluate. (Methods shown: standard LM; train-filter@0.2/0.1/0.05; test-filter; PPLM; train-filter@0.05+test-filter; train-filter@0.05+test-filter+PPLM.)

Methodology We use aggregated human judgement to measure the quality of the generated text and the extent of toxicity present. For the human toxicity evaluation we rely on previous annotation instructions by Perspective API,7 but we adapt them slightly for the context of LM generation, including additional questions on comprehensibility, consistency, and grammaticality. For each of the LMs under consideration, we show both a prompt from the RealToxicityPrompts dataset and the corresponding continuation generated by the LM to three separate annotators. We then ask the annotators to judge whether the continuation adds to the toxicity present in the prompt, with one of the following labels: VERY TOXIC, TOXIC, NOT SURE, NOT TOXIC, matching the annotation labels used by Perspective API. We further ask the annotators to rate whether the sentences are i) grammatical, ii) comprehensible, and iii) consistent in terms of topicality and style, with the labels YES, SOMEWHAT, NO. Here, we wish to address the following questions: i) how effective are toxicity reduction techniques based on human ratings? ii) how do automated evaluations align with human evaluation? and iii) what qualitative impacts are there on the language generated?

As most Perspective API scores for detoxified LMs are relatively small, random sampling leads to very few samples with high scores, and we would not be able to compare different toxicity ranges efficiently. Hence, we up-sample continuations with high classifier-based toxicity scores when selecting texts to present to annotators. In total, we prepare 300 samples for each setting. From a pool of 49 annotators overall, each sample is rated by at least 3 annotators.

7 https://github.com/conversationai/conversationai.github.io/blob/8a88f1fc0a/crowdsourcing_annotation_schemes/toxicity_with_subattributes.md
Figure 3: Human rating distributions vs. Perspective API scores for the standard LM. Bars are labelled with the number of human ratings in each bin. (y-axis: percent rated by humans with each toxicity level, from very_toxic to not_toxic; x-axis: Perspective API score.)

We then discard NOT SURE annotations, map NOT TOXIC to 0.0 and both TOXIC and VERY TOXIC to 1.0, and take the average.8 We weigh the annotations to compensate for up-sampling. Detailed human annotation instructions, and a full description of the up-sampling setup, are given in Appendix E.

8 We acknowledge that other aggregation options are possible, e.g. whether any annotator rates a sample as toxic.
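The aggregation just described amounts to a per-sample average over mapped labels, followed by a weighted corpus-level mean. A small illustrative sketch, where the up-sampling weights are hypothetical inputs (e.g. inverse up-sampling probabilities):

    import numpy as np

    LABEL_VALUE = {"not_toxic": 0.0, "toxic": 1.0, "very_toxic": 1.0}  # "not_sure" is discarded

    def human_toxicity_score(labels):
        """Aggregate one sample's annotations: drop NOT SURE, map the rest
        to {0, 1}, and average; returns None if no usable labels remain."""
        values = [LABEL_VALUE[l] for l in labels if l != "not_sure"]
        return float(np.mean(values)) if values else None

    def corpus_score(sample_scores, weights):
        """Weighted mean over samples, compensating for up-sampling."""
        return float(np.average(sample_scores, weights=weights))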
Results In Figure 2 we present the overall average toxicity scores from human annotations vs. those of Perspective API. A central observation is that the various LM toxicity reduction methods indeed result in improvements in toxicity ratings according to human judgement, and there is furthermore a direct and largely monotonic relation between average human and classifier-based results. Next, in Figure 3, we show the alignment of Perspective API scores with human ratings for samples of the standard LM. As expected (cf. footnote 5), the scores are correlated with the probability that humans mark a sample toxic.

Annotation Quality Measuring agreement between raters, we find a Krippendorff's alpha of 0.49 for the standard LM, and of 0.48 for all annotations across LMs. To calculate these, we map the NOT TOXIC label to 0.0, NOT SURE to 0.5, and TOXIC and VERY TOXIC to 1.0, using absolute differences between these values as the distance function. Overall, very few cases were labeled as NOT SURE (about 1%). The score indicates fair overall agreement, and is comparable to the level of agreement reported in prior work (Ross et al., 2016; Wulczyn et al., 2017). We note that toxicity rating has subjective aspects, and even with improved definitions, experts may disagree—for a concrete list of phenomena for which we observed annotator disagreement we defer to Appendix E.3.
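For concreteness, agreement under this label mapping can be computed with the third-party `krippendorff` package (an assumed dependency, not necessarily what we used); note that the package's interval setting uses squared rather than absolute differences, so this sketch only approximates the described distance function.

    import numpy as np
    import krippendorff  # pip package "krippendorff"; assumed dependency

    VALUE = {"not_toxic": 0.0, "not_sure": 0.5, "toxic": 1.0, "very_toxic": 1.0}

    # Rows are raters, columns are samples; np.nan marks missing ratings.
    ratings = np.array([
        [VALUE["not_toxic"], VALUE["toxic"],    VALUE["very_toxic"]],
        [VALUE["not_toxic"], VALUE["not_sure"], VALUE["toxic"]],
        [np.nan,             VALUE["toxic"],    VALUE["very_toxic"]],
    ])

    alpha = krippendorff.alpha(reliability_data=ratings,
                               level_of_measurement="interval")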
Figure 4: False positive analysis: avg. Perspective API vs. human score, with std. error, for annotated samples where the continuation toxicity (Persp.) is > 0.75. Note that annotated samples will differ from the overall RTP distribution due to the upsampling procedure described in the Methodology part of Section 6.

False Positives Notably, in the higher toxicity score range we find that the human and Perspective API scores differ substantially after LM detoxification. Figure 4 shows the average Perspective API vs. average human scores for LM-generated continuations that have a Perspective API score > 0.75. Human annotations indicate that far fewer samples are toxic than the automatic score might suggest, and this effect is stronger as intervention strength increases, or when multiple methods are combined. That is, after the application of strong toxicity reduction measures, the majority of samples predicted as likely toxic are false positives. Several such examples are shown in Tables 13 and 14 in the appendix.

Manual inspection reveals that identity term mentions are disproportionately frequent false positives. For example, we observe that 30.2% of the train-filter@0.05 LM generations with a toxicity score above 0.5 mention the word gay, when generating continuations based on RealToxicityPrompts prompts (see Appendix G.1 for additional analysis). A reliance on automatic metrics alone, like those used by Gehman et al. (2020), could thus lead to potentially misleading interpretations. As we will see in the following Sections 7 and 8, detoxification measures can result in a higher LM loss and amplified social biases. It is unclear whether further reductions in the fraction of generated samples with high automatic scores would in fact further lower toxicity as judged by human annotators, or instead only exacerbate the problems incurred by applying detoxification measures, without providing meaningful reductions in LM toxicity.
7 Consequences on LM Quality

To understand the consequences of applying LM toxicity interventions, and their potential impact on text generation, we next consider their effect on LM loss, text sample quality, and LM toxicity prediction ability.

Effect on Language Modeling Loss Table 2 shows validation losses for several train-filtered models. The first observation is that training set filtering has a moderate negative impact on LM loss, which increases with stronger filtering. The train-filter@0.05 model loss roughly matches the LM loss of a 417M parameter model (about a third the size) trained on C4 without any interventions. Evaluation on the LAMBADA dataset (Paperno et al., 2016) confirms this trend, with an accuracy decrease from 50.1% to 34.9% for train-filter@0.05 (Table 7, appendix). To shed more light on the origins of the deteriorated LM performance, we note that the LM loss increase is particularly strong for text labeled as toxic by Perspective API. For example, the loss on evaluation documents least likely to be toxic (score < 0.1) increases by 0.17 (+7%) with the train-filter@0.05 intervention, whereas it increases by 0.9 (+34%) for the evaluation documents most likely to be toxic (score ≥ 0.5).

    Model               C4     low    mid    high   WT103
    standard 1.4B       2.37   2.30   2.43   2.62   2.87
    train-filter@0.2    2.42   2.33   2.49   3.16   2.93
    train-filter@0.1    2.48   2.32   2.59   3.28   2.97
    train-filter@0.05   2.66   2.47   2.80   3.52   3.14
    standard 417M       2.62   2.55   2.68   2.91   3.19

Table 2: Evaluation loss for standard and train-filtered LMs, across different test sets. Low / mid / high correspond to the [0, 0.1), [0.1, 0.5), and [0.5, 1] toxicity bins of C4. WT103: WikiText103 (Merity et al., 2017).

Text Quality We do not observe any strong differences between the toxicity reduction interventions and the standard LM in how comprehensible, how grammatical, and how consistent with the prompt the generated continuations are: differences to the standard LM are no larger than 1%, 4%, and 1%, respectively (Table 10, appendix).

Effect on LM's Ability to Detect Toxicity When training on a toxicity-filtered LM corpus (threshold 0.05), we notice a modest drop in the F1 score (to 0.73; -0.05 points) of the PPLM toxicity classifier, which is trained on the LM's representations. This could potentially negatively impact self-debiasing strategies (Schick et al., 2020).

8 Social Bias Amplification

Fairness with respect to all identity groups is crucial if LMs are to be used in the real world. Two properties that we highlight as necessary (but insufficient) for fairness are that LMs should be able to model both text about topics related to different identity groups (i.e. topic coverage), and text by people from different identity groups and with different dialects (i.e. dialect coverage).

Previous work has shown that toxicity classifiers often show lower performance for text written by, or referring to, marginalized identity groups (Sap et al., 2019a; Dixon et al., 2018). Given that many detoxification techniques rely heavily on toxicity classifiers, we investigate how detoxification affects topic and dialect coverage with respect to different identity groups. We also discuss potential representational harms (Barocas et al., 2017) which can arise from disparities in the effectiveness of LM toxicity mitigation across different dialects.

Datasets We use the gender and ethnicity domains of the BOLD dataset (Dhamala et al., 2021) to evaluate topic coverage. The former contains Wikipedia sentences about female and male actors; the latter contains sentences about people with different ethnic backgrounds. We evaluate dialectal coverage using the TwitterAAE dataset introduced by Blodgett et al. (2016), where we use tweets from the African-American English (AAE) and White-Aligned English (WAE) subsets. We hope that future work can also consider a broader array of groups, including unobserved (Tomasev et al., 2021) and flexible (Andrus et al., 2021) categories. Further dataset details are in Appendix B.1.

8.1 Topic-related Biases

We investigate the effects of toxicity reduction on the LM's topic coverage, i.e. its ability to model text about various identity groups. Figure 5 shows that train-time filtering – while generally leading to increased loss – indeed has a disparate impact on topic coverage when measured via loss gaps relative to a standard LM on the same documents. This holds for both gender (Figure 5a) and ethnic (Figure 5b) groups.
Figure 5: LM loss gap between a standard LM and the train-filter@X LMs (denoted as tf@X), on different subsets of BOLD (gender and ethnicity) and TwitterAAE (demographic dialects). Panels: (a) Gender; (b) Ethnicity; (c) Demographic dialect. Some subsets already have substantially higher loss under a standard LM; we calculate the loss gap in order to avoid this as a potential confounding factor. While toxicity reduction increases loss on all subsets, the impact is largest for marginalized groups.

While the standard model has similar loss for text about female and male actors (3.414 vs. 3.412), detoxification introduces gender bias, leading to larger LM loss for female actors relative to male actors. Similarly, we observe that LM loss deterioration is stronger for marginalized ethnic groups compared to European-Americans. Although the standard LM has the lowest loss for Hispanic-American-related text (3.46 vs. 3.68 for European-American), Hispanic-American text sees the largest negative impact of detoxification. This indicates that detoxification techniques may introduce biases distinct from those already existing in LMs.
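The loss-gap measurement used throughout this section can be made concrete with a short sketch; the two scoring callables are hypothetical stand-ins for per-token NLL evaluation under each model.

    def loss_gap(nll_detoxified, nll_standard, documents):
        """Average per-token loss difference on one subset (e.g. a BOLD
        gender slice or a TwitterAAE dialect slice). Comparing gaps rather
        than raw losses controls for subsets that are already harder for
        the standard LM."""
        gaps = [nll_detoxified(d) - nll_standard(d) for d in documents]
        return sum(gaps) / len(gaps)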
8.2 Dialect-related Biases

Disparate Positive Rates for Tweets Based on Demographic Dialect Besides lexical biases, toxicity classifiers have also been shown to exhibit dialectal biases (Sap et al., 2019a). Our analysis shows that TwitterAAE tweets are more likely to be classified as toxic (details in Appendix G.2), congruent with prior work (Zhou et al., 2021) demonstrating bias against AAE in toxicity classifiers. This suggests that toxicity reduction interventions might adversely affect dialectal coverage. Investigating this further, we next analyze impacts on a LM's ability to model language from different demographic dialects.

Disparate Impacts on Dialect Coverage Figure 5c shows relative loss gaps between the detoxified and the standard models, for both AAE and WAE tweets. Consistent with Xu et al. (2021), we find that detoxification has a larger impact on AAE coverage than on WAE coverage. We note that AAE tweets already have substantially higher loss under a standard LM (5.53 vs. 4.77), which is likely a result of the underrepresentation of AAE in C4 (0.07% of all documents), as highlighted by Dodge et al. (2021). This bias is further amplified by detoxification.

LM Toxicity Reduction with Prompts from Different Dialects Next we measure the effectiveness of LM detoxification for prompts in different dialects, using the TwitterAAE tweets in AAE and WAE to prompt the LM. We first apply the automatic metrics from Section 5 to the LM-generated continuations, as shown in Table 3. This shows substantially higher values for AAE prompts than for WAE prompts under the standard LM (e.g. 0.72 vs. 0.59 Probability of Toxicity). LM detoxification reduces the automatic toxicity metrics in both dialects, but average LM toxicity scores remain substantially higher for AAE prompts after detoxification (e.g. 0.22 vs. 0.14 Probability of Toxicity).

                         Exp. Max. Toxicity    Prob. of Toxicity
    Model                AAE       WAE         AAE       WAE
    standard             0.66      0.58        0.72      0.59
    train-filter@0.05    0.39      0.34        0.22      0.14

Table 3: Expected Maximum Toxicity and Probability of Toxicity for a standard LM and a train-filter@0.05 model, as in Table 1, with TwitterAAE tweets as prompts.

Turning to human evaluation, we collect 100 samples for each setting (model × dialect), following the evaluation protocol in Section 6. Table 4 shows that the train-filter@0.05 LM also reduces average human toxicity scores, in particular for AAE. In contrast to what the automatic evaluation may suggest, in this human evaluation we find similar levels of toxicity between the dialects, underscoring the limitations of using automatic evaluation alone.

    Model                AAE            WAE
    standard             0.11 (0.04)    0.10 (0.02)
    train-filter@0.05    0.02 (0.03)    0.04 (0.04)

Table 4: Average human toxicity scores for model completions of AAE and WAE prompts from TwitterAAE. Standard errors are given in parentheses.

8.3 Limitations of Likelihood for Bias Evaluation

Our above evaluations of LM coverage primarily rely on likelihood-based loss metrics.
However, it is worth noting that such an evaluation can potentially underestimate existing LM bias.

For instance, consider the loss gap on the BOLD dataset incurred by a test-time filtering variant which picks the best of K generated samples. The loss gaps are small and similar – between 0.09 and 0.13 across all groups (see Table 11 in Appendix H) – suggesting a minimal impact on topic coverage. Yet even for highly biased classifiers, e.g. a classifier which flags any text mentioning female actors as toxic, the impact on loss-per-token is tightly bounded, based on the following observation:

Observation 1 (Informal). Irrespective of the classifier used for filtering, test-time filtering with a minimum acceptance rate of ε will never increase loss-per-token by more than −n^{-1} ln ε, where n is the document length.
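One way to see why such a bound holds, sketched under the assumption that the filter accepts each document x with probability a(x) ≥ ε: the filtered model is

    \[
      q(x) = \frac{p(x)\,a(x)}{Z}, \qquad Z = \mathbb{E}_{x \sim p}\,[a(x)] \le 1 ,
    \]
    \[
      -\tfrac{1}{n}\ln q(x)
        = -\tfrac{1}{n}\ln p(x) - \tfrac{1}{n}\ln a(x) + \tfrac{1}{n}\ln Z
        \le -\tfrac{1}{n}\ln p(x) - \tfrac{1}{n}\ln \epsilon ,
    \]

since a(x) ≥ ε and ln Z ≤ 0; the per-token loss thus increases by at most −n^{-1} ln ε, e.g. about 0.014 nats for ε = 0.25 and n = 100.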
                                                          lecting human ratings, including potential negative
   The formal statement and proof are included in
                                                          psychological impact on annotators.
Appendix H. Thus, LMs with low loss can still have
bad samples, including effects concentrated on par-          Finally, our detoxification increases LM loss,
ticular topics and dialects. Although this example        and introduces and amplifies social biases in topic
refers specifically to test-time filtering, similar un-   and dialect coverage, potentially leading to de-
derlying concerns also apply to other filtering tech-     creased LM performance for marginalized groups.
niques, including train-time filtering, fine-tuning,      We note that although this problem exists in current
or PPLM. Similar observations have been made pre-         methods, this tradeoff is not necessarily unavoid-
viously (van den Oord and Dambre, 2015); we add           able, particularly if future work enables less biased
that these limitations become particularly salient        classifiers. Alongside toxicity, future work should
when using filtering-based techniques.                    consider other metrics, such as loss gaps for dif-
                                                          ferent topics and dialects. As noted in Section 8.3,
   We thus recommend caution in interpreting
                                                          loss gaps are an imperfect metric; future work on
likelihood-based metrics: while large loss gaps
                                                          developing quantitative metrics for LM bias could
can demonstrate high bias, small loss gaps do not
                                                          help better understand trade-offs in mitigating toxi-
automatically imply low bias.
                                                          city.
9   Conclusion
                                                          10   Ethical Considerations
In this work, we have examined and discussed chal-
lenges of LM toxicity evaluation and side-effects of      Our goal in this work is to reduce harms from LMs
automatic toxicity mitigation using a combination         by better understanding how to detoxify LMs, and
of relatively simple toxicity reduction approaches        characterizing any trade-offs that occur when detox-
and previously published methods. We have high-           ifying LMs. During the course of our research, we
lighted the discrepancy between conventional met-         encountered a variety of ethical questions, includ-
rics of toxicity and what is perceived by humans.         ing how to ethically collect human annotations for
This points towards a research roadmap of defin-          toxic language (detailed in Section 10.1).
ing metrics that better align with perceived toxicity,       As discussed in Section 3, toxicity is subjective
and ill-defined. The definition of what is “toxic” or “offensive” may differ between social groups and cultures. Language acceptable to those who wield more privilege may be offensive to those who wield less privilege. While our current methods might mitigate toxicity as defined by some people, they may not be sufficient for others.

In this work, we only consider English LMs, though there are over 7,000 languages spoken throughout the world (Joshi et al., 2020), and we recommend caution when generalizing our findings to non-English LMs. We note that the PERSPECTIVE API includes toxicity classifiers for six languages besides English,9 though we do not attempt to mitigate toxicity on non-English LMs with non-English classifiers here. However, ethical deployment of LMs requires equitable access and safety also for non-English speakers.

9 When considering production level for the TOXICITY attribute: https://developers.perspectiveapi.com/s/about-the-api-attributes-and-languages
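For concreteness, a toxicity request for one of these languages looks roughly as follows. This sketch follows the public PERSPECTIVE API reference at the time of writing; the key is a placeholder, and field names may change.

    # Sketch of a PERSPECTIVE API toxicity request for German text,
    # following the public API reference; the API key is a placeholder.
    import requests

    API_KEY = "YOUR_API_KEY"
    url = ("https://commentanalyzer.googleapis.com/v1alpha1/"
           "comments:analyze?key=" + API_KEY)
    payload = {
        "comment": {"text": "Ein Beispieltext."},
        "languages": ["de"],
        "requestedAttributes": {"TOXICITY": {}},
    }
    response = requests.post(url, json=payload).json()
    score = response["attributeScores"]["TOXICITY"]["summaryScore"]["value"]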
There are many more facets to the potential harms of LMs than we have considered in this paper. Here we discuss one important dimension, but other potential harms have been discussed in prior work, such as, but not limited to, statistical biases (Sheng et al., 2019; Huang et al., 2020; Abid et al., 2021), privacy concerns (Carlini et al., 2020), and environmental impact (Strubell et al., 2019), alongside points raised by Bender et al. (2021), which should also be considered when striving for ethical LMs.

10.1   Human Evaluation

Asking humans to annotate toxicity necessarily exposes them to toxic language. Before we conducted our study, it was reviewed by DeepMind’s Human Behavioural Research Ethics Committee (HuBREC).

Participants were recruited through Google’s internal labeling platform, a service that hires contractors to complete tasks. Annotators are hired to perform a variety of annotation tasks and are paid based on time worked, not per HIT completed. We design our human evaluation experiments, then work with the annotation platform to ensure annotators understand the task. Annotator training (including a module on wellbeing) takes approximately one hour. Uncertainty in the task is directly communicated to us (the researchers). In our initial annotation pilot, the authors also annotated sentences and observed similar trends to the annotators.

Because of the sensitive nature of annotating toxic language, we ensured that several options were available to annotators. Annotators could choose to split their time between our task and other tasks which did not include toxic content. Annotators were given the option to (and did) opt out of annotating data for our task. Annotators self-determined the amount of time they annotated our data and had access to employee resources for well-being concerns caused by our annotation task. We tracked well-being via a well-being survey. Results of this survey are detailed in Appendix E.4.

We acknowledge that our annotation instructions do not include race and dialect priming as introduced by Sap et al. (2019a) to mitigate racial bias in hate speech annotations. Thus some of our annotators may be unaware that identity groups, and specifically African-Americans, reclaim offensive and racist terms and use them safely. However, we annotate LM continuations, not human-written language. As LMs do not have an identity, we do not believe it is safe for generated language to include reclaimed terms, even if they can be safely used by members of marginalized groups. We acknowledge that there are applications for which this approach would be incorrect.

11   Acknowledgements

We would like to thank James Besley, Phil Blunsom, Taylan Cemgil, Sanah Choudhry, Iason Gabriel, Geoffrey Irving, Maribeth Rauh, Sebastian Ruder, and Laura Weidinger for comments and discussion on earlier versions of this draft, as well as Lucy Vasserman and Jeffrey Sorensen for providing support on using the PERSPECTIVE API. We have shared the findings of this work with the Jigsaw team.

References

Abubakar Abid, Maheen Farooqi, and James Zou. 2021. Persistent anti-Muslim bias in large language models. CoRR, abs/2101.05783.

McKane Andrus, Elena Spitzer, Jeffrey Brown, and Alice Xiang. 2021. What we can’t measure, we can’t understand: Challenges to demographic data procurement in the pursuit of fairness. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 249–260.

Solon Barocas, Kate Crawford, Aaron Shapiro, and Hanna Wallach. 2017. The problem with bias: From allocative to representational harms in machine learning. Special Interest Group for Computing, Information and Society (SIGCIS), 2.
Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21, pages 610–623, New York, NY, USA. Association for Computing Machinery.

Su Lin Blodgett, Lisa Green, and Brendan O’Connor. 2016. Demographic dialectal variation in social media: A case study of African-American English. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1119–1130, Austin, Texas. Association for Computational Linguistics.

Shikha Bordia and Samuel R. Bowman. 2019. Identifying and reducing gender bias in word-level language models. arXiv preprint arXiv:1904.03035.

Daniel Borkan, Lucas Dixon, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. 2019. Nuanced metrics for measuring unintended bias with real data for text classification. CoRR, abs/1903.04561.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners.

Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. 2020. Extracting training data from large language models. arXiv preprint arXiv:2012.07805.

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988, Florence, Italy. Association for Computational Linguistics.

Brandon Dang, Martin J. Riedl, and Matthew Lease. 2018. But who protects the moderators? The case of crowdsourced image moderation. arXiv preprint arXiv:1804.10999.

Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. 2020. Plug and play language models: A simple approach to controlled text generation. In International Conference on Learning Representations.

Thomas Davidson, Debasmita Bhattacharya, and Ingmar Weber. 2019. Racial bias in hate speech and abusive language detection datasets. In Proceedings of the Third Workshop on Abusive Language Online, pages 25–35, Florence, Italy. Association for Computational Linguistics.

Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. Automated hate speech detection and the problem of offensive language. In ICWSM.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Jwala Dhamala, Tony Sun, Varun Kumar, Satyapriya Krishna, Yada Pruksachatkun, Kai-Wei Chang, and Rahul Gupta. 2021. BOLD: Dataset and metrics for measuring biases in open-ended language generation. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21, pages 862–872, New York, NY, USA. Association for Computing Machinery.

Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. 2018. Measuring and mitigating unintended bias in text classification. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, AIES ’18, pages 67–73.

Jesse Dodge, Maarten Sap, Ana Marasovic, William Agnew, Gabriel Ilharco, Dirk Groeneveld, and Matt Gardner. 2021. Documenting the English colossal clean crawled corpus. arXiv preprint arXiv:2104.08758.

Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. 2020. RealToxicityPrompts: Evaluating neural toxic degeneration in language models. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3356–3369, Online. Association for Computational Linguistics.

Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don’t stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342–8360, Online. Association for Computational Linguistics.

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration. In International Conference on Learning Representations.

Po-Sen Huang, Huan Zhang, Ray Jiang, Robert Stanforth, Johannes Welbl, Jack Rae, Vishal Maini, Dani Yogatama, and Pushmeet Kohli. 2020. Reducing sentiment bias in language models via counterfactual evaluation. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 65–83, Online. Association for Computational Linguistics.

Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2020. The state and fate of linguistic diversity and inclusion in the NLP world. arXiv preprint arXiv:2004.09095.

Muhammad Khalifa, Hady Elsahar, and Marc Dymetman. 2020. A distributional approach to controlled text generation. CoRR, abs/2012.11635.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Ben Krause, Akhilesh Deepak Gotmare, Bryan McCann, Nitish Shirish Keskar, Shafiq Joty, Richard Socher, and Nazneen Rajani. 2021. GeDi: Generative discriminator guided sequence generation.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.

Irene Kwok and Y. Wang. 2013. Locate the hate: Detecting tweets against blacks. In AAAI.

Léo Laugier, John Pavlopoulos, Jeffrey Sorensen, and Lucas Dixon. 2021. Civil rephrases of toxic texts with self-supervised transformers. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1442–1461, Online. Association for Computational Linguistics.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2015. A diversity-promoting objective function for neural conversation models. CoRR, abs/1510.03055.

Andrea Madotto, Etsuko Ishii, Zhaojiang Lin, Sumanth Dathathri, and Pascale Fung. 2020. Plug-and-play conversational models. CoRR, abs/2010.04344.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2017. Pointer sentinel mixture models. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net.

Cicero Nogueira dos Santos, Igor Melnyk, and Inkit Padhi. 2018. Fighting offensive language on social media with unsupervised text style transfer. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 189–194, Melbourne, Australia. Association for Computational Linguistics.

Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. 2016. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1525–1534, Berlin, Germany. Association for Computational Linguistics.

Ji Ho Park, Jamin Shin, and Pascale Fung. 2018. Reducing gender bias in abusive language detection. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2799–2804, Brussels, Belgium. Association for Computational Linguistics.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Björn Ross, Michael Rist, Guillermo Carbonell, Ben Cabrera, Nils Kurowsky, and Michael Wojatzki. 2016. Measuring the reliability of hate speech annotations: The case of the European refugee crisis. In Proceedings of NLP4CMC III: 3rd Workshop on Natural Language Processing for Computer-Mediated Communication, pages 6–9.

Maarten Sap, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A. Smith. 2019a. The risk of racial bias in hate speech detection. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1668–1678, Florence, Italy. Association for Computational Linguistics.

Maarten Sap, Saadia Gabriel, Lianhui Qin, Dan Jurafsky, Noah A. Smith, and Yejin Choi. 2019b. Social bias frames: Reasoning about social and power implications of language. arXiv preprint arXiv:1911.03891.

Timo Schick, Helmut Schmid, and Hinrich Schütze. 2020. Automatically identifying words that can serve as labels for few-shot text classification. In Proceedings of the 28th International Conference on Computational Linguistics, pages 5569–5578, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Timo Schick, Sahana Udupa, and Hinrich Schütze. 2021. Self-diagnosis and self-debiasing: A proposal for reducing corpus-based bias in NLP.

Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. 2019. The woman worked as a babysitter: On biases in language generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3407–3412, Hong Kong, China. Association for Computational Linguistics.

Miriah Steiger, Timir J. Bharucha, Sukrit Venkatagiri, Martin J. Riedl, and Matthew Lease. 2021. The psychological well-being of content moderators. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, CHI ’21.

Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and policy considerations for deep learning in NLP. arXiv preprint arXiv:1906.02243.

Lucas Theis, Aäron van den Oord, and Matthias Bethge. 2015. A note on the evaluation of generative models. arXiv preprint arXiv:1511.01844.

J. Thomas. 1983. Cross-cultural pragmatic failure. Applied Linguistics, 4:91–112.

Nenad Tomasev, Kevin R. McKee, Jackie Kay, and Shakir Mohamed. 2021. Fairness for unobserved characteristics: Insights from technological impacts on queer communities. arXiv preprint arXiv:2102.04257.

Aäron van den Oord and Joni Dambre. 2015. Locally-connected transformations for deep GMMs. In International Conference on Machine Learning (ICML): Deep Learning Workshop, pages 1–8.

Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019. Universal adversarial triggers for attacking and analyzing NLP. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2153–2162, Hong Kong, China. Association for Computational Linguistics.

William Warner and Julia Hirschberg. 2012. Detecting hate speech on the world wide web. In Proceedings of the Second Workshop on Language in Social Media, pages 19–26, Montréal, Canada. Association for Computational Linguistics.

Zeerak Waseem and Dirk Hovy. 2016. Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter. In Proceedings of the NAACL Student Research Workshop, pages 88–93, San Diego, California. Association for Computational Linguistics.

Ellery Wulczyn, Nithum Thain, and Lucas Dixon. 2017. Ex machina: Personal attacks seen at scale. In Proceedings of the 26th International Conference on World Wide Web, pages 1391–1399.

Albert Xu, Eshaan Pathak, Eric Wallace, Suchin Gururangan, Maarten Sap, and Dan Klein. 2021. Detoxifying language models risks marginalizing minority voices.

Jing Xu, Da Ju, Margaret Li, Y-Lan Boureau, Jason Weston, and Emily Dinan. 2020. Recipes for safety in open-domain chatbots.

Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra, and Ritesh Kumar. 2019. SemEval-2019 task 6: Identifying and categorizing offensive language in social media (OffensEval). In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 75–86, Minneapolis, Minnesota, USA. Association for Computational Linguistics.

Xuhui Zhou, Maarten Sap, Swabha Swayamdipta, Yejin Choi, and Noah Smith. 2021. Challenges in automated debiasing for toxic language detection. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 3143–3155, Online. Association for Computational Linguistics.