Diverse Adversaries for Mitigating Bias in Training

Page created by Karen Cortez

IT & Technique

English

Like
Share
Embed
Fullscreen
Slides
Download HTML
Download PDF
Abuse

←

→

Page content transcription

If your browser does not render page correctly, please read the page content below

Diverse Adversaries for Mitigating Bias in Training

                                                    Xudong Han                    Timothy Baldwin                Trevor Cohn
                                                                     School of Computing and Information Systems
                                                                             The University of Melbourne
                                                                                Victoria 3010, Australia
                                                                     xudongh1@student.unimelb.edu.au
                                                                     {tbaldwin,tcohn}@unimelb.edu.au

                                                               Abstract                          maximizing the attacker loss (i.e., preventing pro-
                                                                                                 tected attributes from being detected by the at-
arXiv:2101.10001v1 [cs.LG] 25 Jan 2021

                                             Adversarial learning can learn fairer and less      tacker). Preventing protected attributes from be-
                                             biased models of language than standard meth-       ing detected tends to result in fairer models, as
                                             ods. However, current adversarial techniques
                                                                                                 protected attributes will more likely be indepen-
                                             only partially mitigate model bias, added to
                                             which their training procedures are often unsta-
                                                                                                 dent rather than confounding variables. Although
                                             ble. In this paper, we propose a novel approach     this method leads to demonstrably less biased
                                             to adversarial learning based on the use of mul-    models, there are still limitations, most notably
                                             tiple diverse discriminators, whereby discrimi-     that significant protected information still remains
                                             nators are encouraged to learn orthogonal hid-      in the model’s encodings and prediction outputs
                                             den representations from one another. Experi-       (Wang et al., 2019; Elazar and Goldberg, 2018).
                                             mental results show that our method substan-
                                                                                                    Many different approaches have been proposed
                                             tially improves over standard adversarial re-
                                             moval methods, in terms of reducing bias and
                                                                                                 to strengthen the attacker, including: increasing
                                             the stability of training.                          the discriminator hidden dimensionality; assign-
                                                                                                 ing different weights to the adversarial compo-
                                         1 Introduction                                          nent during training; using an ensemble of ad-
                                                                                                 versaries with different initializations; and reini-
                                         While NLP models have achieved great successes,         tializing the adversarial weights every t epochs
                                         results can depend on spurious correlations with        (Elazar and Goldberg, 2018). Of these, the en-
                                         protected attributes of the authors of a given text,    semble method has been shown to perform best,
                                         such as gender, age, or race. Including protected       but independently-trained attackers can generally
                                         attributes in models can lead to problems such          still detect private information after adversarial re-
                                         as leakage of personally-identifying information        moval.
                                         of the author (Li et al., 2018a), and unfair mod-          In this paper, we adopt adversarial debiasing ap-
                                         els, i.e., models which do not perform equally          proaches and present a novel way of strengthen-
                                         well for different sub-classes of user. This kind       ing the adversarial component via orthogonality
                                         of unfairness has been shown to exist in many           constraints (Salzmann et al., 2010). Over a senti-
                                         different tasks, including part-of-speech tagging       ment analysis dataset with racial labels of the doc-
                                         (Hovy and Søgaard, 2015) and sentiment analysis         ument authors, we show our method to result in
                                         (Kiritchenko and Mohammad, 2018).                       both more accurate and fairer models, with privacy
                                            One approach to diminishing the influence of         leakage close to the lower-bound.1
                                         protected attributes is to use adversarial methods,
                                         where an encoder attempts to prevent a discrimi-        2 Methodology
                                         nator from identifying the protected attributes in
                                         a given task (Li et al., 2018a). Specifically, an ad-   Formally, given an input xi annotated with main
                                         versarial network is made up of an attacker and         task label yi and protected attribute label gi , a main
                                         encoder, where the attacker detects protected in-       task model M is trained to predict ŷi = M (xi ),
                                         formation in the representation of the encoder,         and an adversary, aka “discriminator”, A is trained
                                         and the optimization of the encoder incorporates          1
                                                                                                     Source     code       available    at
                                         two parts: (1) minimizing the main loss, and (2)        https://github.com/HanXudong/Diverse_Adversaries_for_M

ŷi novel means of strengthening the adversarial com-
CM ponent. Figure 1 shows a typical ensemble archi-
tecture where k sub-discriminators are included in
the adversarial component, leading to an averaged
xi hM,i hA1 ,i ĝA1 ,i adversarial regularisation term:
EM EA1 CA 1
.. .. λadv X
. . − X (g, ĝAj ).
k
hAk ,i ĝAk ,i j∈{1,...,k}
EAk CA k
One problem associated with this ensemble ar-
Figure 1: Ensemble adversarial method. Dashed chitecture is that it cannot ensure that different sub-
lines denote gradient reversal in adversarial learning. discriminators focus on different aspects of the
The k sub-discriminators Ai are independently ini- representation. Indeed, experiments have shown
tialized. Given a single input xi , the main task en- that sub-discriminator ensembles can weaken
coder computes a hidden representation hM,i , which is
the adversarial component (Elazar and Goldberg,
used as the input to the main model output layer and
sub-discriminators. From the k-th sub-discriminator, 2018). To address this problem, we further intro-
the estimated protected attribute label is ĝAk ,i = duce a difference loss (Bousmalis et al., 2016) to
CAk (EAk (hM,i )). encourage the adversarial encoders to encode dif-
ferent aspects of the private information. As can
be seen in Figure 1, hAk ,i denotes the output from
to predict ĝi = A(hM,i ) from M ’s last hidden the k-th sub-discriminator encoder given a hidden
layer representation hM,i . In this paper, we treat a representation hM,i , i.e., hAk ,i = EAk (hM,i ).
neural network classifier as a combination of two The difference loss encourages orthogonality
connected parts: (1) an encoder E, and (2) a lin- between the encoding representations of each pair
ear classifier C. For example, in the main task of sub-discriminators:
model M , the encoder EM is used to compute the
2
hidden representation hM,i from an input xi , i.e.,
X
Ldiff = λdiff hAi ⊺ hAj 1(i 6= j),
hM,i = EM (xi ), and the decoder is used to make i,j∈{1,...,k}
F
a prediction, ŷi = CM (hM,i ). Similarly, for a dis-
criminator, ĝi = A(hM,i ) = CA (EA (hM,i )). wherek·k2F is the squared Frobenius norm.
Intuitively, sub-discriminator encoders must
2.1 Adversarial Learning
learn different ways of identifying protected in-
Following the setup of Li et al. (2018a) and formation given the same input embeddings, re-
Elazar and Goldberg (2018) the optimisation ob- sulting in less biased models than the standard
jective for our standard adversarial training is: ensemble-based adversarial method. According
to Bousmalis et al. (2016), the difference loss has
min max X (y, ŷM ) − λadv X (g, ĝA ), the additional advantage of also being minimized
M A
when hidden representations shrink to zero. There-
where X is cross entropy loss, and λadv is the fore, instead of minimizing the difference loss by
trade-off hyperparameter. Solving this minimax learning rotated hidden representations (i.e., the
optimization problem encourages the main task same model), this method biases adversaries to
model hidden representation hM to be informa- have representations that are a) orthogonal, and b)
tive to CM and uninformative to A. Following low magnitude; the degree to which is given by
Ganin and Lempitsky (2015), the above can be weight decay of the optimization function.
trained using stochastic gradient optimization with
a gradient reversal layer for X (g, ĝA ). 2.3 INLP
We include Iterative Null-space Projection
2.2 Differentiated Adversarial Ensemble (“INLP”: Ravfogel et al. (2020)) as a baseline
Inspired by the ensemble adversarial method method for mitigating bias in trained models, in
(Elazar and Goldberg, 2018) and domain sep- addition to standard and ensemble adversarial
aration networks (Bousmalis et al., 2016), we methods. In INLP, a linear discriminator (Alinear )
present differentiated adversarial ensemble, a of the protected attribute is iteratively trained

Model                              Accuracy↑     TPR Gap↓       TNR Gap↓       Leakage@h↓       Leakage@ŷ↓
 Random                             50.00±0.00      0.00±0.00     0.00±0.00         —                —
 Fixed Encoder                      61.44±0.00      0.52±0.00    17.97±0.00     92.07±0.00       86.93±0.00
 Standard                           71.59±0.05    31.81±0.29     48.41±0.27     85.56±0.20       70.09±0.19
 INLP                               68.54±1.05    25.13±2.31     40.70±5.02     66.64±0.87       66.19±0.79
 Adv Single Discriminator           74.25±0.39    13.01±3.83     28.55±3.60     84.33±0.98       61.48±2.17
 Adv Ensemble                       74.08±0.99    12.04±3.50     31.76±3.19     85.31±0.51       63.23±3.62
 Differentiated Adv Ensemble        74.52±0.28      8.42±1.84    24.74±2.07     84.52±0.50       61.09±2.32

Table 1: Evaluation results ± standard deviation (%) on the test set, averaged over 10 runs with different random
seeds. Bold = best performance. “↑” and ”↓” indicate that higher and lower performance, resp., is better for the
given metric. Leakage measures the accuracy of predicting the protected attribute, over the final hidden represen-
tation h or model output ŷ. Since the Fixed Encoder is not designed for binary sentiment classification, we merge
the original 64 labels into two categories based on the results of hierarchical clustering.

from pre-computed fixed hidden representations            tive Rate (TNR), respectively, across different pro-
(i.e., hM ) to project them onto the linear discrim-      tected attributes (De-Arteaga et al., 2019). This
inator’s null-space, h∗M = PN (Alinear ) hM , where       measurement is related to the criterion that the pre-
PN (Alinear ) is the null-space projection matrix         diction ŷ is conditionally independent of the pro-
of Alinear . In doing so, it becomes difficult for        tected attribute g given the main task label y (i.e.,
the protected attribute to be linearly identified         ŷ⊥g|y). Assuming a binary protected attribute,
from the projected hidden representations (h∗M ),         this conditional independence requires P{ŷ|y, g =
and any linear main-task classifier (CM   ∗ ) trained     0} = P{ŷ|y, g = 1}, which implies an objective
on h∗M can thus be expected to make fairer                that minimizes the difference (GAP) between the
predictions.                                              two sides of the equation.

3 Experiments                                             Linear Leakage We also measure the leakage of
                                                          protected attributes. A model is said to leak infor-
Fixed Encoder Following Elazar and Goldberg               mation if the protected attribute can be predicted
(2018) and Ravfogel et al. (2020), we use the             at a higher accuracy than chance, in our case, from
DeepMoji model (Felbo et al., 2017) as a fixed-           the hidden representations the fixed encoder gener-
parameter encoder (i.e. it is not updated during          ates. We empirically quantify leakage with a linear
training). The DeepMoji model is trained over             support vector classifier at two different levels:
1246 million tweets containing one of 64 common               • Leakage@h: the accuracy of recovering the
emojis. We merge the 64 emoji labels output by                  protected attribute from the output of the fi-
DeepMoji into two super-classes based on hierar-                nal hidden layer after the activation function
chical clustering: ‘happy’ and ‘sad’.                           (hM ).
Models The encoder EM consists of a fixed                     • Leakage@ŷ: the accuracy of recovering the
pretrained encoder (DeepMoji) and two trainable                 protected attribute from the output ŷ (i.e., the
fully connected layers (“Standard” in Table 1). Ev-             logits) of the main model.
ery linear classifier (C) is implemented as a dense
                                                          Data We experiment with the dataset of
layer.
                                                          Blodgett et al. (2016), which contains tweets
   For protected attribute prediction, a discrimina-
                                                          that are either African American English (AAE)-
tor (A) is a 3-layer MLP where the first 2 layers are
                                                          like or Standard American English (SAE)-like
collectively denoted as EA , and the output layer is
                                                          (following Elazar and Goldberg (2018) and
denoted as CA .
                                                          Ravfogel et al. (2020)). Each tweet is annotated
TPR-GAP and TNR-GAP In classification                     with a binary “race” label (on the basis of AAE
problems, a common way of measuring bias is               or SAE) and a binary sentiment score, which is
TPR-GAP and TNR-GAP, which evaluate the gap               determined by the (redacted) emoji within it.
in the True Positive Rate (TPR) and True Nega-               In total, the dataset contains 200k instances,

perfectly balanced across the four race–sentiment Results and Analysis Table 1 shows the results
combinations. To create bias in the dataset, we over the test set. Training on a biased dataset
follow previous work in skewing the training data without any fairness restrictions leads to a biased
to generate race–sentiment combinations (AAE– model, as seen in the Gap and Leakage results for
happy, SAE–happy, AAE–sad, and SAE–sad) of the Standard model. Consistent with the find-
40%, 10%, 10%, and 40%, respectively. Note that ings of Ravfogel et al. (2020), INLP can only re-
we keep the test data unbiased. duce bias at the expense of overall performance.
On the other hand, the Single Discriminator and
Adv(ersarial) Ensemble baselines both enhance ac-
curacy and reduce bias, consistent with the find-
Training Details All models are trained and ings of Li et al. (2018a).
evaluated on the same training/test split. The Compared to the Adv Ensemble baseline, in-
Adam optimizer (Kingma and Ba, 2015) is used corporating the difference loss in our method has
with learning rates of 3 × 10−5 for the main model two main benefits: training is more stable (re-
and 3 × 10−6 for the sub-discriminators. The sults have smaller standard deviation), and there
minibatch size is set to 1024. Sentence represen- is less bias (the TPR and TNR Gap are smaller).
tations (2304d) are extracted from the DeepMoji Without the orthogonality factor, Ldiff , the sub-
encoder. The hidden size of each dense layer discriminators tend to learn similar representa-
is 300 in the main model, and 256 in the sub- tions, and the ensemble degenerates to a standard
discriminators. We train M for 60 epochs and adversarial model. Simply relying on random ini-
each A for 100 epochs, keeping the checkpoint tialization to ensure sub-discriminator diversity, as
model that performs best on the dev set. Sim- is done in the Adv Ensemble method, is insuf-
ilar to Elazar and Goldberg (2018), hyperparam- ficient. The orthogonality regularization in our
eters (λadv and λdiff ) are tuned separately rather method leads to more stable and overall better re-
than jointly. λadv is tuned to 0.8 based on the sults in terms of both accuracy and TPR/TNR Gap.
standard (single-discriminator) adversarial learn-
ing method, and this setting is used for all other ad- As shown in Table 1, even the Fixed Encoder
model leaks protected information, as a result
versarial methods. When tuning λadv , we consid-
of implicit biases during pre-training. INLP
ered both overall performance and bias gap (both
achieves significant improvement in terms of re-
over the dev data). Since adversarial training can
ducing linear hidden representation leakage. The
increase overall performance while decreasing the
reason is that Leakage@h is directly correlated
bias gap (see Figure 2), we select the adversar-
with the objective of INLP, in minimizing the lin-
ial model that achieves the best task performance.
ear predictability of the protected attribute from
For adversarial ensemble and differentiated mod-
the h. Adversarial methods do little to
els, we tune the hyperparameters (number of sub
attackers and λdiff ) to achieve a similar bias level mitigate Leakage@h, but substantially decrease
while getting the best overall performance. To Leakage@ŷ in the model output. However, both
compare with a baseline ensemble method with types of leakage are well above the ideal value
of 50%, and therefore none of these methods can
a similar number of parameters, we also report
be considered as providing meaningful privacy, in
results for an adversarial ensemble model with 3
part because of the fixed encoder. This finding im-
sub-discriminators. The scalar hyperparameter of
plies that when applying adversarial learning, the
the difference loss (λdiff ) is tuned through grid
pretrained model needs to be fine-tuned with the
search from 10−4 to 104 , and set to 103.7 . For the
adversarial loss to have any chance of generating
INLP experiments, fixed sentence representations
a truly unbiased hidden representation. Despite
are extracted from the same data split. Following
this, adversarial training does reduce the TPR and
Ravfogel et al. (2020), in the INLP experiments,
both the discriminator and the classifier are imple- TNR Gap, and improves overall accuracy, which
mented in scikit-learn as linear SVM classifiers illustrates the utility of the method for both bias
(Pedregosa et al., 2011). We report Leakage@ŷ mitigation and as a form of regularisation.
for INLP based on the predicted confidence scores, Overall, our proposed method empirically out-
which could be interpreted as logits, of the linear performs the baseline models in terms of debias-
SVM classifiers. ing, with a better performance–fairness trade-off.

80                                                        80
 Accuracy (%)

                                                           Accuracy (%)
                70
                                                                          75

                60
                                                                          70

                40                                                        40
 GAP (%)

                                                           Gap (%)
                                                                          30
                20
                                                                          20

                                                                          10
                 0
                     −2   −1       0         1     2                           0    1            2     3         4

                               log 10 λadv                                              log10 λdiff

Figure 2: λadv sensitivity analysis, averaged over 10     Figure 3: λdiff sensitivity analysis for differentiated ad-
runs for a single discriminator adversarial model. Main   versarial models with 3 (“         ”), 5 (“      ”), and 8
task accuracy of group SAE (blue) and AAE (orange),       (“     ”) sub-discriminators, in terms of the main task
TPR-GAP (green), and TNR-GAP (red) are reported.          accuracy of group SAE (blue) and AAE (orange), and
                                                          TPR-GAP (green) and TNR-GAP (red).

Robustness to λadv We first evaluate the influ-
ence of the trade-off hyperparameter λadv in ad-          an overly large value makes the sub-discriminators
versarial learning. As can be seen from Figure 2,         underfit, and both reduces accuracy and increases
λadv controls the performance–fairness trade-off.         TPR/TNR Gap. We observe a negative correla-
Increasing λadv from 10−2 to around 10−0 , TPR            tion between N and λdiff , the main reason being
Gap and TNR Gap consistently decrease, while              that Ldiff is not averaged over N and as a result,
the accuracy of each group rises. To balance up           a large N and λdiff force the sub-discriminators to
accuracy and fairness, we set λadv to 10−0.1 . We         pay too much attention to orthogonality, impeding
also observe that an overly large λadv can lead to a      their ability to bleach out the protected attributes.
more biased model (starting from about 101.2 ).              Overall, we empirically show that λdiff only
                                                          needs to be tuned for Adv Ensemble, since the re-
Robustness to λdiff Figure 3 presents the results
                                                          sults for different Differentiated Adv models for a
of our model with different λdiff values, for N ∈
                                                          given setting achieve similar results. I.e., λdiff can
{3, 5, 8} sub-discriminators.
                                                          safely be tuned separately with all other hyperpa-
   First, note that when λdiff is small (i.e., the
                                                          rameters fixed.
left side of Figure 3), our Differentiated Adv En-
semble model generalizes to the standard Adv
Ensemble model. For differing numbers of sub-             4 Conclusion and Future Work
discriminators, performance is similar, i.e., in-
creasing the number of sub-discriminators beyond          We have proposed an approach to enhance sub-
3 does not improve results substantially, but does        discriminators in adversarial ensembles by intro-
come with a computational cost. This implies              ducing a difference loss. Over a tweet sentiment
that an Adv Ensemble model learns approximately           classification task, we showed that our method
the same thing as larger ensembles (but more effi-        substantially improves over standard adversarial
ciently), where the sub-discriminators can only be        methods, including ensemble-based methods.
explicitly differentiated by their weight initializa-        In future work, we intend to perform experi-
tions (with different random seeds), noting that all      mentation over other tasks. Theoretically, our ap-
sub-discriminators are otherwise identical in archi-      proach is general-purpose, and can be used not
tecture, input, and optimizer.                            only for adversarial debiasing but also any other
   Increasing the weight of the difference loss           application where adversarial training is used,
through λdiff has a positive influence on results, but    such as domain adaptation (Li et al., 2018b).

Acknowledgments on Lexical and Computational Semantics, pages
43–53.
We thank Lea Frermann, Shivashankar Subrama-
nian, and the anonymous reviewers for their help- Yitong Li, Timothy Baldwin, and Trevor Cohn. 2018a.
Towards robust and privacy-preserving text representations.
ful feedback and suggestions. In Proceedings of the 56th Annual Meeting of the
Association for Computational Linguistics (Volume
2: Short Papers), pages 25–30.
References
Yitong Li, Timothy Baldwin, and Trevor Cohn. 2018b.
Su Lin Blodgett, Lisa Green, What’s in a domain? learning domain-robust text representations usin
and Brendan O’Connor. 2016. In Proceedings of the 2018 Conference of the North
Demographic dialectal variation in social media: A case study of African-American
American Chapter of theEnglish.
Association for Computa-
In Proceedings of the 2016 Conference on Empirical tional Linguistics: Human Language Technologies,
Methods in Natural Language Processing, pages Volume 2 (Short Papers), pages 474–479.
1119–1130.
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel,
Konstantinos Bousmalis, George Trigeorgis, Nathan B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer,
Silberman, Dilip Krishnan, and Dumitru Erhan. R. Weiss, V. Dubourg, J. Vanderplas, A. Passos,
2016. Domain separation networks. In Advances in D. Cournapeau, M. Brucher, M. Perrot, and E. Duch-
Neural Information Processing Systems, pages 343– esnay. 2011. Scikit-learn: Machine learning in
351. Python. Journal of Machine Learning Research,
12:2825–2830.
Maria De-Arteaga, Alexey Romanov, Hanna Wal-
lach, Jennifer Chayes, Christian Borgs, Alexan- Shauli Ravfogel, Yanai Elazar, Hila Gonen,
dra Chouldechova, Sahin Geyik, Krishnaram Michael Twiton, and Yoav Goldberg. 2020.
Kenthapadi, and Adam Tauman Kalai. 2019. Null it out: Guarding protected attributes by iterative nullspace projec
Bias in bios: A case study of semantic representation bias inInaProceedings
high-stakes setting.
of the 58th Annual Meeting of the
In Proceedings of the Conference on Fairness, Ac- Association for Computational Linguistics, pages
countability, and Transparency, FAT* ’19, page 7237–7256.
120–128.
Mathieu Salzmann, Carl Henrik Ek, Raquel Urtasun,
Yanai Elazar and Yoav Goldberg. 2018. and Trevor Darrell. 2010. Factorized orthogonal la-
Adversarial removal of demographic attributes from text data.tent spaces. In Proceedings of the Thirteenth Inter-
In Proceedings of the 2018 Conference on Empirical national Conference on Artificial Intelligence and
Methods in Natural Language Processing, pages Statistics, pages 701–708.
11–21.
Tianlu Wang, Jieyu Zhao, Mark Yatskar, Kai-Wei
Bjarke Felbo, Alan Mislove, Anders Søgaard, Chang, and Vicente Ordonez. 2019. Balanced
Iyad Rahwan, and Sune Lehmann. 2017. datasets are not enough: Estimating and mitigating
Using millions of emoji occurrences to learn any-domain representations
gender bias infor detecting
deep sentiment, emotion
image representations. In and
Pro-sarcasm.
In Proceedings of the 2017 Conference on Empirical ceedings of the IEEE International Conference on
Methods in Natural Language Processing, pages Computer Vision, pages 5310–5319.
1615–1625.
Yaroslav Ganin and Victor Lempitsky. 2015. Unsu-
pervised domain adaptation by backpropagation. In
Proceedings of the 32nd International Conference
on Machine Learning (ICML 2015), pages 1180–
1189.
Dirk Hovy and Anders Søgaard. 2015.
Tagging performance correlates with author age.
In Proceedings of the 53rd Annual Meeting of the
Association for Computational Linguistics and
the 7th International Joint Conference on Natural
Language Processing (Volume 2: Short Papers),
pages 483–488.

Diederick P Kingma and Jimmy Ba. 2015. Adam: A
method for stochastic optimization. In International
Conference on Learning Representations (ICLR).
Svetlana Kiritchenko and Saif Mohammad. 2018.
Examining gender and race bias in two hundred sentiment analysis systems.
In Proceedings of the Seventh Joint Conference

You can also read