Actionable Auditing: Investigating the Impact of Publicly Naming Biased Performance Results of Commercial AI Products

Inioluwa Deborah Raji
University of Toronto
27 King's College Cir
Toronto, Ontario, Canada, M5S 3H7
deborah.raji@mail.utoronto.ca

Joy Buolamwini
Massachusetts Institute of Technology
77 Massachusetts Ave
Cambridge, Massachusetts, 02139
joyab@mit.edu

Copyright © 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Although algorithmic auditing has emerged as a key strategy to expose systematic biases embedded in software platforms, we struggle to understand the real-world impact of these audits, as scholarship on the impact of algorithmic audits on increasing algorithmic fairness and transparency in commercial systems is nascent. To analyze the impact of publicly naming and disclosing performance results of biased AI systems, we investigate the commercial impact of Gender Shades, the first algorithmic audit of gender and skin type performance disparities in commercial facial analysis models. This paper 1) outlines the audit design and structured disclosure procedure used in the Gender Shades study, 2) presents new performance metrics from targeted companies IBM, Microsoft and Megvii (Face++) on the Pilot Parliaments Benchmark (PPB) as of August 2018, 3) provides performance results on PPB by non-target companies Amazon and Kairos and, 4) explores differences in company responses as shared through corporate communications that contextualize differences in performance on PPB. Within 7 months of the original audit, we find that all three targets released new API versions. All targets reduced accuracy disparities between males and females and darker and lighter-skinned subgroups, with the most significant update occurring for the darker-skinned female subgroup, which underwent a 17.7% - 30.4% reduction in error between audit periods. Minimizing these disparities led to a 5.72% to 8.3% reduction in overall error on the Pilot Parliaments Benchmark (PPB) for target corporation APIs. The overall performance of non-targets Amazon and Kairos lags significantly behind that of the targets, with error rates of 8.66% and 6.60% overall, and error rates of 31.37% and 22.50% for the darker female subgroup, respectively.

Introduction

An algorithmic audit involves the collection and analysis of outcomes from a fixed algorithm or defined model within a system. Through the stimulation of a mock user population, these audits can uncover problematic patterns in models of interest. Targeted public algorithmic audits provide one mechanism to incentivize corporations to address the algorithmic bias present in data-centric technologies that continue to play an integral role in daily life, from governing access to information and economic opportunities to influencing personal freedoms (Julia Angwin and Kirchner 2016; Jakub Mikians 2012; Aniko Hannak and Wilson 2017; Edelman and Luca 2014).

However, researchers who engage in algorithmic audits risk breaching company Terms of Service, the Computer Fraud and Abuse Act (CFAA) or ACM ethical practices, as well as facing uncertainty around hostile corporate reactions. Given these risks, much algorithmic audit work has focused on goals to gauge user awareness of algorithmic bias (Eslami et al. 2017; Kevin Hamilton and Sandvig 2015) or evaluate the impact of bias on user behaviour and outcomes (Gary Soeller and Wilson 2016; Juhi Kulshrestha 2017; Edelman and Luca 2014), instead of directly challenging companies to change commercial systems. Research on the real-world impact of an algorithmic audit is thus needed to inform strategies on how to engage corporations productively in addressing algorithmic bias. The Buolamwini & Gebru Gender Shades study (Buolamwini and Gebru 2018), which investigated the accuracy of commercial gender classification services, provides an apt case study to explore audit design and disclosure practices that engage companies in making concrete process and model improvements to address classification bias in their offerings.

Related Work

Corporations and Algorithmic Accountability

As more artificial intelligence (AI) services become mainstream and harmful societal impacts become increasingly apparent (Julia Angwin and Kirchner 2016; Jakub Mikians 2012), there is a growing need to hold AI providers accountable. However, outside of the capitalist motivations of economic benefit, employee satisfaction, competitive advantage, social pressure, and recent legal developments like the EU General Data Protection Regulation, corporations still have little incentive to disclose details about their systems (Diakopoulos 2016; Burrell 2016; Sandra Wachter and Russell 2018). Thus external pressure remains a necessary approach to increase transparency and address harmful model bias.
If we take the framing of algorithmic bias as a software defect or bug that poses a threat to user dignity or access to opportunity (Tramèr et al. 2015), then we can anticipate parallel challenges to those faced in the field of information security, where practitioners regularly address and communicate threats to user safety. The National Computer Emergency Readiness Team (CERT) promotes a strict procedure named "Coordinated Vulnerability Disclosure (CVD)" to inform corporations of externally identified cyber security threats in a way that is non-antagonistic, respectful of general public awareness and careful to guard against corporate inaction (Allen D. Householder and King 2017). CVDs outline the urgent steps of discovery, reporting, validation and triage, remediation, and then subsequent public awareness campaigns and vendor re-deployment of a system identified internally or externally to pose a serious cyber threat. A similar "Coordinated Bias Disclosure" procedure could support action-driven corporate disclosure practices to address algorithmic bias as well.

Black Box Algorithmic Audits

For commercial systems, the audit itself is characterized as a "black box audit", where the direct or indirect influence of input features on classifier accuracy or outcomes is inferred through the evaluation of a curated benchmark (Philip Adler and Venkatasubramanian 2018; Riccardo Guidotti and Giannotti 2018). Benchmark test sets like FERET (P.J. Phillips and Rauss 2000) and the Facial Recognition Vendor Test (FRVT) from the National Institute of Standards and Technology (NIST) (Mei Ngan and Grother 2015) are of particular interest, as examples specific to establishing policy and legal restrictions around mitigating bias in facial recognition technologies.

In several implemented audit studies, vendor names are kept anonymous (Brendan F. Klare and Jain 2012) or the scope is scaled down to a single named target (Snow 2018; Le Chen and Wilson 2015; Juhi Kulshrestha 2017). The former fails to harness public pressure and the latter fails to capture the competitive dynamics of a multi-target audit, thus reducing the impetus for corporate reactions to those studies.

Gender Shades

The Gender Shades study differs from these previous cases as an external and multi-target black box audit of commercial machine learning Application Program Interfaces (APIs), scoped to evaluating the facial analysis task of binary gender classification (Buolamwini and Gebru 2018). The contribution of the work is two-fold, serving to introduce the gender and skin type balanced Pilot Parliaments Benchmark (PPB) and also execute an intersectional demographic and phenotypic evaluation of face-based gender classification in commercial APIs. The original authors consider each API's model performance given the test image attributes of gender, reduced to the binary categories of male or female, as well as binary Fitzpatrick score, a numerical classification schema for human skin type evaluated by a dermatologist, and grouped into classes of lighter and darker skin types. The audit then evaluates model performance across these unitary subgroups (i.e. female or darker) in addition to intersectional subgroups (i.e. darker female), revealing large disparities in subgroup classification accuracy, particularly across intersectional groups like darker female, darker male, lighter female and lighter male.

Analysis of Gender Shades Audit

Gender Shades Coordinated Bias Disclosure

In the Gender Shades study, the audit entity is independent of the target corporations and their competitors and serves as a neutral 'third-party' auditor, similar to the expectation for corporate accounting auditing committees (Allen D. Householder and King 2017).

This neutrality enabled the auditors to approach audited corporations systematically, following the procedure sequentially outlined below, which closely mirrors key recommendations for coordinated vulnerability disclosures (CVDs) in information security (Allen D. Householder and King 2017).

1. Documented Vulnerability Discovery - A stated objective of the Gender Shades study is to document audit outcomes from May 2017 to expose performance vulnerabilities in commercial facial recognition products (Buolamwini and Gebru 2018).

2. Defined Corporate Response Period with Limited Anonymized Release to Audit Targets - The Gender Shades paper (without explicit company references) was sent to Microsoft, IBM and Face++ on December 19th 2017 (Buolamwini 2017), giving companies prior notice to react before a communicated public release date, while maintaining the strict privacy of other involved stakeholders.

3. Unrestricted Public Release Including Named Audit Targets - On February 9th, 2018, "Facial Recognition Is Accurate, if You're a White Guy", an article by Steve Lohr in the technology section of The New York Times, is among the first public mentions of the study (Buolamwini 2017; 2018), and links to the published version of the study in Proceedings of Machine Learning Research, with explicit company references. This follows CVD procedures around alerting the public of corporate vulnerabilities with explicit culprit references, following a particular grace period in which companies are allowed to react before wider release. The Gender Shades public launch, accompanied by a video, summary visualizations and a website, further prompts public, academic and corporate audiences - technical and non-technical alike - to be exposed to the issue and respond. Finally, the paper was presented on February 24th 2018 with explicit company references at the FAT* conference to an audience of academics, industry stakeholders and policymakers (Buolamwini 2017).

4. Joint Public Release of Communications and Updates from Corporate Response Period - Even if the issue is resolved, CVD outlines a process to still advance with the public release while also reporting corporate communications and updates from the response period. In the case
of Gender Shades, the co-author presented and linked to IBM's updated API results at the time of the public release of the initial study (Buolamwini 2017).

Figure 1: Gender Shades audit process overview (Buolamwini and Gebru 2018).

Gender Shades Audit Design

The Gender Shades paper contributed to the computer vision community the Pilot Parliaments Benchmark (PPB). PPB is an example of what we call a "user-representative" test set, meaning the benchmark does not have proportional demographic distribution of the intended user population but representative inclusion of the diversity of that group. With equal representation of each distinct subgroup of the user population, regardless of the percentage at which that population is present in the sample of users, we can thus evaluate for equitable model performance across subgroups. Similar to the proposed error profiling under the Unwarranted Associations framework (Tramèr et al. 2015), algorithmic unfairness is evaluated by comparing classification accuracy across identified subgroups in the user-representative test set (see Figure 1).

Another key element of the audit design is that the audit targets are commercial machine learning Application Program Interfaces (APIs). The auditors thus mirror the behaviour of a single developer user for a commercial API platform that supports the creation of applications for the end user. Therefore, the actor being puppeted (the developer) has control of the application being used by the end-user and is at risk of propagating bias unto the end-users of their subsequent products. This is analogous to the "sock puppet" algorithmic audit model (Christian Sandvig and Langbort 2014) in that we pose as a puppet user of the API platform and interact with the API in the way a developer would. However, as this is a puppet that influences the end user experience, we label them "carrier puppets", acknowledging that, rather than evaluating a final state, we are auditing the bias detected in an intermediary step that can carry bias forward towards end users (see Figure 2).

Figure 2: "Carrier puppet" audit framework overview.

Methodology

The design of this study is closely modeled after that of Gender Shades. Target corporations were selected from the original study, which cites considerations such as the platform market share, the availability of desired API functions and overall market influence as driving factors in the decision to select Microsoft, IBM and Face++ (Buolamwini and Gebru 2018). Non-target corporation Kairos was selected because of the company's public engagement with the Gender Shades study specifically and the topic of intersectional accuracy in general after the audit release (Brackeen 2018a; 2018b). Non-target corporation Amazon was selected following the revelation of the active use and promotion of its facial recognition technology in law enforcement (Cagle and Ozer 2018).

The main factor in the analysis, the follow up audit, closely follows the procedure for the initial Gender Shades study. We calculated the subgroup classification error, as defined below, to evaluate disparities in model performance across identified subgroups, enabling direct comparison between follow-up results and initial audit results.

Subgroup Classification Error. Given data set D = (X, Y, C), a given sample input d_i from D belongs to a subgroup S, which is a subset of D defined by the protected attributes X. We define the black box classifier g : X, Y ↦ c, which returns a prediction c from the attributes x_i and y_i of a given sample input d_i from D. If a prediction is not produced (i.e. face not detected), we omit the result from our calculations.

We thus define err(S), the error of the classifier g for members d_i of subgroup S, as follows:

err(S) = 1 − P(g(x_i, y_i) = C_i | d_i ∈ S)

To contextualize audit results and examine language themes used post-audit, we considered written communications for all mentioned corporations. This includes exclusively corporate blog posts and official press releases, with the exception of media-published corporate statements, such as an op-ed by the Kairos CEO published in TechCrunch (Brackeen 2018b). Any past and present website copy or Software Developer Kit documentation was also considered when determining alignment with identified themes, though this did not factor greatly into the results.

Performance Results

With the results of the follow up audit and the original Gender Shades outcomes, we first analyze the differences between the performance of the targeted platforms in the original study and compare it to current target API performance. Next, we look at non-target corporations Kairos and Amazon, which were not included in the Gender Shades study, and compare their current performance to that of the targeted platforms.
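Before turning to the tables, the sketch below shows one possible way to compute the subgroup classification error defined in the Methodology from per-image audit results. It is a minimal illustration, not the authors' released code: the dictionary keys, subgroup labels and toy rows are our own assumptions, and in practice each row would be produced by querying a commercial API with a single PPB image in the "carrier puppet" developer role described above.

from collections import defaultdict

def subgroup_errors(results):
    # results: iterable of dicts with keys 'subgroup', 'true', 'predicted';
    # 'predicted' is None when no face was detected, and such rows are
    # omitted, mirroring the err(S) definition above.
    tallies = defaultdict(lambda: [0, 0])  # subgroup -> [misclassified, detected]
    for row in results:
        if row["predicted"] is None:
            continue
        tallies[row["subgroup"]][0] += int(row["predicted"] != row["true"])
        tallies[row["subgroup"]][1] += 1
    return {s: 100.0 * wrong / seen for s, (wrong, seen) in tallies.items()}

# Toy rows for illustration only; real rows would hold one PPB image's
# ground truth and the binary gender label returned by the audited API.
toy_results = [
    {"subgroup": "DF", "true": "female", "predicted": "male"},
    {"subgroup": "DF", "true": "female", "predicted": "female"},
    {"subgroup": "LM", "true": "male", "predicted": "male"},
    {"subgroup": "LM", "true": "male", "predicted": None},  # face not detected
]

print(subgroup_errors(toy_results))  # {'DF': 50.0, 'LM': 0.0}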
Table 1: Overall Error on Pilot Parliaments Benchmark, August 2018 (%)

Company                    All     Females   Males   Darker   Lighter   DF      DM      LF      LM
Target Corporations
  Face++                   1.6     2.5       0.9     2.6      0.7       4.1     1.3     1.0     0.5
  MSFT                     0.48    0.90      0.15    0.89     0.15      1.52    0.33    0.34    0.00
  IBM                      4.41    9.36      0.43    8.16     1.17      16.97   0.63    2.37    0.26
Non-Target Corporations
  Amazon                   8.66    18.73     0.57    15.11    3.08      31.37   1.26    7.12    0.00
  Kairos                   6.60    14.10     0.60    11.10    2.80      22.50   1.30    6.40    0.00

Table 2: Overall Error Difference Between August 2018 and May 2017 PPB Audit (%)

Company                    All     Females   Males   Darker   Lighter   DF      DM      LF      LM
  Face++                   -8.3    -18.7     0.2     -13.9    -3.9      -30.4   0.6     -8.5    -0.3
  MSFT                     -5.72   -9.70     -2.45   -12.01   -0.45     -19.28  -5.67   -1.06   0.00
  IBM                      -7.69   -10.74    -5.17   -14.24   -1.93     -17.73  -11.37  -4.43   -0.04

The reported follow up audit was done on August 21, 2018, for all corporations in both cases. Summary Tables 1 and 2 show percent error, computed as misclassified faces out of all processed faces, with undetected faces being discounted. Calculation details are outlined in the definition for Subgroup Classification Error, and error differences are calculated by taking August 2018 error (%) and subtracting May 2017 error (%). DF is defined as the darker female subgroup, DM is darker male, LM is lighter male and LF is lighter female.

Target Corporation Key Findings

The target corporations from the Gender Shades study all released new API versions, with reductions in overall error on the Pilot Parliaments Benchmark of 5.7%, 8.3% and 7.7% respectively for Microsoft, Face++ and IBM. Face++ took the longest to release its new API, at 190 days (Face++ 2018), while IBM was the first to release a new API version, in 66 days (Puri 2018), with Microsoft updating its product the day before Face++, in 189 days (Roach 2018). All targeted classifiers in post-audit releases have their largest error rate for the darker female subgroup and the lowest error rate for the lighter male subgroup. This is consistent with 2017 audit trends, barring Face++, which had the lowest error rate for darker males in May 2017.

The following is a summary of substantial performance changes across demographic and phenotypic classes, as well as their intersections, after API updates:

• Greater reduction in error for female faces (9.7% - 18.7% reduction in subgroup error) than male faces (0.2% - 5.17% reduction in error).
• Greater reduction in error for darker faces (12.01% - 14.24% reduction in error) than for lighter faces (0.45% - 3.9% reduction in error).
• Lighter males are the least improved subgroup (0% - 0.3% reduction in error).
• Darker females are the most improved subgroup (17.7% - 30.4% reduction in error).
• If we define the error gap to be the error difference between the worst and best performing subgroups for a given API product, IBM reduced the error gap from 34.4% to 16.71% from May 2017 to August 2018. In the same period, Microsoft closed a 20.8% error gap to a 1.52% error difference, and Face++ went from a 33.7% error gap to a 3.6% error gap (a worked calculation is sketched after the non-target findings below).

Non-Target Corporation Key Findings

Non-target corporations Kairos and Amazon have overall error rates of 6.60% and 8.66% respectively. These are the worst current performances of the companies analyzed in the follow up audit. Nonetheless, when comparing to the previous May 2017 performance of the target corporations, the Kairos and Amazon error rates are lower than the former error rates of IBM (12.1%) and Face++ (9.9%), and only slightly higher than Microsoft's performance (6.2%) from the initial study. Below is a summary of key findings for non-target corporations:

• Kairos and Amazon perform better on male faces than female faces, a trend also observed in (Buolamwini and Gebru 2018; Mei Ngan and Grother 2015).
• Kairos and Amazon perform better on lighter faces than darker faces, a trend also observed in (Buolamwini and Gebru 2018; Jonathon Phillips and O'Toole 2011).
• Kairos (22.5% error) and Amazon (31.4% error) have the current worst performance for the darker female subgroup.
• Kairos and Amazon (both 0.0% error) have the current best performance for the lighter male subgroup.
• Kairos has an error gap of 22.5% between its highest and lowest accuracy intersectional subgroups, while Amazon has an error gap of 31.37%.
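To make the reported figures easy to trace, the short sketch below recomputes IBM's error gap and its August 2018 minus May 2017 error differences. The helper names are our own and this is not the authors' analysis code; the August 2018 values come from Table 1, and the May 2017 values are derived by subtracting the Table 2 differences from Table 1.

def error_gap(subgroup_errors):
    # Gap between the worst- and best-performing intersectional subgroups (in %).
    return round(max(subgroup_errors.values()) - min(subgroup_errors.values()), 2)

def error_difference(current, previous):
    # Per-subgroup change in error: August 2018 error minus May 2017 error (in %).
    return {k: round(current[k] - previous[k], 2) for k in current}

# IBM intersectional subgroup errors (%): August 2018 from Table 1; May 2017
# reconstructed as Table 1 minus Table 2.
ibm_aug_2018 = {"DF": 16.97, "DM": 0.63, "LF": 2.37, "LM": 0.26}
ibm_may_2017 = {"DF": 34.70, "DM": 12.00, "LF": 6.80, "LM": 0.30}

print(error_gap(ibm_may_2017))                       # 34.4  (pre-audit gap)
print(error_gap(ibm_aug_2018))                       # 16.71 (post-audit gap)
print(error_difference(ibm_aug_2018, ibm_may_2017))  # DF: -17.73, DM: -11.37, LF: -4.43, LM: -0.04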
Discussion

Given a clear understanding of the Gender Shades study procedure and the follow up audit metrics, we are able to reflect on corporate reactions in the context of these results, and evaluate the progress made by this audit in influencing corporate action to address concerns around classification bias.

Reduced Performance Disparities Between Intersectional User Subgroups

Building on Crenshaw's 1989 research on the limitations of only considering single-axis protected groups in anti-discrimination legislation (Crenshaw 1989), a major focus of the Gender Shades study is championing the relevance of intersectional analysis in the domain of human-centered AI systems. IBM and Microsoft, who both explicitly reference Gender Shades in product update releases, claim intersectional model improvements on their gender classifiers (Puri 2018; Roach 2018). These claims are substantiated by the results of the August 2018 follow up audit, which reveals universal improvement across intersectional subgroups for all targeted corporations. We also see that the updated releases of target corporations mostly impact the least accurate subgroup (in this case, darker females). Although post-audit performance for this subgroup is still the worst relative to other intersectional subgroups across all platforms, the gap between this subgroup and the best performing subgroup - consistently lighter males - reduces significantly after corporate API update releases.

Additionally, with a 5.72% to 8.3% reduction in overall error on the Pilot Parliaments Benchmark (PPB) for target corporations, we demonstrate that minimizing subgroup performance disparities does not jeopardize overall model performance but rather improves it, highlighting the alignment of fairness objectives with the commercial incentive of improved qualitative and quantitative accuracy. This key result highlights an important critique of the current model evaluation practice of using a subset of the model training data for testing, by demonstrating the functional value of testing the model on a separately defined "user representative" test set.

Corporate Prioritization

Although the original study (Buolamwini and Gebru 2018) expresses the concern that potential physical limitations of the image quality and illumination of darker skinned subjects may be contributing to the higher error rate for that group, we can see through the 2018 performance results that these challenges can be overcome. Within 7 months, all targeted corporations were able to significantly reduce error gaps in the intersectional performance of their commercial APIs, revealing that, if prioritized, the disparities in performance between intersectional subgroups can be addressed and minimized in a reasonable amount of time.

Several factors may have contributed to this increased prioritization. The unbiased involvement of multiple companies may have served to put capitalist pressure on each corporation to address model limitations so as not to be left behind or called out. Similarly, increased corporate and consumer awareness of the issue of algorithmic discrimination, and classification bias in particular, may have incited urgency in pursuing a product update. This builds on literature promoting fairness through user awareness and education (Kevin Hamilton and Eslami 2014) - aware corporations can also drastically alter the processes needed to reduce bias in algorithmic systems.

Emphasis on Data-driven Solutions

These particular API updates appear to be data-driven. IBM publishes the statement "AI systems are only as effective as the data they're trained on", and both Microsoft and Kairos publish similar statements (Puri 2018; Roach 2018; Brackeen 2018a), implying heavily the claim that data collection and diversification efforts play an important role in improving model performance across intersectional subgroups. This aligns with existing research (Irene Chen and Sontag 2018) advocating for increasing the diversity of data as a primary approach to improve fairness outcomes without compromising on overall accuracy. Nevertheless, the influence of algorithmic changes, training methodology or specific details about the exact composition of new training datasets remains unclear in this commercial context, thus underscoring the importance of work on open source models and datasets that can be more thoroughly investigated.

Non-technical Advancements

In addition to technical updates, we observe organizational and systemic changes within target corporations following the Gender Shades study. IBM published its "Principles for Trust and Transparency" on May 30th 2018 (IBM 2018), while Microsoft created an "AI and Ethics in Engineering and Research (AETHER) Committee, investing in strategies and tools for detecting and addressing bias in AI systems" on March 29th, 2018 (Smith 2018). Both companies also cite their involvement in the Partnership on AI, an AI technology industry consortium, as a means of future ongoing support and corporate accountability (Puri 2018; Smith 2018).

Implicitly identifying the role of the API as a "carrier" of bias to end users, all companies also mention the importance of developer user accountability, with Microsoft and IBM speaking specifically to user engagement strategies and educational material on fairness considerations for their developer or enterprise clients (Puri 2018; Roach 2018).

Only Microsoft strongly mentions the solution of Diversity & Inclusion considerations in hiring as an avenue to address these issues (Smith 2018). The founder of Kairos specifically claims his minority identity as personal motivation for participation in this issue, stating "I have a personal connection to the technology,...This resonates with me very personally as a minority founder in the face recognition space" (Brackeen 2018a; 2018b). A cultural shift in the facial recognition industry could thus attract and retain those paying increased attention to the issue due to personal resonance.

Differences between Target and Non-Target Companies

Although prior performance for non-target companies is unknown, and no conclusions can be made about the rate of
product improvements, Kairos and Amazon both perform more closely to the target corporations' pre-audit performance than their post-audit performance.

Amazon, a large company with an employee count and revenue comparable to the target corporations IBM and Microsoft, seems optimistic about the use of facial recognition technology despite current limitations. In a response to a targeted ACLU audit of its facial recognition API (Snow 2018), it states explicitly, "Our quality of life would be much worse today if we outlawed new technology because some people could choose to abuse the technology". On the other hand, Kairos, a small privately held company not explicitly referenced in the Gender Shades paper and subsequent press discussions, released a public response to the initial Gender Shades study and seemed engaged in taking the threat of algorithmic bias quite seriously (Buolamwini and Gebru 2018).

Despite the varying corporate stances and levels of public engagement, the targeted audit in Gender Shades was much more effective in reducing disparities in target products than in non-targeted systems.

Regulatory Communications

We additionally encounter scenarios where civil society organizations and government entities not explicitly referenced in the Gender Shades paper and subsequent press discussions publicly reference the results of the audit in letters, publications and calls to action. For instance, the Gender Shades study is cited in an ACLU letter to Amazon from shareholders requesting its retreat from selling and advertising facial recognition technology for law enforcement clients (Arjuna Capital 2018). Similar calls for action to Axon AI by several civil rights groups, as well as letters from Senator Kamala D. Harris to the EEOC, FBI and FTC regarding the use of facial recognition in law enforcement, also directly reference the work (Coldewey 2017). Kairos, IBM and Microsoft all agree facial analysis technology should be restricted in certain contexts and demonstrate support for government regulation of facial recognition technology (IBM 2018; Smith 2018; Brackeen 2018b). In fact, Microsoft goes so far as to explicitly support public regulation (Smith 2018). Thus, in addition to corporate reactions, future work might explore the engagement of government entities and other stakeholders beyond corporate entities in response to public algorithmic audits.

Design Considerations

Several design considerations also present opportunities for further investigation. As mentioned in Gender Shades, a consideration of confidence scores on these models is necessary to get a complete view on defining real-world performance (Buolamwini and Gebru 2018). For instance, IBM's self-reported performance on a replicated version of the Gender Shades audit claims a 3.46% overall error rate on their lowest accuracy group of darker females (Puri 2018) - this result varies greatly from the 16.97% error rate we observe in our follow up audit. Upon further inspection, we see that they only include results above a 99% confidence threshold, whereas Gender Shades takes the binary label with the higher confidence score to be the predicted gender. These examples demonstrate the need to consider variations in results due to prediction confidence thresholding in future audit designs.

Another consideration is that the Gender Shades publication includes all the required information to replicate the benchmark and test models on PPB images (Buolamwini and Gebru 2018). It is possible that well performing models do not truly perform well on other diverse datasets outside of PPB and have been overfit to optimize their performance on this particular benchmark. Future work involves evaluation of these systems on a separate balanced dataset of similar demographic attributes to PPB, or making use of metrics such as balanced error to account for class imbalances in existing benchmarks.

Additionally, although Face++ appears to be the least engaged or responsive company, a limitation of the survey to English blog posts and American mainstream media quotes (Face++ 2018) definitively excludes Chinese media outlets that would reveal more about the company's response to the audit.

Conclusion

Therefore, we can see from this follow-up study that all target companies reduced classification bias in commercial APIs following the Gender Shades audit. By highlighting the issue of classification performance disparities and amplifying public awareness, the study was able to motivate companies to prioritize the issue and yield significant improvements within 7 months. When observed in the context of non-target corporation performance, however, we see that significant subgroup performance disparities persist. Nevertheless, corporations outside the scope of the study continue to speak up about the issue of classification bias (Brackeen 2018b). Even those less implicated are now facing increased scrutiny by civil groups, governments and consumers as a result of increased public attention to the issue (Snow 2018). Future work includes the further development of audit frameworks to understand and address corporate engagement and awareness, improve the effectiveness of algorithmic audit design strategies and formalize external audit disclosure practices.

Furthermore, while algorithmic fairness may be approximated through reductions in subgroup error rates or other performance metrics, algorithmic justice necessitates a transformation in the development, deployment, oversight, and regulation of facial analysis technology. Consequently, the potential for weaponization and abuse of facial analysis technologies cannot be ignored, nor the threats to privacy or breaches of civil liberties diminished, even as accuracy disparities decrease. More extensive explorations of policy, corporate practice and ethical guidelines are thus needed to ensure vulnerable and marginalized populations are protected and not harmed as this technology evolves.
References

Allen D. Householder, Garret Wassermann, A. M., and King, C. 2017. The CERT guide to coordinated vulnerability disclosure. Government technical report, Carnegie Mellon University.

Aniko Hannak, Claudia Wagner, D. G. A. M. M. S., and Wilson, C. 2017. Bias in online freelance marketplaces: Evidence from TaskRabbit and Fiverr. In 2017 ACM Conference, 1914–1933. New York, NY, USA: ACM.

Arjuna Capital, As You Sow, et al. 2018. Letter from shareholders to Amazon CEO Jeff Bezos regarding Rekognition.

Brackeen, B. 2018a. Face off: Confronting bias in face recognition AI.

Brackeen, B. 2018b. Facial recognition software is not ready for use by law enforcement.

Brendan F. Klare, Mark J. Burge, J. C. K. R. W. V. B., and Jain, A. K. 2012. Face recognition performance: Role of demographic information. In IEEE Transactions on Information Forensics and Security, volume 7, 1789–1801. New York, NY, USA: IEEE.

Buolamwini, J., and Gebru, T. 2018. Gender Shades: Intersectional accuracy disparities in commercial gender classification. In Proceedings of Machine Learning Research. Conference on Fairness, Accountability, and Transparency.

Buolamwini, J. 2017. Gender Shades.

Buolamwini, J. 2018. When the robot doesn't see dark skin.

Burrell, J. 2016. How the machine thinks: Understanding opacity in machine learning algorithms. Big Data & Society.

Cagle, M., and Ozer, N. 2018. Amazon teams up with government to deploy dangerous new facial recognition technology.

Christian Sandvig, Kevin Hamilton, K. K., and Langbort, C. 2014. Auditing algorithms: Research methods for detecting discrimination on internet platforms. Data and Discrimination: Converting Critical Concerns into Productive Inquiry, a preconference at the 64th Annual Meeting of the International Communication Association.

Coldewey, D. 2017. Sen. Harris tells federal agencies to get serious about facial recognition risks.

Crenshaw, K. 1989. Demarginalizing the intersection of race and sex: A Black feminist critique of antidiscrimination doctrine, feminist theory and antiracist politics. University of Chicago Legal Forum 1989(8).

Diakopoulos, N. 2016. Accountability in algorithmic decision making. Communications of the ACM 59(2):56–62.

Edelman, B., and Luca, M. 2014. Digital discrimination: The case of Airbnb.com. SSRN Electronic Journal.

Eslami, M.; Vaccaro, K.; Karahalios, K.; and Hamilton, K. 2017. Be careful, things can be worse than they appear: Understanding biased algorithms and users' behavior around them in rating platforms. In ICWSM.

Face++. 2018. Notice: Newer version of Face Detect API.

Gary Soeller, Karrie Karahalios, C. S., and Wilson, C. 2016. MapWatch: Detecting and monitoring international border personalization on online maps. J. ACM.

IBM. 2018. IBM principles for trust and transparency.

Irene Chen, F. D. J., and Sontag, D. 2018. Why is my classifier discriminatory? In arXiv preprint. arXiv.

Jakub Mikians, László Gyarmati, V. E. a. N. L. 2012. Detecting price and search discrimination on the internet. In Proceedings of the 11th ACM Workshop on Hot Topics in Networks, HotNets-XI. New York, NY, USA: ACM.

Jonathon Phillips, Fang Jiang, A. N. J. A., and O'Toole, A. J. 2011. An other-race effect for face recognition algorithms. In ACM Transactions on Applied Perception (TAP), volume 8. ACM Press.

Juhi Kulshrestha, Motahhare Eslami, J. M. M. B. Z. S. G. K. P. G. K. K. 2017. Quantifying search bias: Investigating sources of bias for political searches in social media. In Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing, 417–432. New York, NY, USA: ACM.

Julia Angwin, Jeff Larson, S. M., and Kirchner, L. 2016. Machine bias.

Kevin Hamilton, Karrie Karahalios, C. S., and Eslami, M. 2014. A path to understanding the effects of algorithm awareness. In CHI '14 Extended Abstracts on Human Factors in Computing Systems (CHI EA '14), 631–642. New York, NY, USA: ACM.

Kevin Hamilton, Motahhare Eslami, A. A. K. K., and Sandvig, C. 2015. "I always assumed that I wasn't really that close to [her]": Reasoning about invisible algorithms in the news feed. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. New York, NY, USA: ACM.

Le Chen, A. M., and Wilson, C. 2015. Peeking beneath the hood of Uber. In Proceedings of 2015 ACM Conference, 495–508. New York, NY, USA: ACM.

Mei Ngan, M. N., and Grother, P. 2015. Face Recognition Vendor Test (FRVT) performance of automated gender classification algorithms. Government technical report, US Department of Commerce, National Institute of Standards and Technology.

Philip Adler, Casey Falk, S. A. F. T. N. G. R. C. S. B. S., and Venkatasubramanian, S. 2018. Auditing black-box models for indirect influence. Knowledge and Information Systems 54(1).

P.J. Phillips, Hyeonjoon Moon, S. R., and Rauss, P. 2000. The FERET evaluation methodology for face-recognition algorithms. In IEEE Transactions on Pattern Analysis and Machine Intelligence, volume 22, 1090–1104. IEEE.

Puri, R. 2018. Mitigating bias in AI models.

Riccardo Guidotti, Anna Monreale, F. T. D. P., and Giannotti, F. 2018. A survey of methods for explaining black box models. ACM Computing Surveys 51(5).

Roach, J. 2018. Microsoft improves facial recognition technology to perform well across all skin tones, genders.

Sandra Wachter, B. M., and Russell, C. 2018. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harvard Journal of Law & Technology 31(2).

Smith, B. 2018. Facial recognition technology: The need for public regulation and corporate responsibility.

Snow, J. 2018. Amazon's face recognition falsely matched 28 members of Congress with mugshots.

Tramèr, F.; Atlidakis, V.; Geambasu, R.; Hsu, D. J.; Hubaux, J.-P.; Humbert, M.; Juels, A.; and Lin, H. 2015. Discovering unwarranted associations in data-driven applications with the FairTest testing toolkit. CoRR abs/1510.02377.