Optimizing Black-box Metrics with Iterative Example Weighting

Page created by Ethel Guzman

Shopping

English

Like
Share
Embed
Fullscreen
Slides
Download HTML
Download PDF
Abuse

←

→

Page content transcription

If your browser does not render page correctly, please read the page content below

Optimizing Black-box Metrics with Iterative Example Weighting

Gaurush Hiranandani† 1 Jatin Mathur 1 Harikrishna Narasimhan 2 Mahdi Milani Fard 2 Oluwasanmi Koyejo 3 1

                          Abstract                                   where the goal is to optimize an unknown classification
                                                                     metric on a large (possibly noisy) training data, given access
      We consider learning to optimize a classification
                                                                     to evaluations of the metric on a small, clean validation
      metric defined by a black-box function of the con-
                                                                     sample (Jiang et al., 2020). Our high-level approach to
      fusion matrix. Such black-box learning settings
                                                                     these learning tasks is to adaptively assigns weights to the
      are ubiquitous, for example, when the learner only
                                                                     training examples, so that the resulting weighted training
      has query access to the metric of interest, or in
                                                                     objective closely approximates the black-box metric on the
      noisy-label and domain adaptation applications
                                                                     validation sample. We then construct a classifier by using the
      where the learner must evaluate the metric via
                                                                     example weights to post-shift a class-probability estimator
      performance evaluation using a small validation
                                                                     pre-trained on the training set. This results in an efficient,
      sample. Our approach is to adaptively learn exam-
                                                                     iterative approach that does not require any re-training.
      ple weights on the training dataset such that the
      resulting weighted objective best approximates                 Indeed, example weighting strategies have been widely used
      the metric on the validation sample. We show                   to both optimize metrics and to correct for distribution shift,
      how to model and estimate the example weights                  but prior works either handle specialized forms of metric
      and use them to iteratively post-shift a pre-trained           or data noise (Sugiyama et al., 2008; Natarajan et al., 2013;
      class probability estimator to construct a classifier.         Patrini et al., 2017), formulate the example-weight learn-
      We also analyze the resulting procedure’s statis-              ing task as a difficult non-convex problem that is hard to
      tical properties. Experiments on various label                 analyze (Ren et al., 2018; Zhao et al., 2019), or employ an
      noise, domain shift, and fair classification setups            expensive surrogate re-weighting strategy that comes with
      confirm that our proposal compares favorably to                limited statistical guarantees (Jiang et al., 2020). In contrast,
      the state-of-the-art baselines for each application.           we propose a simple and effective approach to optimize a
                                                                     general black-box metric (that is a function of the confusion
                                                                     matrix) and provide a rigorous statistical analysis.
1. Introduction                                                      A key element of our approach is eliciting the weight co-
In many real-world machine learning tasks, the evaluation            efficients by probing the black-box metric at few select
metric one seeks to optimize is not explicitly available in          classifiers and solving a system of linear equations match-
closed-form. This is true for metrics that are evaluated             ing the weighted training errors to the validation metric. We
through live experiments or by querying human users (Tam-            choose the “probing” classifiers so that the linear system is
burrelli & Margara, 2014; Hiranandani et al., 2019a), or that        well-conditioned, for which we provide both theoretically-
require access to private or legally protected data (Awasthi         grounded options and practically efficient variants. This
et al., 2021), and hence cannot be written as an explicit            weight elicitation procedure is then used as a subroutine to
training objective. This is also the case when the learner           iteratively construct the final plug-in classifier.
only has access to data with skewed training distribution or         Contributions: (i) We provide a method for eliciting ex-
labels with heteroscedastic noise (Huang et al., 2019; Jiang         ample weights for linear black-box metrics (Section 3). (ii)
et al., 2020), and hence cannot directly optimize the metric         We use this procedure to iteratively learn a plug-in classifier
on the training set despite knowing its mathematical form.           for general black-box metrics (Section 4). (iii) We provide
These problems can be framed as black-box learning tasks,            theoretical guarantees for metrics that are concave functions
                                                                     of the confusion matrix under distributional assumptions
  †
    Part of this work was done when G.H. was an intern at Google     (Section 5). (iv) We experimentally show that our approach
Research, USA. 1 University of Illinois at Urbana-Champaign, Illi-   is competitive with (or better than) the state-of-the-art meth-
nois, USA 2 Google Research, USA 3 Google Research, Accra. Cor-
respondence to: Gaurush Hiranandani .         ods for tackling label noise in CIFAR-10 (Krizhevsky et al.,
                                                                     2009) and domain shift in Adience (Eidinger et al., 2014),
Proceedings of the 38 th International Conference on Machine         and optimizing with proxy labels and a black-box fairness
Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Optimizing Black-box Metrics with Iterative Example Weighting

metric on Adult (Dua & Graff, 2017) (Section 7).                            distribution D. We will refer to the sample S tr as the “train-
                                                                            ing” sample, and the smaller sample S val as the “validation”
Notations: ∆m denotes the (m − 1)-dimensional simplex.
                                                                            sample. We seek to solve (1) using both these samples.
[m] = {1, . . . , m} represents an index set. onehot(j) ∈
{0, 1}m returns the one-hot encoding of j ∈ [m]. The `2                     The following are some examples of noisy training distribu-
norm a vector is denoted by k · k.                                          tions in the literature:
                                                                            Example 1 (Independent label noise (ILN) (Natarajan et al.,
2. Problem Setup                                                            2013; Patrini et al., 2017)). The distribution µ draws an
We consider a standard multiclass setup with an instance                    example (x, y) from D, and randomly flips y to ye with prob-
space X ⊆ Rd and a label space Y = [m]. We wish to learn                    ability P(e y |y), independent of the instance x.
a randomized multiclass classifier h : X →∆m that for any                   Example 2 (Cluster-dependent label noise (CDLN) (Wang
input x ∈ X predicts a distribution h(x) ∈ ∆m over the                      et al., 2020a)). Suppose each x belongs to one of k disjoint
m classes. We will also consider deterministic classifiers                  clusters g(x) ∈ [k]. The distribution µ draws (x, y) from D
h : X →[m] which map an instance x to one of m classes.                     and randomly flips y to ye with probability P(e y |y, g(x)).
Evaluation Metrics. Let D denote the underlying data                        Example 3 (Instance-dependent label noise (IDLN) (Menon
distribution over X × Y. We will evaluate the performance                   et al., 2018)). µ draws (x, y) from D and randomly flips y
of a classifier h on D using an evaluation metric E D [h], with             to ye with probability P(ey |y, x), which may depend on x.
higher values indicating better performance. Our goal is to                 Example 4 (Domain shift (DS) (Sugiyama et al., 2008)). µ
learn a classifier h that maximizes this evaluation measure:                draws x  e according to a distribution Pµ (x) different from
                                                                            P (x), but draws y from the true conditional PD (y|e
                                                                              D
                                                                                                                                     x).
                                  maxh E D [h].                      (1)
                                                                            Our approach is to learn example weights on the training
We will focus on metrics E D that can be written in terms of
                                                                            sample S tr , so that the resulting weighted empirical objec-
classifier’s confusion matrix C[h] ∈ [0, 1]m×m , where the
                                                                            tive (locally, if not globally) approximates an estimate of
i, j-th entry is the probability that the true label is i and the
                                                                            the metric E D on the validation sample S val . For ease of
randomized classifier h predicts j:
                                                                            presentation, we will assume that the metrics only depend
                 D
                Cij [h] = E(x,y)∼D [1(y = i)hj (x)] .                       on the diagonal entries of the confusion matrix, i.e., Cii ’s.
                                                                            In Appendix A, we elaborate how our ideas can be extended
The performance of the classifier can then be evaluated                     to handle metrics that depend on the entire confusion matrix.
using a (possibly unknown) function ψ : [0, 1]m×m →R+
of the confusion matrix:                                                    While our approach uses randomized classifiers, in prac-
                                                                            tice one can replace them with similarly performing deter-
                               E D [h] = ψ(CD [h]).                  (2)    ministic classifiers using, e.g., the techniques of (Cotter
Several common classification metrics take                                  et al., 2019a). In what follows, we will need the empiri-
                                           P this form, in-                                                              b val [h], where
cluding typical linear metrics ψ(C) =        ij Lij Cij for
                                                                            cal confusion matrix on the validation set C
                                                                               val      1
                                                                                           P
some reward matrix L ∈ Rm×m +    , the  F-measure  ψ(C) =                   Cij [h] = nval (x,y)∈S val 1(y = i)hj (x).
                                                                             b
P       2Cii
  i
    P
      Cij +
           P
              Cji
                  (Lewis, 1995), and   the G-mean  ψ(C) =
        j           j
 Q
    i   Cii /
                P
                  j     Cij
                              1/m
                                      (Daskalaki et al., 2006).             3. Example Weighting for Linear Metrics
We consider settings where the learner has query-access to                  We first describe our example weighting strategy for linear
the evaluation metric E D , i.e., can evaluate the metric for               functions of the diagonal entries of the confusion matrix,
any given classifier h but cannot directly write out the metric             which is given by:
as an explicit mathematical objective. This happens when
                                                                                              E D [h] =         D
                                                                                                        P
the metric is truly a black-box function, i.e., ψ is unknown,                                             i βi Cii [h]              (3)
or when ψ is known, but we have access to only a noisy                      for some (unknown) weights β1 , . . . , βm . In the next sec-
version of the distribution D needed to compute the metric.                 tion, we will discuss how to use this procedure as a subrou-
Noisy Training Distribution. For learning a classifier, we                  tine to handle more complex metrics.
assume access to a large sample S tr of ntr examples drawn
from a distribution µ, which we will refer to as the “training”             3.1. Modeling Example Weights
distribution. The training distribution µ may be the same as                We define an example weighting function W : X →Rm       +
the true distribution D, or may differ from the true distribu-              which associates m correction weights [Wi (x)]m  i=1 with
tion D in the feature distribution P(x), the conditional label              each example x so that:
distribution P(y|x), or both. We also assume access to a                               hP                         i
smaller sample S val of nval examples drawn from the true                     E(x,y)∼µ    i W i (x) 1(y = i)hi (x)  ≈ E D [h], ∀h. (4)

Optimizing Black-box Metrics with Iterative Example Weighting

Table 1: Example weights W : X →Rm×m      +      for linear metric E D [h] = hL, CD [h]i               Plug-In with Elicited Weights (for Linear Metrics)
                                                                                                                                                    Pre-trained
under the noise models in Exmp. 1–4, where Wij (x) is the weight on entry Cij . In Sec.                        Probing Classifier

3–4, we consider metrics that are functions of the diagonal confusion entries alone (i.e. L                                            Weighting
                                                                                                           Metric             Elicit      W           Plug-In
and T are diagonal), and handle general metrics in Appendix A.                                                                Weight               Post-shifts

  Model            Noise Transition Matrix                  Correction Weights
    ILN            Tij = P(e y = j|y = i)                   W(x) = L T−1
               [k]
  CDLN Tij = P(e        y = j|y = i, g(x) = k)          W(x) = L (T[g(x)] )−1                                          Frank-Wolfe Update

   IDLN         Tij (x) = P(ey = j|y = i, x)             W(x) = L (T(x))−1                          Frank-Wolfe with Elicited Gradients (for General Metrics)

     DS                       -                      Wij (x) = PD (x)/Pµ (x), ∀i, j                 Figure 1: Overview of our approach.

Indeed for the noise models in Examples 1–4, there exist                3.3. Eliciting Weight Coefficients α
weighting functions W for which the above holds with
                                                                        We next discuss how to estimate the weighting function
equality. Table 1 shows the form of the weighting function
                                                                        coefficients αi` ’s from the training sample S tr and valida-
for general linear metrics.
                                                                        tion sample S val . Notice that (8) gives a relationship be-
Ideally, the weighting function W assigns m independent                 tween statistics Φµ,` ’s computed on the training distribu-
weights for each example x ∈ X . However, in practice,                  tion µ, and the evaluation metric of interest computed on
we estimate E D using a small validation sample S val ∼                 the true distribution D. Moreover, for a fixed classifier
D. So to avoid having the example weights over-fit to the               h, the left-hand side is linear in the unknown coefficients
validation sample, we restrict the flexibility of W and set it          α = [α11 , . . . , α1L , . . . , αm
                                                                                                          1            L
                                                                                                            , . . . , αm ] ∈ RLm .
to a weighted sum of L basis functions φ` : X →[0, 1]:
                            PL                                          We therefore probe the metric Ebval at Lm different classifiers
                                     ` `
                Wi (x) =       `=1 αi φ (x),              (5)           h1,1 , . . . , h1,m , . . . , hL,1 , . . . , hL,m , which results in a set
where αi` ∈ R is the coefficient associated with basis func-            of Lm linear equations of the form in (8):
tion φ` and diagonal confusion entry (i, i).                                              P           ` b tr,` 1,1
                                                                                                                      ] = Ebval [h1,1 ],
                                                                                               `,i αi Φi [h
In practice, the basis functions can be as simple as a parti-                                                            ..
tioning of the instance space into L clusters, i.e.,:                                                                     .         (9)
                                                                                       P        `tr,` L,m          val L,m
                      φ` (x) = 1(g(x) = `),                     (6)                        `,i αi
                                                                                              Φi [h
                                                                                               b           ] = E [h
                                                                                                                  b        ],

for a clustering function g : X →[L], or may define a more              where Φb tr,` [h] = 1tr  P               `
                                                                                  i          n       (x,y)∈S tr φ (x) 1(y = i)hi (x) is
complicated soft clustering using, e.g., radial basis functions         evaluated on the training sample and the metric Ebval [h] =
(Sugiyama et al., 2008) with centers x` and width σ:                    P      b val
                                                                          i βi Cii [h] is evaluated on the validation sample.
              φ` (x) = exp −kx − x` k/2σ 2 .
                                               
                                                            (7)
                                                                        More formally, let Σ    b ∈ RLm×Lm and E b ∈ RLm denote the
3.2. φ-transformed Confusions                                           left-hand and right-hand side observations in (9), i.e.,:
                                                                             b (`,i),(`0 ,i0 ) = 1           0
                                                                                                           φ` (x)1(y = i0 )h`,i
                                                                                                      X
Expanding the weighting function in (4) gives us:                           Σ                      tr                       i0 (x),
                                                                                                 n      tr
                                                                                                       (x,y)∈S
L X
X m
          αi`   E(x,y)∼µ φ` (x) 1(y = i)hi (x) ≈ E D [h], ∀h,                       Eb(`,i) = E [h`,i ].
                                                                                                bval
                                             
                                                                                                                                                                  (10)
`=1 i=1
                |             {z              }
                              Φµ,`
                               i [h]                                                                                 b −1 E.
                                                                                                                  b =Σ
                                                                        Then the weight coefficients are given by α       b
where Φµ,` [h] ∈ [0, 1]m can be seen as a φ-transformed
confusion matrix for the training distribution µ. For ex-               3.4. Choosing the Probing Classifiers h1,1 , . . . , hL,m
                                              1
ample, if one had only one
        µ,1                basis function φ (x) = 1, ∀x,              We will have to choose the Lm probing classifiers so that
then Φi [h] = E(x,y)∼µ 1(y = i)hi (x) gives the stan-                   Σb is well-conditioned. One way to do this is to choose
dard confusion entries for the training distribution. If the            the classifiers so that Σ   b has a high value on the diagonal
                          data into L clusters, asin (6),
basis functions divides the                                             entries and a low value on the off-diagonals, i.e. choose each
then Φµ,`
        i [h] = E(x,y)∼µ 1(g(x) = `, y = i)hi (x) gives                 classifier h`,i to evaluate to a high value on Φ     b tr,` [h] and a
                                                                                                                               i
the training confusion entries evaluated on examples from                              b tr,`
                                                                                             0
                                                                                                       0 0
cluster `. We can thus re-write equation (4) as a weighted              low value on Φ   i0    [h], ∀(` , i ) 6
                                                                                                              = (`, i). This can  be framed
combination of the Φ-confusion entries:                                 as the following constraint satisfaction problem on S tr :
                 L X
                   m                                                                      For h`,i pick h ∈ H such that:
                           αi` Φµ,`
                 X
                                           D
                                i [h]   ≈ E [h], ∀h.            (8)                                        0
                                                                            b tr,` [h] ≥ γ, and Φ
                                                                                                b tr,`           0 0
                 `=1 i=1
                                                                            Φ i                   i0 [h] ≤ ω, ∀(` , i ) 6= (`, i),                                (11)

Optimizing Black-box Metrics with Iterative Example Weighting

Algorithm 1: ElicitWeights for Diagonal Linear Metrics                    Algorithm 3: Frank-Wolfe with Elicited Gradients (FW-
                bval                       1         L                    EG) for General Diagonal Metrics (also depicted in Fig. 1)
 1: Input: E       , Basis functions φ , . . . , φ : X →[0, 1],
      Training set S tr ∼ µ, Val. set S val ∼ D, h̄, , H, γ, ω            1: Input: Ebval , Basis functions φ1 , . . . , φL : X →[0, 1],
 2:   If fixed classifier:                                                      Pre-trained ηbtr : X →∆m , S tr ∼ µ, S val ∼ D, T , 
 3:     Choose h`,i (x) = φ` (x) ei (x) + (1 − φ` (x)) h̄(x)             2:   Initialize classifier h0 and c0 = diag(C        b val [h0 ])
 4:   Else:                                                                3:   For t = 0 to T − 1 do
 5:     H̄ = {τ h + (1 − τ )h̄ | h ∈ H, τ ∈ [0, ]}                        4:     if E D [h] = ψ(C11     D                D
                                                                                                           [h], . . . , Cmm [h]) for known ψ:
                                                                                       t
 6:     Pick h`,i ∈ H̄ to satisfy (11) with slack γ, ω, ∀(`, i)            5:        β = ∇ψ(c )        t
                                                                                                  P t b val
 7:   Compute Σ  b and E b using (10) with metric Ebval                    6:        Eblin [h] =        i βi Cii [h]
                         −1
 8:   Output: α  b =Σ  b E  b                                              7:     else
                                                                           8:        Eblin [h] = Ebval [h]          {small  recommendeded}
Algorithm 2: Plug-in with Elicited Weights (PI-EW) for                     9:     f = PI-EW(E , φ , ..., φL , ηbtr , S tr , S val , ht , )
                                                                                   b               b  lin    1

Diagonal Linear Metrics                                                   10:     c = diag(C
                                                                                  e            b val [fb])
                                                                                                      2               2
                                                                                  ht+1 = 1 − t+1
                                                                                                           t
 1: Input: Ebval , Basis functions φ1 , . . . , φL : X →[0, 1],           11:                               h + t+1      onehot(fb)
                                                                                    t+1               2      t       2
                                                                                                          
      Class probability model ηbtr : X →∆m for µ, Training                12:     c      = 1 − t+1 c + t+1 e             c
      set S tr ∼ µ, Validation set S val ∼ D, h̄,                        13:   End For
 2:   b = ElicitWeights(Ebval , φ1 , . . . , φL , S tr , S val , h̄, )
      α                                                                   14:   Output: b    h = hT
      Example-weights: W  ci (x) = PL α           ` `
 3:                                        `=1 bi φi (x)
 4:   Plug-in: bh(x) ∈ argmaxi∈[m] W     ci (x)b  ηitr (x)                entries. These classifiers can be succinctly written as:
 5:   Output: b  h
                                                                                     h`,i (x) = φ` (x)ei (x) + (1 − φ` (x))h̄          (12)

for some γ > ω > 0 and a sufficiently flexible hypothesis                 where we again tune  to make sure that the resulting Σ   b is
class H for which the constraints are feasible. These prob-               well-conditioned. This choice of the probing classifiers also
lems can generally be solved by formulating a constrained                 works well in practice for general basis functions φ` ’s.
classification problem (Cotter et al., 2019b; Narasimhan,                 Algorithm 1 summarizes the weight elicitation procedure,
2018). We show in Appendix G that this problem is feasible                where the probing classifiers are either constructed by solv-
and can be efficiently solved for a range of settings.                    ing the constrained satisfaction problem (11) or set to the
In practice, we do not explicitly solve (11) over a hypothesis            “fixed” classifiers in (12). In both cases, the algorithm takes
class H. Instead, a simpler and surprisingly effective strat-             a base classifier h̄ and the parameter  as input, where 
egy is to set the probing classifiers to trivial classifiers that         controls the extent to which h̄ is perturbed to construct the
predict the same class on all (or a subset of) examples. To               probing classifiers. This radius parameter  restricts the
build intuition for why this is a good idea, consider a simple            probing classifiers to a neighborhood around h̄ and will
setting with only one basis function φ1 (x) = 1, ∀x, where                prove handy in the algorithm we develop in Section 4.2.
the φ-confusions Φ   b tr,1 [h] = 1tr P
                       i               n    (x,y)∈S tr 1(y = i)hi (x)
are the standard confusion entries on the training set. In                4. Plug-in Based Algorithms
this case, a trivial classifier ei (x) = onehot(i), ∀x, which
predicts class i on all examples, yields the highest value for            Having elicited the weight coefficients α, we now seek to
b tr,1 and 0 for all other Φ
Φ                             b tr,1 , ∀j 6= i. In fact, in our exper-    learn a classifier that optimizes the left hand side of (8).
  i                             j
                                                                          We do this via the plug-in approach: first pre-train a model
iments, we set the probing classifier h1,i to a randomized
                                                                          ηbtr : X →∆m on the noisy training distribution µ to estimate
combination of ei and some fixed base classifier h̄:
                                                                          the conditional class probabilities ηbitr (x) ≈ Pµ (y = i|x),
                h1,i (x) = ei (x) + (1 − )h̄(x),                        and then apply the correction weights to post-shift ηbtr .

for large enough  so that Σ
                           b is well-conditioned.                         4.1. Plug-in Algorithm for Linear Metrics

Similarly, if the basis functions divide the data into L clus-            We first describePour approach for (diagonal) linear met-
ters (as in (6)), then we can randomize between h̄ and a                  rics E D [h] =          D
                                                                                            i βi Cii [h] in Algorithm 2. Given the
trivial classifier that predicts a particular class i on all exam-                           c : X →Rm , we seek to maximize the
                                                                          correction weights W            +
ples assigned to the cluster ` ∈ [L]. The confusion matrix                following weighted objective on the training distribution:
for the resulting classifiers will have higher values than h̄                                    hP                          i
on the (`, i)-th diagonal entry and a lower value on other                       maxh E(x,y)∼µ        i W
                                                                                                        ci (x) 1(y = i)hi (x) .

Optimizing Black-box Metrics with Iterative Example Weighting

This is a standard example-weighted learning problem, for          Assumption 1. The distributions  P D and µ are such that
which the following plug-in (post-shift) classifier is a consis-   for any linear metric E D [h] = i βi Cii [h], with kβk ≤ 1,
tent estimator (Narasimhan et al., 2015b; Yang et al., 2020):      ∃ᾱ ∈ RLm s.t.
                                                                                     P        ` µ,`       D
                                                                                        `,i ᾱi Φi [h] − E [h] ≤ ν, ∀h and

                               ci (x) ηbtr (x).
            h(x) ∈ argmaxi∈[m] W
            b                                                      kᾱk1 ≤ B, for some ν ∈ [0, 1) and B > 0.
                                        i

                                                                   The assumption states that our choice of basis functions
4.2. Iterative Algorithm for General Metrics                       φ1 , . . . , φL are such that, any linear metric on D can be
                                                                   approximated       (up to a slack ν) by a weighting Wi (x) =
To optimize generic non-linear metrics of the form E D [h] =       P ` `
     D               D                                                  ᾱ   φ   (x) of the training examples from µ. The exis-
ψ(C11  [h], . . . , Cmm [h]) for ψ : [0, 1]m →R+ , we apply Al-       `    i
                                                                   tence of such a weighting function depends on how well
gorithm 2 iteratively. We consider both cases where ψ is
                                                                   the basis functions capture the underlying distribution shift.
unknown, and where ψ is known, but needs to be optimized
                                                                   Indeed, the assumption holds for some common settings
using the noisy distribution µ. The idea is to first elicit
                                                                   in Table 1, e.g., when the noise transition T is diagonal
local linear approximations to ψ and to then learn plug-in
                                                                   (Appendix A handles a general T), and the basis func-
classifiers for the resulting linear metrics in each iteration.
                                                                   tions are set to φ1 (x) = 1, ∀x, for the IDLN setting, and
Specifically, following Narasimhan et al. (2015b), we derive       φ` (x) = 1(g(x) = `), ∀x, for the CDLN setting.
our algorithm from the classical Frank-Wolfe method (Jaggi,
                                                                   We analyze the coefficients α   b elicited by Algorithm 1 when
2013) for maximizing a smooth concave function ψ(c) over
                                                                   the probing classifiers h`,i are chosen to satisfy (11). In
a convex set C ⊆ Rm . In our case, C is the set of confusion
                                                                   Appendix C, we provide an analysis when the probing clas-
matrices CD [h] achieved by any classifier h, and is con-
                                                                   sifiers h`,i are set to the fixed choices in (12).
vex when we allow randomized classifiers (see Lemma 10,
Appendix B.3). The algorithm maintains iterates ct , and           Theorem 1 (Error bound on elicited weights). Let γ, ω >
at each step, maximizes a linear approximation to ψ at ct :        0 be such that the constraints in (11) are feasible for hypoth-
c ∈ argmaxc∈C h∇ψ(ct ), ci. The next iterate ct+1 is then
e                                                                  esis class H̄, for all `, i. Suppose Algorithm 1 chooses each
a convex combination of ct and the current solution e c.           classifier h`,i to satisfy (11), with E D [h`,i ] ∈ [c, 1], ∀`, i, for
                                                                   some c > 0. Let ᾱ be defined as in Assumption 1. Suppose
In Algorithm 3, we outline an adaptation of this Frank-Wolfe              √                        2
                                                                   γ > 2 2Lmω and ntr ≥ L (mγ −        log(Lm|H|/δ)
                                                                                                         √            . Fix δ ∈ (0, 1).
algorithm to our setting, where we maintain a classifier ht                                          2     2Lmω)2
and an estimate of the diagonal confusion entries ct from the      Then w.p. ≥ 1 − δ over draws of S tr and S val from µ and D
validationPsample S val . At each step, we linearize ψ using       resp., the coefficients α
                                                                                           b output by Algorithm 1 satisfies:
E [h] = i βit C
  lin             b val [h], where β t = ∇ψ(ct ), and invoke
                                                                   kα
                                                                    b − ᾱk ≤
 b
                   ii
the plug-in method in Algorithm 2 to optimize the linear                    s                  s                   √
approximation Eblin . When the mathematical form of ψ is                      L log( Lm|H| )     L log( Lmδ )
                                                                                                                    
                                                                       Lm              δ                          ν Lm
known, one can directly compute the gradient β t . When it is      O                         +                  +        ,
                                                                       γ2          ntr              c2 nval         γ
not known, we can simply set Eblin [h] = Ebval [h], but restrict
the weight elicitation routine (Algorithm 1) to choose its         where the term |H| can be replaced by a measure of capacity
probing classifiers h`,i ’s from a small neighborhood around       of the hypothesis class H.
the current classifier ht (in which ψ is effectively linear).
This can be done by passing h̄ = ht to the weight elicitation      Because the probing classifiers are chosen using the training
routine, and setting the radius  to a small value.                set alone, it is only the sampling errors from the training set
                                                                   that depend on the complexity of H, and not those from the
Each call to Algorithm 2 uses the training and validation set      validation set. This suggests robustness of our approach to a
to elicit example weights for a local linear approximation         small validation set as long as the training set is sufficiently
to ψ, and uses the weights to construct a plug-in classifier.      large and the number of basis functions is reasonably small.
The final output is a randomized combination of the plug-
in classifiers from each step. Note that Algorithm 3 runs          For the iterative plug-in method in Algorithm 3, we bound
efficiently for reasonable values of L and m. Indeed the           the gap between the metric value E D [b   h] for the output clas-
runtime is almost always dominated by the pre-training of                 h on the true distribution D, and the optimal value.
                                                                   sifier b
the base model ηbtr , with the time taken to elicit the weights    We handle the case where the function ψ is known and its
(e.g. using (12)) being relatively inexpensive (see App. E).       gradient ∇ψ can be computed in closed-form. The more
                                                                   general case of an unknown ψ is handled in Appendix D.
5. Theoretical Guarantees                                          The above bound depends on the gap between the estimated
                                                                   class probabilities ηbitr (x) for the training distribution and
We provide theoretical guarantees for the weight elicitation       true class probabilities ηitr (x) = P(y = i|x), as well as the
procedure and the plug-in methods in Algorithms 1–3.               quality of the coefficients α b provided by the weight estima-

Optimizing Black-box Metrics with Iterative Example Weighting

tion subroutine, as measured by κ(·). One can substitute Section 7 that our method yields better accuracies. Other re-
κ(·) with, e.g., the error bound provided in Theorem 1. lated black-box learning methods include Zhao et al. (2019),
Theorem 2 (Error Bound for FW-EG). Let E D [h] = Ren et al. (2018), and Huang et al. (2019), who learn a
ψ(C11 D D
[h], . . . , Cmm [h]) for a known concave function ψ : (weighted) loss to approximate the metric, but do so using
[0, 1]m →R+ , which is Q-Lipschitz and λ-smooth. Fix computationally expensive procedures (e.g. meta-gradient
δ ∈ (0,P 1). Suppose Assumption 1 holds, and for any linear descent or RL) that often require retraining the model from
metric i βi CiiD [h], whose associated weight coefficients scratch, and come with limited theoretical analysis.
is ᾱ with kᾱk ≤ B, w.p. ≥ 1 − δ over draw of S tr and S val , Methods for distribution shift. The literature on distri-
the weight estimation routine in Alg. 1 outputs coefficients α b bution shift is vast, and so we cover a few representative
tr val
with kα b − ᾱk ≤ √κ(δ, n , n ), for some function κ(·) > 0. papers; see (Frénay & Verleysen, 2013; Csurka, 2017) for a
Let B 0 = B + Lm κ(δ/T, ntr , nval ). Then w.p. ≥ 1 − δ comprehensive discussion. For the independent label noise
over draws of S tr and S val from D and µ resp., the classifier setting (Natarajan et al., 2013), Patrini et al. (2017) pro-
h output by Algorithm 3 after T iterations satisfies:
b
pose a loss correction approach that first trains a model with
max E D [h] − E D [b
h] ≤ noisy label, use its predictions to estimate the noise tran-
h
√ sition matrix, and then re-trains model with the corrected
2QB 0 Ex kη tr (x) − ηbtr (x)k1 + 4Q Lm κ( Tδ , ntr , nval ) +

loss. This approach is however tailored to optimize linear
r ! metrics; whereas, we can handle more complex metrics as
m log(m) log(nval ) + log(m/δ) λ
O λm + + Qν . well without re-training the underlying model. A plethora
nval T of approaches exist for tackling domain shift, including clas-
The proof in turn derives an error bound for the plug-in sical importance weighting (IW) strategies (Sugiyama et al.,
classifier in Algorithm 2 for linear metrics (see App. B.2). 2008; Shimodaira, 2000; Kanamori et al., 2009; Lipton et al.,
2018) that work in two steps: estimate the density ratios and
train a model with the resulting weighted loss. One such
6. Related Work approach is Kernel Mean Matching (Huang et al., 2006),
Methods for closed-form metrics. There has been a va- which matches covariate distributions between training and
riety of work on optimizing complex evaluation metrics, test sets in a high dimensional RKHS feature space. These
including both plug-in type algorithms (Ye et al., 2012; IW approaches are however prone to over-fitting when used
Narasimhan et al., 2014; Koyejo et al., 2014; Narasimhan with deep networks (Byrd & Lipton, 2019). More recent
et al., 2015b; Yan et al., 2018), and those that use con- iterative variants seek to remedy this (Fang et al., 2020).
vex surrogates for the metric (Joachims, 2005; Kar et al.,
2014; 2016; Narasimhan et al., 2015a; Eban et al., 2017;
7. Experiments
Narasimhan et al., 2019; Hiranandani et al., 2020). These We run experiments on four classification tasks, with both
methods rely on the test metric having a specific closed-form known and black-box metrics, and under different label
structure and do not handle black-box metrics. noise and domain shift settings. All our experiments use
a large training sample, which is either noisy or contains
Methods for black-box metrics. Among recent black-
missing attributes, and a smaller clean (and complete) vali-
box metric learning works, the closest to ours is Jiang
dation sample. We always optimize the cross-entropy loss
et al. (2020), who learn a weighted combination of sur-
for learning ηbtr (x) ≈ Pµ (Y |x) using the training set (or
rogate losses to approximate the metric on a validation set.
ηbval (x) ≈ PD (Y |x) for some baselines), where the models
Like us, they probe the metric at multiple classifiers, but
are varied across experiments. For monitoring the quality
their approach has several drawbacks on both practical and
of ηbtr and ηbval , we sample small subsets hyper-train and
theoretical fronts. Firstly, Jiang et al. (2020) require retrain-
hyper-val data from the original training and validation data,
ing the model in each iteration, which can be time-intensive,
respectively. We repeat our experiments over 5 random train-
whereas we only post-shift a pre-trained model. Secondly,
vali-test splits, and report the mean and standard deviation
the procedure they prescribe for eliciting gradients requires
for each metric. We will use ∗ , ∗∗ , and ∗∗∗ to denote that
perturbing the model parameters multiple times, which can
the differences between our method and the closest baseline
be very expensive for large deep networks, whereas we only
are statistically significant (using Welch’s t-test) at a confi-
require perturbing the predictions from the model. More-
dence level of 90%, 95%, and 99%, respectively. Table 6
over, the number of perturbations they need grows polyno-
in App. H summarizes the datasets used. The source code
mially with the precision with which they need to estimate
(along with random seeds) is provided on the link below.1
the loss coefficients, whereas we only require a constant
number of them. Lastly, their approach does not come with Common baselines: We use representative baselines from
strong statistical guarantees, whereas ours does. Besides 1
https://github.com/koyejolab/fweg/
these benefits over (Jiang et al., 2020), we will also see in

Optimizing Black-box Metrics with Iterative Example Weighting

Table 2: Test accuracy for noisy label experiment on CIFAR-10.         Table 3: Test G-mean for proxy label experiment on Adult.
            Cross-entropy [train] 0.582 ± 0.007                                Cross-entropy [train]   0.654 ± 0.002
            Cross-entropy [val]      0.386 ± 0.031                             Cross-entropy [val]     0.394 ± 0.064
            Learn-to-reweight        0.651 ± 0.017                             Opt-metric [val]        0.652 ± 0.027
            Plug-in [train-val]      0.733 ± 0.044                             Learn-to-reweight       0.668 ± 0.003
            Forward Correction       0.757 ± 0.005                             Plug-in [train-val]     0.672 ± 0.013
            Fine-tuning              0.769 ± 0.005                             Forward Correction      0.214 ± 0.004
            PI-EW                    0.781 ± 0.019                             Fine-tuning             0.631 ± 0.017
                                                                               Importance Weights      0.662 ± 0.024
the black-box learning (Jiang et al., 2020), iterative re-                     Adaptive Surrogates     0.682 ± 0.002
                                                                               FW-EG [unknown ψ] 0.685 ± 0.002∗∗
weighting (Ren et al., 2018), label noise correction (Patrini
                                                                               FW-EG [known ψ]         0.685 ± 0.001∗
et al., 2017), and importance weighting (Huang et al., 2006)
literatures. First, we list the ones common to all experiments.     PLANE, DEER → HORSE, CAT ↔ DOG, with a flip proba-
                                                                    bility of 0.6. For ηbtr and ηbval , we use the same ResNet-14
1. Cross-entropy [train]: Maximizes accuracy on the                 architecture as (Patrini et al., 2017), trained using SGD with
   training set and predicts: bh(x) ∈ argmaxi∈[m] ηbitr (x).        momentum 0.9, weight decay 1e−4 , and learning rate 0.01,
2. Cross-entropy [val]: Maximizes accuracy on the vali-             which we divide by 10 after 40 and 80 epochs (120 in total).
   dation set and predicts: bh(x) ∈ argmaxi∈[m] ηbival (x).
                                                                    We additionally compare with the Forward Correction
3. Fine-tuning: Fine-tunes the pre-trained ηbtr using the
                                                                    method of (Patrini et al., 2017), a specialized method for cor-
   validation data, monitoring the cross-entropy loss on the
                                                                    recting independent label noise, which estimates the noise
   hyper-val data for early stopping.
                                                                    transition matrix T using predictions from ηbtr on the train-
4. Opt-metric [val]: For metrics ψ(CD [h]), for which ψ             ing set, and retrains it with the corrected loss, thus training
   is known, trains a model to directly maximize the metric         the ResNet twice. We saw a notable drop with this method
   on the small validation set using the Frank-Wolfe based          when we used the (small) validation set to estimate T.
   algorithm of (Narasimhan et al., 2015b).
5. Learn-to-reweight (Ren et al., 2018): Jointly learns             We apply the proposed PI-EW method for linear metrics,
   example weights, with the model, to maximize accuracy            using a weighting function W defined with one of two
   on the validation set; does not handle specialized metrics.      choices for the basis functions (chosen via cross-validation):
                                                                    (i) a default basis function that clusters all the points together
6. Plug-in [train-val]: Constructs a classifier b     h(x) ∈
                                                                    φdef (x) = 1 ∀x, and (ii) ten basis functions φ1 , . . . , φ10 ,
   argmaxi wi ηbival (x), where the weights wi ∈ R are tuned
                                                                    each one being the average of the RBF kernels (see (7))
   to maximize the given metric on the validation set, using
                                                                    centered at validation points belonging to a true class. The
   a coordinate-wise line search (details in Appendix F).
                                                                    RBF kernels are computed with width 2 on UMAP-reduced
7. Adaptive Surrogates (Jiang et al., 2020): Learns a               50-dimensional image embeddings (McInnes et al., 2018).
   weighted combination of surrogate losses (evaluated on
   clusters of examples) to approximate the metric on the           As shown in Table 2, PI-EW achieves significantly better test
   validation set. Since this method is not directly amenable       accuracies than all the baselines. The results for Forward
   for use with large neural networks (see Section 6), we           Correction matches those in (Patrini et al., 2017); unlike
   compare with it only when using linear models, and               this method, we train the ResNet only once, but achieve
   present additional comparisons in App. H (Table 7).              2.4% higher accuracy. Cross-entropy [val] over-fits badly,
                                                                    and yields the least test accuracy. Surprisingly, the sim-
Hyper-parameters: The learning rate for Fine-tuning is              ple fine-tuning yields the second-best accuracy. A possible
chosen from 1e{−6,...,−4} . For PI-EW and FW-EG, we                 reason is that the pre-trained model learns a good feature
tune the parameter  from {1, 0.4, 1e−{4,3,2,1} }. The line         representation, and the fine-tuning step adapts well to the
search for Plug-in is performed with a spacing of 1e−4 . The        domain change. We also observed that PI-EW achieves
only hyper-parameters the other baselines have are those for        better accuracy during cross-validation with ten basis func-
training ηbtr and ηbval , which we state in the individual tasks.   tions, highlighting the benefit of the underlying modeling in
                                                                    PI-EW. Lastly, in Figure 2(a), we show the elicited (class)
7.1. Maximizing Accuracy under Label Noise                          weights with the default basis function (φdef (x) = 1 ∀x),
In our first task, we train a 10-class image classifier for the     where e.g. because BIRD → PLANE, the weight on BIRD
CIFAR-10 dataset (Krizhevsky et al., 2009), replicating the         is upweighted and that on PLANE is down-weighted.
independent (asymmetric) label noise setup from (Patrini
et al., 2017). The evaluation metric we use is accuracy. We         7.2. Maximizing G-mean with Proxy Labels
take 2% of original training data as validation data and flip       Our next experiment borrows the “proxy label” setup from
labels in the remaining training set based on the follow-           Jiang et al. (2020) on the Adult dataset (Dua & Graff, 2017).
ing transition matrix: TRUCK → AUTOMOBILE, BIRD →                   The task is to predict whether a candidate’s gender is male,

Optimizing Black-box Metrics with Iterative Example Weighting

Table 4: Test F-measure for domain shift experiment on Adience. Table 5: Black-box fairness metric on the test set for Adult.
Cross-entropy [train] 0.760 ± 0.014 Cross-entropy [train] 0.736± 0.005
Cross-entropy [val] 0.708 ± 0.022 Cross-entropy [val] 0.610 ± 0.020
Opt-metric [val] 0.760 ± 0.014 Learn-to-reweight 0.729 ± 0.007
Plug-in [train-val] 0.759 ± 0.014 Fine-tuning 0.738± 0.005
Importance Weights [KMM] 0.760 ± 0.013 Adaptive Surrogates 0.812 ± 0.004
Learn-to-reweight 0.773 ± 0.009 Plug-in [train-val] 0.812 ± 0.005
Fine-tuning 0.781 ± 0.014 FW-EG 0.822 ± 0.002∗∗∗
FW-EG [unknown ψ] 0.815 ± 0.013∗∗∗
FW-EG [known ψ] 0.804 ± 0.015∗∗∗
7.3. Maximizing F-measure under Domain Shift

but the training set contains only a proxy for the true label. We now move on to a domain shift application (see Ex-
We sample 1% validation data from the original training ample 4). The task is to learn a gender recognizer for the
data, and replace the labels in the remaining sample with Adience face image dataset (Eidinger et al., 2014), but with
the feature ‘relationship-husband’. The label noise here is the training and test datasets containing images from differ-
instance-dependent (see Example 3), and we seek to maxi- ent age groups (domain shift based on age). We use images
1/m
belonging to age buckets 1–5 for training (12.2K images),
Q P
mize the G-mean metric: ψ(C) = i C ii / j C ij .
and evaluate on images from age buckets 6–8 (4K images).
We train ηbtr and ηbval using linear logistic regression using For the validation set, we sample 20% of the 6–8 age bucket
SGD with a learning rate of 0.01. As additional baselines, images. Here we aim to maximize the F-measure.
we include the Adaptive Surrogates method of (Jiang et al.,
2020) and Forward Correction (Patrini et al., 2017). The For ηbtr and ηbval , we use the same ResNet-14 model from
inner and outer learning rates for Adaptive Surrogates are the CIFAR-10 experiment, except that the learning rate is
each cross-validated in {0.1, 1.0}. We also compare with a divided by 2 after 10 epochs (20 in total). As an additional
simple Importance Weighting strategy, where we first train baseline, we compute importance weights using Kernel
a logistic regression model f to predict if an example (x, y) Mean Matching (KMM) (Huang et al., 2006), and train the
belongs to the validation data, and train a gender classifier same ResNet model with a weighted loss. Since the image
with the training examples weighted by f (x, y)/(1−f (x, y)). size is large for directly applying KMM, we first compute
the 2048-dimensional ImageNet embedding (Krizhevsky
We choose between three sets of basis functions (using cross- et al., 2012) for the images and further reduce them to 10-
validation): (i) a default basis function φdef (x) = 1 ∀x, dimensions via UMAP. The KMM weights are learned on
(ii) φdef , φpw , φnpw , where φpw (x) = 1(xpw = 1) and the 10-dimensional embedding. For the basis functions,
φnpw (x) = 1(xnpw = 1) use features ‘private-workforce’ besides the default basis φdef (x) = 1 ∀x, we choose from
and ‘non-private-workforce’ to form hard clusters, (iii) φdef , subsets of six RBF basis functions φ1 , . . . , φ6 , centered at
φpw , φnpw , φinc , where φinc (x) = 1(xinc = 1) uses the bi- points from the validation set, each representing one of six
nary feature ‘income’. These choices are motivated from age-gender combinations. We use the same UMAP embed-
those used by (Jiang et al., 2020), who compute surrogate ding as KMM to compute the RBF kernels.
losses on the individual clusters. We provide their Adaptive
Surrogates method with the same clustering choices. Table 4 presents the test F-measure values. Both variants of
FW-EG algorithm provide statistically significant improve-
Table 3 summarizes our results. We apply both variants of ments over the baselines. Both Fine-tuning and Learning-
our FW-EG method for a non-linear metric ψ, one where to-reweight improve over plain cross-entropy optimization
ψ is known and its gradient is available in closed-form, (train), however only moderately, likely because of the small
and the other where ψ is assumed to be unknown, and is size of the validation set, and because these methods are not
treated as a general black-box metric. Both variants perform tailored to optimize the F-measure.
similarly and are better than the baselines. Adaptive Sur-
rogates comes a close second, but underperforms by 0.3% 7.4. Maximizing Black-box Fairness Metric
(with results being statistically significant). While the im-
provement of FW-EG over Adaptive Surrogates is small, We next handle a black-box metric given only query access
the latter is time intensive as, in each iteration, it re-trains to its value. We consider a fairness application where the
a logistic regression model. We verify this empirically in goal is to balance classification performance across multi-
Figure 2(b) by reporting run-times for Adaptive Surrogates ple protected groups. The groups that one cares about are
and our method FW-EG (including the pre-training time) known, but due to privacy or legal restrictions, the protected
against the choices of basis functions (clustering features). attribute for an individual cannot be revealed (Awasthi et al.,
We see that our approach is 5× faster for this experiment. 2021). Instead, we have access to an oracle that reveals the
Lastly, Forward Correction performs poorly, likely because value of the fairness metric for predictions on a validation
its loss correction is not aligned with this label noise model. sample, with the protected attributes absent from the train-

Optimizing Black-box Metrics with Iterative Example Weighting

                                                                                                                                                                                                                                                           Test Accuracy on Fixed Test Data
                                                                                                                         FW-EG (ours)                                                                                                          Base Std

                                                                                    Wall Clock Time (in seconds)
           0.175                                                                                                                                                                            0.60                                               0.004
                                                                                                                   100   Adaptive Surrogates                                                                                                                                                  0.82

                                                                                                                                                                   Test Accuracy of PI-EW
                                                                                                                                                                                                                                               0.008
           0.150                                                                                                                                                                            0.58                                               0.012
                                                                                                                                                                                                                                               0.016                                          0.80
                                                                                                                    80
                                                                                                                                                                                            0.56                                               0.020
           0.125
                                                                                                                                                                                                                                               PI-EW Std                                      0.78
 weights

                                                                                                                                                                                                                                               0.004
           0.100                                                                                                    60                                                                      0.54
                                                                                                                                                                                                                                               0.008
                                                                                                                                                                                                                                               0.012
                                                                                                                                                                                                                                                                                              0.76
           0.075
                                                                                                                                                                                            0.52
                                                                                                                                                                                                                                               0.016
                                                                                                                    40
                                                                                                                                                                                                                                               0.020                                          0.74
                                                                                                                                                                                            0.50                                                                                                                                                CE [train]
           0.050

                                                                                                                    20
                                                                                                                                                                                            0.48                                                                                              0.72                                              Fine-tuning
           0.025
                                                                                                                                                                                                                                                                                                                                                PI-EW
                                                                                                                                                                                            0.46                                                                                              0.70
           0.000                                                                                                     0
                   plane vehicle bird   cat   deer dog   frog horse ship   truck                                           1                   3              4                                    0.50    0.52   0.54   0.56   0.58   0.60   0.62                                                   1   2   3   4   6    8        12          16       20
                                               Classes                                                                    Number of Grouping Features (Sec. 7.2)                                          Log Loss of Base Model on Hyper-train                                                                      Size of Validation Data            ×102

                (a)                               (b)                               (c)                                (d)
Figure 2: (a) Elicited (class) weights for CIFAR-10 by PI-EW for the default basis (Sec. 7.1); (b) Run-time for FW-EG and Adaptive
Surrogates (Jiang et al., 2020) vs no. of grouping features on proxy label task (Sec. 7.2); (c) Effect of quality of the base model ηbtr on
Adult (Sec. 7.2): as the base model’s quality improves, the test accuracies of PI-EW also improves; (d) Effect of the validation set size on
Adience (Sec. 7.3): PI-EW performs better than fine-tuning even for small validation sets, while both improve with larger ones.

ing sample. This setup is different from recent work on                                                                                                                                     for the Adult experiment (Section 7.2), varies with the qual-
learning fair classifiers from incomplete group information                                                                                                                                 ity of the base model ηbtr . We save an estimate of ηbtr after
(Lahoti et al., 2019; Wang et al., 2020b), in that the focus                                                                                                                                every 50 batches (batch size 32) while training the logistic
here is on optimizing any given black-box fairness metric.                                                                                                                                  regression model, and use these estimates as inputs to PI-
                                                                                                                                                                                            EW. As shown in Figure 2(c), the test accuracies for PI-EW
We use the Adult dataset, and seek to predict whether the
                                                                                                                                                                                            improves with the quality of ηbtr (as measured by the log loss
candidate’s income is greater than $50K, with gender as the
                                                                                                                                                                                            on the hyper-train set). This is in accordance with Theo-
protected group. The black-box metric we consider (whose
                                                                                                                                                                                            rem 2. One can further improve the quality of the estimate
form is unknown to the learner) is the geometric mean of
                                                                                                                                                                                            η tr by using calibration techniques (Guo et al., 2017), which
the true-positive (TP) and true-negative (TN) rates, evalu-
                                                                                                                                                                                            will likely enhance the performance of PI-EW as well.
ated separately on the male and female examples, which
promotes equal performance for both groups and classes:                                                                                                                                     Next, we show that PI-EW is robust to changes in the vali-
                                                          1/4                                                                                                                              dation set size when trained on the Adience experiment in
E D [h] = TPmale [h] TNmale [h]) TPfemale [h] TNfemale [h]     .
                                                                                                                                                                                            Section 7.3 to optimize accuracy. We set aside 50% of 6–8
We train the same logistic regression models as in previous                                                                                                                                 age bucket data for testing, and sample varying sizes of vali-
Adult experiment in Section 7.2. Along with the basis func-                                                                                                                                 dation data from the rest. As shown in Figure 2(d), PI-EW
tions φdef , φpw and φnpw we used there, we additionally in-                                                                                                                                generally performs better than fine-tuning even for small
clude two basis φhs and φwf based on features ‘relationship-                                                                                                                                validation sets, while both improve with larger ones. The
husband’ and ‘relationship-wife’, which we expect to have                                                                                                                                   only exception is 100-sized validation set (0.8% of training
correlations with gender.2 We include two baselines that can                                                                                                                                data), where we see overfitting due to small validation size.
handle black-box metrics: Plug-in [train-val], which tunes
a threshold on ηbtr by querying the metric on the validation                                                                                                                                8. Conclusion and Discussion
set, and Adaptive Surrogates. The latter is cross-validated
                                                                                                                                                                                        We have proposed the FW-EG method for optimizing black-
on the same set of clustering features (i.e., basis functions
                                                                                                                                                                                        box metrics given query access to the evaluation metric on
in our method) for computing the surrogate losses.
                                                                                                                                                                                        a small validation set. Our framework includes common
As seen in Table 5, FW-EG yields the highest black-box met-                                                                                                                             distribution shift settings as special cases, and unlike prior
ric on the test set, Adaptive Surrogates comes in second, and                                                                                                                           distribution correction strategies, is able to handle general
surprisingly the simple plug-in approach fairs better than the                                                                                                                          non-linear metrics. A key benefit of our method is that it
other baselines. During cross-validation, we also observed                                                                                                                              is agnostic to the choice of ηbtr , and can thus be used to
that the performance of FW-EG improves with more basis                                                                                                                                  post-shift pre-trained deep networks, without having to re-
functions, particularly with the ones that are better corre-                                                                                                                            train them. We showed that the post-shift example weights
lated with gender. Specifically, FW-EG with basis functions                                                                                                                             can be flexibly modeled with various choices of basis func-
{φdef , φpw , φnpw , φwf , φhs } achieves approximately 1% bet-                                                                                                                         tions (e.g., hard clusters, RBF kernels, etc.) and empirically
ter performance than both FW-EG with φdef basis function                                                                                                                                demonstrated their efficacies. We look forward to further
and FW-EG with basis functions {φdef , φpw , φnpw }.                                                                                                                                    improving the results with more nuanced basis functions.

7.5. Ablation Studies
                                                                                                                                                                                            Acknowledgements
We close with two sets of experiments. First, we analyze
                                                                                                                                                                                        We thank the anonymous reviewers for their helpful and
how the performance of PI-EW, while optimizing accuracy
                                                                                                                                                                                        constructive feedback. We also thank Google Cloud for
             2
     The only domain knowledge we use is that the protected group                                                                                                                       supporting this research with cloud computing credits.
is “gender”; beyond this, the form of the metric is unknown, and
importantly, an individual’s gender is not available.

Optimizing Black-box Metrics with Iterative Example Weighting

References                                                      Frénay, B. and Verleysen, M. Classification in the presence
                                                                   of label noise: a survey. IEEE transactions on neural
Awasthi, P., Beutel, A., Kleindessner, M., Morganstern, J.,
                                                                   networks and learning systems, 25(5):845–869, 2013.
 and Wang, X. Evaluating fairness of machine learning
 models under uncertain and incomplete information. In          Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On cali-
 FAccT, 2021.                                                     bration of modern neural networks. In International Con-
                                                                  ference on Machine Learning, pp. 1321–1330. PMLR,
Byrd, J. and Lipton, Z. What is the effect of importance
                                                                  2017.
  weighting in deep learning? In International Conference
  on Machine Learning, pp. 872–881. PMLR, 2019.                 Hiranandani, G., Boodaghians, S., Mehta, R., and Koyejo,
                                                                  O. Performance metric elicitation from pairwise classifier
Cotter, A., Gupta, M., and Narasimhan, H. On making
                                                                  comparisons. In The 22nd International Conference on
  stochastic classifiers deterministic. In Advances in Neural     Artificial Intelligence and Statistics, pp. 371–379, 2019a.
  Information Processing Systems, 2019a.
                                                                Hiranandani, G., Boodaghians, S., Mehta, R., and Koyejo,
Cotter, A., Jiang, H., Wang, S., Narayan, T., You, S., Srid-      O. O. Multiclass performance metric elicitation. In Ad-
  haran, K., and Gupta, M. R. Optimization with non-              vances in Neural Information Processing Systems, pp.
  differentiable constraints with applications to fairness,       9351–9360, 2019b.
  recall, churn, and other goals. Journal of Machine Learn-
  ing Research (JMLR), 20(172):1–59, 2019b.                     Hiranandani, G., Vijitbenjaronk, W., Koyejo, S., and Jain,
                                                                  P. Optimization and analysis of the pap@ k metric for
Csurka, G. A comprehensive survey on domain adaptation            recommender systems. In International Conference on
  for visual applications. Domain adaptation in computer          Machine Learning, pp. 4260–4270. PMLR, 2020.
  vision applications, pp. 1–35, 2017.
                                                                Huang, C., Zhai, S., Talbott, W., Martin, M. B., Sun, S.-
Daniely, A., Sabato, S., Ben-David, S., and Shalev-Shwartz,      Y., Guestrin, C., and Susskind, J. Addressing the loss-
  S. Multiclass learnability and the erm principle. In Pro-       metric mismatch with adaptive loss alignment. In Interna-
  ceedings of the 24th Annual Conference on Learning              tional Conference on Machine Learning, pp. 2891–2900.
  Theory, pp. 207–232. JMLR Workshop and Conference               PMLR, 2019.
  Proceedings, 2011.
                                                                Huang, J., Gretton, A., Borgwardt, K., Schölkopf, B., and
Daniely, A., Sabato, S., Ben-David, S., and Shalev-Shwartz,       Smola, A. Correcting sample selection bias by unlabeled
  S. Multiclass learnability and the erm principle. Journal       data. Advances in neural information processing systems,
  of Machine Learning Research, 16:2377–2404, 2015.              19:601–608, 2006.
Daskalaki, S., Kopanas, I., and Avouris, N. Evaluation          Jaggi, M. Revisiting Frank-Wolfe: Projection-free sparse
  of classifiers for an uneven class distribution problem.        convex optimization. In ICML, 2013.
 Applied Artificial Intelligence, 20:381–417, 2006.
                                                                Jiang, Q., Adigun, O., Narasimhan, H., Fard, M. M., and
Demmel, J. W. Applied numerical linear algebra. SIAM,              Gupta, M. Optimizing black-box metrics with adaptive
 1997.                                                             surrogates. In ICML, 2020.
Dua, D. and Graff, C. UCI machine learning repository,          Joachims, T. A support vector method for multivariate
  2017. URL http://archive.ics.uci.edu/ml.                        performance measures. In Proceedings of the 22nd inter-
                                                                  national conference on Machine learning, pp. 377–384.
Eban, E., Schain, M., Mackey, A., Gordon, A., Rifkin, R.,         ACM, 2005.
  and Elidan, G. Scalable learning of non-decomposable
  objectives. In Artificial intelligence and statistics, pp.    Kanamori, T., Hido, S., and Sugiyama, M. A least-squares
  832–840. PMLR, 2017.                                            approach to direct importance estimation. The Journal of
                                                                  Machine Learning Research, 10:1391–1445, 2009.
Eidinger, E., Enbar, R., and Hassner, T. Age and gender
  estimation of unfiltered faces. IEEE Transactions on          Kar, P., Narasimhan, H., and Jain, P. Online and stochastic
  Information Forensics and Security, 9(12):2170–2179,            gradient methods for non-decomposable loss functions.
  2014.                                                           arXiv preprint arXiv:1410.6776, 2014.

Fang, T., Lu, N., Niu, G., and Sugiyama, M. Rethinking          Kar, P., Li, S., Narasimhan, H., Chawla, S., and Sebastiani,
  importance weighting for deep learning under distribution       F. Online optimization methods for the quantification
  shift. arXiv preprint arXiv:2006.04662, 2020.                   problem. In Proceedings of the 22nd ACM SIGKDD

Optimizing Black-box Metrics with Iterative Example Weighting

  international conference on knowledge discovery and           Natarajan, B. K. On learning sets and functions. Machine
  data mining, pp. 1625–1634, 2016.                               Learning, 4(1):67–97, 1989.
Koyejo, O. O., Natarajan, N., Ravikumar, P. K., and Dhillon,    Natarajan, N., Dhillon, I. S., Ravikumar, P. K., and Tewari,
  I. S. Consistent binary classification with generalized         A. Learning with noisy labels. Advances in neural infor-
  performance metrics. In NIPS, pp. 2744–2752, 2014.              mation processing systems, 26:1196–1204, 2013.

Krizhevsky, A., Hinton, G., et al. Learning multiple layers     Patrini, G., Rozza, A., Krishna Menon, A., Nock, R., and
  of features from tiny images. 2009.                             Qu, L. Making deep neural networks robust to label noise:
                                                                  A loss correction approach. In Proceedings of the IEEE
Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet         Conference on Computer Vision and Pattern Recognition,
  classification with deep convolutional neural networks.         pp. 1944–1952, 2017.
  Advances in neural information processing systems, 25:
  1097–1105, 2012.                                              Ren, M., Zeng, W., Yang, B., and Urtasun, R. Learning to
                                                                  reweight examples for robust deep learning. In Interna-
Lahoti, P., Gummadi, K. P., and Weikum, G. ifair: Learn-          tional Conference on Machine Learning, pp. 4334–4343.
  ing individually fair data representations for algorithmic      PMLR, 2018.
  decision making. In 2019 IEEE 35th International Con-
  ference on Data Engineering (ICDE), pp. 1334–1345.            Shimodaira, H. Improving predictive inference under covari-
  IEEE, 2019.                                                     ate shift by weighting the log-likelihood function. Jour-
                                                                  nal of statistical planning and inference, 90(2):227–244,
Lewis, D. Evaluating and optimizing autonomous text clas-         2000.
  sification systems. In SIGIR, 1995.
                                                                Stewart, G. W. Perturbation theory for the singular value
Lipton, Z., Wang, Y.-X., and Smola, A. Detecting and              decomposition. Technical report, 1998.
  correcting for label shift with black box predictors. In
                                                                Sugiyama, M., Suzuki, T., Nakajima, S., Kashima, H., von
  International conference on machine learning, pp. 3122–
                                                                  Bünau, P., and Kawanabe, M. Direct importance estima-
  3130. PMLR, 2018.
                                                                  tion for covariate shift adaptation. Annals of the Institute
McInnes, L., Healy, J., Saul, N., and Großberger, L. Umap:        of Statistical Mathematics, 60(4):699–746, 2008.
 Uniform manifold approximation and projection. Journal
                                                                Tamburrelli, G. and Margara, A. Towards automated A/B
 of Open Source Software, 3(29):861, 2018.
                                                                  testing. In International Symposium on Search Based
Menon, A. K., Van Rooyen, B., and Natarajan, N. Learning          Software Engineering, pp. 184–198. Springer, 2014.
 from binary labels with instance-dependent noise. Ma-          Wang, J., Liu, Y., and Levy, C. Fair classification with group-
 chine Learning, 107(8-10):1561–1595, 2018.                      dependent label noise. arXiv preprint arXiv:2011.00379,
Narasimhan, H. Learning with complex loss functions and          2020a.
  constraints. In International Conference on Artificial        Wang, S., Guo, W., Narasimhan, H., Cotter, A., Gupta, M.,
  Intelligence and Statistics, pp. 1646–1654, 2018.              and Jordan, M. I. Robust optimization for fairness with
Narasimhan, H., Vaish, R., and Agarwal, S. On the statistical    noisy protected groups. 2020b.
  consistency of plug-in classifiers for non-decomposable       Yan, B., Koyejo, S., Zhong, K., and Ravikumar, P. Binary
  performance measures. In Advances in Neural Informa-            classification with karmic, threshold-quasi-concave met-
  tion Processing Systems, pp. 1493–1501, 2014.                   rics. In International Conference on Machine Learning,
Narasimhan, H., Kar, P., and Jain, P. Optimizing non-             pp. 5531–5540. PMLR, 2018.
  decomposable performance measures: A tale of two              Yang, F., Cisse, M., and Koyejo, S. Fairness with overlap-
  classes. In International Conference on Machine Learn-          ping groups. 2020.
  ing, pp. 199–208. PMLR, 2015a.
                                                                Ye, N., Chai, K. M., Lee, W. S., and Chieu, H. L. Optimizing
Narasimhan, H., Ramaswamy, H., Saha, A., and Agarwal,             f-measures: a tale of two approaches. In Proceedings of
  S. Consistent multiclass algorithms for complex perfor-         the 29th International Conference on Machine Learning,
  mance measures. In ICML, pp. 2398–2407, 2015b.                  pp. 289–296. Omnipress, 2012.
Narasimhan, H., Cotter, A., and Gupta, M. Optimizing            Zhao, S., Fard, M. M., Narasimhan, H., and Gupta, M.
  generalized rate metrics with three players. In Advances        Metric-optimized example weights. In International Con-
  in Neural Information Processing Systems, pp. 10746–            ference on Machine Learning, pp. 7533–7542. PMLR,
 10757, 2019.                                                     2019.

You can also read