Robust Bayesian Classification Using an Optimistic Score Ratio
Viet Anh Nguyen¹  Nian Si¹  Jose Blanchet¹

¹Stanford University. Correspondence to: Viet Anh Nguyen.

Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).
Abstract

We build a Bayesian contextual classification model using an optimistic score ratio for robust binary classification when there is limited information on the class-conditional, or contextual, distribution. The optimistic score searches for the distribution that is most plausible to explain the observed outcomes in the testing sample among all distributions belonging to the contextual ambiguity set, which is prescribed using a limited structural constraint on the mean vector and the covariance matrix of the underlying contextual distribution. We show that the Bayesian classifier using the optimistic score ratio is conceptually attractive, delivers solid statistical guarantees and is computationally tractable. We showcase the power of the proposed optimistic score ratio classifier on both synthetic and empirical data.

1. Introduction

We consider a binary classification setting in which we are provided with training samples from two classes but there is little structure within the classes, e.g., data with heterogeneous distributions except for means and covariances. The ultimate goal is to correctly classify an unlabeled test sample with a given feature. This supervised learning task is a cornerstone of modern machine learning, and its diverse applications are flourishing in promoting healthcare (Naraei et al., 2016; Tomar & Agarwal, 2013), accelerating technological progress (Rippl et al., 2016; Zhu et al., 2014), and improving societal outcomes (Bhagat et al., 2011; Bodendorf & Kaiser, 2009). Confronting the unstructured nature of the problem, it is natural to exercise a Bayesian approach which employs subjective belief and available information, and then determines an optimal classifying decision that minimizes a certain loss function integrated under the posterior distribution. Although a consensus on the selection of the loss function can be easily reached, the choice of a class prior and a class-conditional distribution (i.e., the likelihood given the class), two compulsory inputs to the Bayesian machinery to devise the posterior, is more difficult to agree upon due to conflicting beliefs among the parties involved and limited available data.

Robust Bayesian statistics, which explicitly aims to assemble a posterior inference model with multiple priors and/or multiple class-conditional distributions, emerges as a promising remedy to this longstanding problem. Existing research in this field mainly focuses on robust divergences in the general Bayesian inference framework. Walker (2013) identifies the behaviour of Bayesian updating in the context of model misspecification to show that the standard Bayesian updating method learns a model that minimizes the Kullback-Leibler (KL) divergence to the true data-generating model. To achieve robustness in Bayesian inference, existing works often target robust divergences, including maximum mean discrepancy, Rényi's alpha-divergences, Hellinger-based divergences, and the density power divergence (Chérief-Abdellatif & Alquier, 2019; Knoblauch et al., 2019; Bissiri et al., 2016; Jewson et al., 2018; Ghosh & Basu, 2016). Learning the learning rate in the general Bayesian inference framework is also gaining recent attention (Holmes & Walker, 2017; Knoblauch, 2019). Besides, Miller & Dunson (2019) use approximate Bayesian computation to obtain a 'coarsened' posterior to achieve robustness, and Grünwald (2012) proposes a safe Bayesian method.

Despite being an active research field, alleviating the impact of model uncertainty in the class-conditional distribution (i.e., the likelihood conditional on the class) using ideas from distributional robustness is left largely unexplored, even though this uncertainty arises naturally for numerous reasons. Even if we assume a proper parametric family, the plug-in estimator still carries statistical error from finite sampling and rarely matches the true distribution. The uncertainty is amplified when one relaxes to the nonparametric setting where no hardwired likelihood specification remains valid, and we are not aware of any guidance on a reasonable choice of a likelihood in this case. The situation deteriorates further when the training data violates the independence or identical-distribution assumptions, or when the test distribution differs from the training distribution as in the setting of covariate shift (Gretton et al., 2009; Bickel et al., 2009; Moreno-Torres et al., 2012).
We endeavor in this paper to provide a precise mathematical model for binary classification with uncertain likelihood under the Bayesian decision analysis framework. Consider a binary classification setting where Y ∈ {0, 1} represents the random class label and X ∈ R^d represents the random features. With a new observation x to be classified, we consider the problem of finding an optimal action a ∈ {0, 1},

    a = 0 if we classify x in class 0,    a = 1 if we classify x in class 1,

to minimize the probability of misclassification by solving the optimization problem

    min_{a ∈ {0,1}}  a P(Y = 0 | X = x) + (1 − a) P(Y = 1 | X = x),

where P(Y | X = x) denotes the posterior probability. If a class-proportion prior π and the class-conditional (parametric) densities f_0 and f_1 are known, then this posterior probability can be calculated using Bayes' theorem (Schervish, 1995, Theorem 1.31). Unfortunately, we rarely have access to the true conditional densities in real life.

To tackle this problem in the data-driven setting, for any class c ∈ {0, 1}, the decision maker first forms, to the best of its belief and based on the available data, a nominal class-conditional distribution P̂_c. We assume now that the true class-conditional distribution belongs to an ambiguity set B_{ρ_c}(P̂_c), defined as a ball, prescribed via an appropriate measure of dissimilarity, of radius ρ_c ≥ 0 centered at the nominal distribution P̂_c in the space of class-conditional probability measures. Besides, we allow the class-conditional distributions to be constrained to lie in a subspace P of probability measures to facilitate the injection of optional parametric information, should the need arise.

To avoid any unnecessary measure-theoretic complications, we position ourselves temporarily in the parametric setting and assume that we can generically write B_{ρ_c}(P̂_c) ∩ P parametrically as

    B_{ρ_c}(P̂_c) ∩ P = { f_c(· | θ_c) : θ_c ∈ Θ_c }    ∀c ∈ {0, 1},

where the Θ_c are non-empty (sub)sets of the finite-dimensional parameter space Θ, and the Θ_c satisfy the additional regularity condition that the density evaluated at the point x is strictly positive, i.e., f_c(x | θ_c) > 0 for all θ_c ∈ Θ_c. Notice that the parametric subspace of probability distributions P is now explicitly described through the set of admissible parameters Θ. If we denote the prior proportions by π_0 = π(Y = 0) > 0 and π_1 = π(Y = 1) > 0, then the ambiguity set over the posterior distributions induced by the class-conditional ambiguity sets B_0 and B_1 can be written as

    B = { P : P(Y = c | X = x) = f_c(x) π_c / ∑_{c' ∈ {0,1}} f_{c'}(x) π_{c'}  ∀c,  for some f_0 ∈ B_{ρ_0}(P̂_0) ∩ P and f_1 ∈ B_{ρ_1}(P̂_1) ∩ P },

where the constraint in the set B links the class-conditional densities f_c and the prior distribution of the class proportions π to the posterior distribution. Faced with the uncertainty in the posterior distributions, it is reasonable to consider now the distributionally robust problem

    min_{a ∈ {0,1}}  sup_{P ∈ B}  a P(Y = 0 | X = x) + (1 − a) P(Y = 1 | X = x),        (1)

where the action a is chosen so as to minimize the worst-case misclassification probability over all posterior distributions P ∈ B. The next proposition asserts that the optimal action a* belongs to the class of ratio decision rules.

Proposition 1.1 (Optimal action). The optimal action that minimizes the worst-case misclassification probability (1) has the form

    a* = 1 if  sup_{f_1 ∈ B_{ρ_1}(P̂_1) ∩ P} f_1(x) / sup_{f_0 ∈ B_{ρ_0}(P̂_0) ∩ P} f_0(x) ≥ τ(x),    and    a* = 0 otherwise,

for some threshold τ > 0 that is dependent on x.

Motivated by this insight from the parametric setting, we now promote the following classification decision rule

    C(x) = 1 if R(x) ≥ τ(x),    and    C(x) = 0 otherwise,

where R(x) is the ratio defined as

    R(x) ≜ sup_{Q ∈ B_{ρ_1}(P̂_1) ∩ P} ℓ(x, Q) / sup_{Q ∈ B_{ρ_0}(P̂_0) ∩ P} ℓ(x, Q),

and τ(x) > 0 is a positive threshold which is potentially dependent on the observation x. The score function ℓ(x, Q) quantifies the plausibility of observing x under the probability measure Q, and the value R(x) quantifies how plausibly an observation x can be generated by any class-conditional probability distribution in B_{ρ_1}(P̂_1) ∩ P relative to any distribution in B_{ρ_0}(P̂_0) ∩ P. Because both the numerator and the denominator search for the distribution in the respective ambiguity set that maximizes the score of observing x, R(x) is termed the ratio of optimistic scores, and the classification decision C is hence called the optimistic score ratio classifier.
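For concreteness, the decision rule C(x) can be wired up in a few lines of Python once the two optimistic score oracles are available. The sketch below is illustrative only (the name osr_classify, the callables score0 and score1, and the constant threshold are ours, not part of the paper's released code); the optimistic scores themselves are computed later via Theorems 3.1 and 4.1.

```python
import numpy as np

def osr_classify(x, score1, score0, tau=1.0):
    """Optimistic score ratio decision rule C(x).

    score1, score0: callables returning the optimistic scores
        sup_{Q in B_rho1(P1)} l(x, Q) and sup_{Q in B_rho0(P0)} l(x, Q).
    tau: the threshold tau(x), taken constant here for simplicity.
    """
    R = score1(x) / score0(x)          # optimistic score ratio R(x)
    return 1 if R >= tau else 0
```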
The classifying decision C(x) necessitates the solution of two optimistic score evaluation problems of the form

    sup_{Q ∈ B_ρ(P̂) ∩ P}  ℓ(x, Q),        (2)

where the dependence of the input parameters on the label c ∈ {0, 1} has been omitted to avoid clutter. The performance of C depends critically on the specific choice of ℓ and B_ρ(P̂). Typically, ℓ is subjectively tailored to the choice of a parametric or a nonparametric view on the conditional distribution, as we shall see later in this paper. The construction of B_ρ(P̂) is principally governed by the choice of the dissimilarity measure that specifies the ρ-neighborhood of the nominal distribution P̂. Ideally, B_ρ(P̂) should allow a coherent transition between the parametric and nonparametric settings via its interaction with the set P. Furthermore, it should render problem (2) computationally tractable with a meaningful optimal value, and at the same time provide the flexibility to balance between exerting statistical guarantees and modelling domain adaptation. These stringent criteria preclude the utilization of popular dissimilarity measures in the emerging literature. Indeed, the likelihood problem using the f-divergence (Ben-Tal et al., 2013; Namkoong & Duchi, 2016) delivers unreasonable estimates in the nonparametric setting (Nguyen et al., 2019a, Section 2), the Wasserstein distance (Mohajerin Esfahani & Kuhn, 2018; Kuhn et al., 2019; Blanchet et al., 2019; Gao & Kleywegt, 2016; Zhao & Guan, 2018) typically renders the Gaussian parametric likelihood problem non-convex, and the maximum mean discrepancy (Iyer et al., 2014; Staib & Jegelka, 2019) usually results in an infinite-dimensional optimization problem which is challenging to solve. This fact prompts us to explore an alternative construction of B_ρ(P̂) that meets the criteria mentioned above.

The contributions of this paper are summarized as follows.

• We introduce a novel ambiguity set based on a divergence defined on the space of mean vectors and covariance matrices. We show that this divergence manifests numerous favorable properties and that evaluating the optimistic score is equivalent to solving a non-convex optimization problem. We prove an asymptotic statistical guarantee for the divergence, which directs an optimal calibration of the size of the ambiguity set.

• We show that, despite its inherent non-convexity and hence intractability, the optimistic score evaluation problem can be efficiently solved in both the nonparametric and the parametric Gaussian settings. We reveal that the optimistic score ratio classifier generalizes the Mahalanobis distance classifier and linear/quadratic discriminant analysis.

Because evaluating the plausibility of an observation x is a fundamental problem in statistics, the results of this paper have far-reaching implications beyond the scope of the classification task. These include Bayesian inference using synthetic likelihood (Wood, 2010; Price et al., 2018), approximate Bayesian computation (Csilléry et al., 2010; Toni et al., 2009), variational Bayes inference (Blei et al., 2017; Ong et al., 2018), and composite hypothesis testing using the likelihood ratio (Cox, 1961; 2013). These connections will be explored in future research.

All proofs are relegated to the appendix.

Notations. We let M be the set of probability measures supported on R^d with finite second moment. The set of (symmetric) positive definite matrices is denoted by S^d_{++}. For any Q ∈ M, µ ∈ R^d and Σ ∈ S^d_{++}, we use Q ∼ (µ, Σ) to express that Q has mean vector µ and covariance matrix Σ. The d-dimensional identity matrix is denoted by I_d. The space of Gaussian distributions is denoted by N, and N(µ, Σ) denotes a Gaussian distribution with mean µ and covariance matrix Σ. The trace and determinant operators are denoted by Tr[A] and det(A), respectively.

2. Moment-based Divergence Ambiguity Set

We specifically study the construction of the ambiguity set using the following divergence on the space of moments.

Definition 2.1 (Moment-based divergence). For any vectors µ_1, µ_2 ∈ R^d and matrices Σ_1, Σ_2 ∈ S^d_{++}, the divergence from the tuple (µ_1, Σ_1) to the tuple (µ_2, Σ_2) amounts to

    D((µ_1, Σ_1) ∥ (µ_2, Σ_2)) ≜ (µ_2 − µ_1)^⊤ Σ_2^{-1} (µ_2 − µ_1) + Tr[Σ_1 Σ_2^{-1}] − log det(Σ_1 Σ_2^{-1}) − d.

To avoid any confusion, it is worth noting that, contrary to the usual utilization of the term 'divergence' to specify a dissimilarity measure on the probability space, in this paper the divergence is defined on the finite-dimensional space of mean vectors and covariance matrices.
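As an illustrative aside (not part of the implementation released with the paper; the function name moment_divergence is ours), Definition 2.1 can be transcribed directly into a few lines of Python:

```python
import numpy as np

def moment_divergence(mu1, Sigma1, mu2, Sigma2):
    """Moment-based divergence D((mu1, Sigma1) || (mu2, Sigma2)) of Definition 2.1."""
    d = mu1.shape[0]
    diff = mu2 - mu1
    Sigma2_inv = np.linalg.inv(Sigma2)
    maha = diff @ Sigma2_inv @ diff            # Mahalanobis term weighted by Sigma2
    M = Sigma1 @ Sigma2_inv
    logdet = np.linalg.slogdet(M)[1]           # log det(Sigma1 Sigma2^{-1}), positive determinant
    return maha + np.trace(M) - logdet - d
```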
It is straightforward to show that D is a divergence on R^d × S^d_{++} by noticing that D is the sum of the log-determinant divergence (Chebbi & Moakher, 2012) from Σ_1 to Σ_2 and a non-negative Mahalanobis distance between µ_1 and µ_2 weighted by Σ_2. As a consequence, D is non-negative, and vanishes to 0 if and only if Σ_1 = Σ_2 and µ_1 = µ_2. With this property, D is an attractive candidate for the divergence on the joint space of mean vectors and covariance matrices of d-dimensional random vectors. One can additionally verify that D is affine-invariant in the following sense. Let ξ be a d-dimensional random vector and ζ be the affine transformation of ξ, that is, ζ = Aξ + b for an invertible matrix A and a vector b of matching dimensions; then the value of the divergence D is preserved between the spaces of moments of ξ and ζ. In fact, if ξ is a random vector with mean vector µ_j ∈ R^d and covariance matrix Σ_j ∈ S^d_{++}, then ζ has mean Aµ_j + b and covariance matrix AΣ_j A^⊤ for j ∈ {1, 2}, and we have

    D((µ_1, Σ_1) ∥ (µ_2, Σ_2)) = D((Aµ_1 + b, AΣ_1 A^⊤) ∥ (Aµ_2 + b, AΣ_2 A^⊤)).        (3)

A direct consequence is that D is also scale-invariant. Furthermore, the divergence D is closely related to the KL divergence, or the relative entropy, between two non-degenerate Gaussian distributions as

    D((µ_1, Σ_1) ∥ (µ_2, Σ_2)) = 2 KL(N(µ_1, Σ_1) ∥ N(µ_2, Σ_2)).

(If Q_1 is absolutely continuous with respect to Q_2, then the Kullback-Leibler divergence from Q_1 to Q_2 amounts to KL(Q_1 ∥ Q_2) ≜ E_{Q_1}[log dQ_1/dQ_2], where dQ_1/dQ_2 is the Radon-Nikodym derivative of Q_1 with respect to Q_2.) However, we emphasize that D is not symmetric, and in general D((µ_1, Σ_1) ∥ (µ_2, Σ_2)) ≠ D((µ_2, Σ_2) ∥ (µ_1, Σ_1)). Hence, D is not a distance on R^d × S^d_{++}.

For any vector µ̂ ∈ R^d, invertible matrix Σ̂ ∈ S^d_{++} and radius ρ ∈ R_+, we define the uncertainty set U_ρ(µ̂, Σ̂) over the mean vector and covariance matrix space as

    U_ρ(µ̂, Σ̂) ≜ {(µ, Σ) ∈ R^d × S^d_{++} : D((µ̂, Σ̂) ∥ (µ, Σ)) ≤ ρ}.        (4)

By definition, U_ρ(µ̂, Σ̂) includes all tuples (µ, Σ) whose divergence from the tuple (µ̂, Σ̂) is not bigger than ρ. Because D is not symmetric, it is important to note that U_ρ(µ̂, Σ̂) is defined with the tuple (µ̂, Σ̂) being the first argument of the divergence D, and this uncertainty set can be written in the more expressive form

    U_ρ(µ̂, Σ̂) = {(µ, Σ) ∈ R^d × S^d_{++} : (µ − µ̂)^⊤ Σ^{-1} (µ − µ̂) + Tr[Σ̂ Σ^{-1}] + log det Σ ≤ ρ̄}

for a scalar ρ̄ ≜ ρ + d + log det Σ̂. Moreover, one can assert that U_ρ(µ̂, Σ̂) is non-convex due to the log-determinant term, and this non-convexity cannot be eliminated by reparametrizing to the space of inverse covariance matrices (equivalently called precision matrices).

Equipped with U_ρ(µ̂, Σ̂), the ambiguity set B_ρ(P̂) is systematically constructed as follows. If the nominal distribution P̂ admits a nominal mean vector µ̂ and a nominal nondegenerate covariance matrix Σ̂, then B_ρ(P̂) is a ball that contains all probability measures whose mean vector and covariance matrix are contained in U_ρ(µ̂, Σ̂), that is,

    B_ρ(P̂) ≜ {Q ∈ M : Q ∼ (µ, Σ), (µ, Σ) ∈ U_ρ(µ̂, Σ̂)}.        (5)

The set B_ρ(P̂), by construction, differentiates only through the information about the first two moments: if a distribution Q belongs to B_ρ(P̂), then any distribution Q' with the same mean vector and covariance matrix as Q also belongs to B_ρ(P̂). Further, B_ρ(P̂) embraces all types of probability distributions, including discrete, continuous and even mixed continuous/discrete distributions.

We now delineate a principled approach to solve the optimistic score evaluation problem (2) for a generic score function ℓ : R^d × M → R. We denote by M(µ, Σ) the Chebyshev ambiguity set that contains all probability measures with fixed mean vector µ ∈ R^d and fixed covariance matrix Σ ∈ S^d_{++}, that is,

    M(µ, Σ) ≜ {Q ∈ M : Q ∼ (µ, Σ)}.

The moment-based divergence ambiguity set B_ρ(P̂) then admits the equivalent representation

    B_ρ(P̂) = ∪_{(µ,Σ) ∈ U_ρ(µ̂,Σ̂)} M(µ, Σ),

which is an infinite union of Chebyshev ambiguity sets, where the union is taken over all tuples of mean vector and covariance matrix belonging to U_ρ(µ̂, Σ̂). Leveraging this representation, problem (2) can now be decomposed as a two-layer optimization problem

    sup_{Q ∈ B_ρ(P̂)} ℓ(x, Q) = sup_{(µ,Σ) ∈ U_ρ(µ̂,Σ̂)}  sup_{Q ∈ M(µ,Σ) ∩ P} ℓ(x, Q).        (6)

The inner subproblem of (6) is a distributionally robust optimization problem with a Chebyshev second-moment ambiguity set, hence there is a strong potential to exploit existing results from the literature, see Delage & Ye (2010) and Wiesemann et al. (2014), to reformulate this inner problem into a finite-dimensional convex optimization problem. Unfortunately, the outer subproblem of (6) is a robust optimization problem over a non-convex uncertainty set U_ρ(µ̂, Σ̂), thus the two-layer decomposition (6) remains computationally intractable in general. As a direct consequence, solving the optimistic score evaluation problem requires an intricate adaptation of non-convex optimization techniques applied on a case-by-case basis. Two exemplary settings in which problem (6) can be efficiently solved will be depicted subsequently in Sections 3 and 4.

We complete this section by providing the asymptotic statistical guarantees of the divergence D, which serve as a potential guideline for the construction of the ambiguity set B_ρ(P̂) and the tuning of the radius parameter ρ.

Theorem 2.2 (Asymptotic guarantee of D). Suppose that a d-dimensional random vector ξ has mean vector m ∈ R^d, covariance matrix S ∈ S^d_{++} and admits a finite fourth moment under a probability measure P.
Let ξ̂_t ∈ R^d, t = 1, ..., n be independent and identically distributed samples of ξ from P. Denote by µ̂_n ∈ R^d and Σ̂_n ∈ S^d_+ the sample mean vector and sample covariance matrix defined as

    µ̂_n = (1/n) ∑_{t=1}^n ξ̂_t,        Σ̂_n = (1/n) ∑_{t=1}^n (ξ̂_t − µ̂_n)(ξ̂_t − µ̂_n)^⊤.        (7)

Let η = S^{-1/2}(ξ − m) be the isotropic transformation of the random vector ξ, let H be a d-dimensional Gaussian random vector with mean vector 0 and covariance matrix I_d, and let Z be a d-by-d random symmetric matrix whose upper-triangle components Z_{jk} (j ≤ k) follow a Gaussian distribution with mean 0 and with the covariance coefficient between Z_{jk} and Z_{j'k'} given by

    cov(Z_{jk}, Z_{j'k'}) = E_P[η_j η_k η_{j'} η_{k'}] − E_P[η_j η_k] E_P[η_{j'} η_{k'}].

Furthermore, H and Z are jointly Gaussian distributed with the covariance between H_i and Z_{jk} given by

    cov(H_i, Z_{jk}) = E_P[η_i η_j η_k].

As n ↑ ∞, we have

    n × D((µ̂_n, Σ̂_n) ∥ (m, S))  →  H^⊤ H + (1/2) Tr[Z²]   in distribution.

We were not able to locate Theorem 2.2 in the existing literature. Interestingly, Theorem 2.2 also sheds light upon the asymptotic behavior of the KL divergence from an empirical Gaussian distribution N(µ̂_n, Σ̂_n) to the data-generating Gaussian distribution N(m, S).

Corollary 2.3 (Asymptotic guarantee of D – Gaussian distributions). Suppose that ξ̂_t ∈ R^d, t = 1, ..., n are independent and identically distributed samples of ξ from P = N(m, S) for some m ∈ R^d and S ∈ S^d_{++}. Let µ̂_n ∈ R^d and Σ̂_n ∈ S^d_+ be the sample mean vector and covariance matrix defined as in (7). As n ↑ ∞, we have

    n × KL(N(µ̂_n, Σ̂_n) ∥ N(m, S))  →  (1/2) χ²(d(d + 3)/2)   in distribution,

where χ²(d(d + 3)/2) is a chi-square distribution with d(d + 3)/2 degrees of freedom.

If we use independent and identically distributed (i.i.d.) samples to estimate the nominal mean vector and covariance matrix of P̂, then the radius ρ should be asymptotically scaled at the rate n^{-1} as the sample size n increases. Indeed, Theorem 2.2 and Corollary 2.3 suggest that n^{-1} is the optimal asymptotic rate which ensures that the true but unknown mean vector and covariance matrix of the data-generating distribution fall into the set U_ρ(µ̂, Σ̂) with high probability. While the limiting distribution under the Gaussian setting is a typical chi-square distribution, the general limiting distribution H^⊤ H + Tr[Z²]/2 in Theorem 2.2 does not have any analytical form. This limiting distribution can be numerically approximated, for example, via Monte Carlo simulation. If the i.i.d. assumption on the training samples is violated, or if we expect a covariate shift at test time, then the radius ρ reflects the modeler's belief regarding the moment mismatch measured using the divergence D, and in this case the radius ρ should be considered an exogenous input to the problem.

For illustrative purposes, we fix the dimension d = 20 and consider the random vector ξ = Cζ + m, where the entries of ζ are mutually independent and the i-th entry follows a normalized chi-square distribution, i.e., ζ_i ∼ (χ²(1) − 1)/√2. Then the covariance matrix of ξ is S = CC^⊤. Notice that by the identity (3), D((µ̂_n, Σ̂_n) ∥ (m, S)) is invariant to the choice of C and m. We generate 10,000 datasets, each containing n i.i.d. samples of ξ, and calculate for each dataset the empirical value of n × D((µ̂_n, Σ̂_n) ∥ (m, S)). We plot in Figure 1 the empirical distribution of n × D((µ̂_n, Σ̂_n) ∥ (m, S)) over the 10,000 datasets versus the limiting distribution of H^⊤ H + Tr[Z²]/2 for different values of n. One can observe that for a small sample size (n < 100), there is a perceivable difference between the finite-sample distribution and the limiting distribution, but as n becomes larger (n > 100), this mismatch is significantly reduced.

[Figure 1: panels (a) n = 30, (b) n = 100, (c) n = 300, (d) n = 1000. Empirical distribution of n × D((µ̂_n, Σ̂_n) ∥ (m, S)) collected from 10,000 datasets (orange histogram) versus the limiting distribution H^⊤ H + (1/2) Tr[Z²] obtained by Monte Carlo simulation (blue curve) for different sample sizes n.]
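To make the calibration rule concrete, the following Python sketch (our own illustration, not the script behind Figure 1; the function name calibrate_radius and the experiment parameters are hypothetical) sets ρ from the chi-square quantile implied by Corollary 2.3 under a Gaussian data-generating assumption and checks the resulting coverage on synthetic Gaussian data with true moments (m, S) = (0, I_d).

```python
import numpy as np
from scipy import stats

def calibrate_radius(n, d, alpha=0.05):
    """Radius suggested by Corollary 2.3 (with D = 2 KL for Gaussians):
    n * D((mu_n, Sigma_n) || (m, S)) -> chi2 with d(d+3)/2 degrees of freedom,
    so the (1 - alpha) quantile scaled by 1/n puts (m, S) inside
    U_rho(mu_n, Sigma_n) with asymptotic probability 1 - alpha."""
    return stats.chi2.ppf(1.0 - alpha, d * (d + 3) // 2) / n

# Hypothetical sanity check with Gaussian data, true (m, S) = (0, I_d).
rng = np.random.default_rng(0)
d, n, trials = 5, 300, 2000
rho, hits = calibrate_radius(n, d), 0
for _ in range(trials):
    xs = rng.standard_normal((n, d))
    mu_n = xs.mean(axis=0)
    Sigma_n = np.cov(xs, rowvar=False, bias=True)         # the 1/n estimator of (7)
    # D((mu_n, Sigma_n) || (0, I_d)) of Definition 2.1 with Sigma_2 = I_d
    div = mu_n @ mu_n + np.trace(Sigma_n) - np.linalg.slogdet(Sigma_n)[1] - d
    hits += div <= rho
print(hits / trials)                                       # should land near 0.95
```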
3. Optimistic Nonparametric Score

We consider in this section the nonparametric setting in which no prior assumption on the class-conditional distribution is imposed. A major difficulty in this nonparametric setting is the elicitation of a reasonable score function ℓ that can coherently encapsulate the plausibility of observing x over the whole spectrum of admissible Q, including continuous, discrete and mixed continuous/discrete distributions, while at the same time being amenable to optimization. Taking this fact into consideration, we posit the score function of the form

    ℓ(x, Q) ≡ Q({x}),

which is the probability of the singleton, measurable set {x} under the measure Q. If Q is a continuous distribution, then Q({x}) is obviously zero, hence this score function is admittedly not perfect. Nevertheless, it serves as a sensible proxy in the nonparametric setting and delivers competitive performance in machine learning tasks (Nguyen et al., 2019b). It is reasonable to set P ≡ M in the nonparametric setting, and with this choice of ℓ, the optimistic nonparametric score evaluation problem becomes

    sup_{Q ∈ B_ρ(P̂)}  Q({x}),

which is inherently challenging because it is an infinite-dimensional optimization problem. The next theorem asserts that solving the nonparametric optimistic likelihood problem is equivalent to solving a univariate convex optimization problem.

Theorem 3.1 (Optimistic nonparametric probability). Suppose that P̂ ∼ (µ̂, Σ̂) for some µ̂ ∈ R^d and Σ̂ ∈ S^d_{++}. For any ρ ∈ R_+, we have

    sup_{Q ∈ B_ρ(P̂)} Q({x}) = max_{(µ,Σ) ∈ U_ρ(µ̂,Σ̂)} [1 + (µ − x)^⊤ Σ^{-1} (µ − x)]^{-1}        (8a)
                             = [1 + (µ* − x)^⊤ (Σ*)^{-1} (µ* − x)]^{-1},        (8b)

where (µ*, Σ*) ∈ R^d × S^d_{++} satisfies

    µ* = (x + γ* µ̂)/(1 + γ*),        Σ* = Σ̂ + (1/(1 + γ*)) (x − µ̂)(x − µ̂)^⊤,        (8c)

and γ* ∈ R_+ solves the univariate convex optimization problem

    min_{γ ≥ 0}  γρ − γ log(1 + (x − µ̂)^⊤ Σ̂^{-1} (x − µ̂)/(1 + γ)).        (8d)

Because the feasible set B_ρ(P̂) is not weakly compact, the existence of an optimal measure that solves the optimistic likelihood problem on the left-hand side of (8a) is not trivial. However, equation (8a) asserts that this optimal measure exists, and it can be constructed by solving a non-convex optimization problem over the mean vector-covariance matrix tuple (µ, Σ). Notice that (8a) is a non-convex optimization problem because U_ρ(µ̂, Σ̂) is a non-convex set. Surprisingly, one can show that the optimizer of (8a) can be found semi-analytically: the maximizer (µ*, Σ*) depends only on a single scalar γ* through (8c), where γ* solves the univariate optimization problem (8d). Because problem (8d) is convex, γ* can be efficiently found using a bisection algorithm or a Newton-Raphson method, and we expose in Appendix E the first- and second-order derivatives of the objective function of (8d).
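The following Python sketch (ours, not the authors' released implementation; the function name and the gamma_max cap are hypothetical) evaluates the optimistic nonparametric score by minimizing (8d) with an off-the-shelf bounded scalar solver in place of the bisection/Newton-Raphson scheme, and then plugs γ* into (8c)-(8b). The ratio R_nonparam(x) in (9) below is simply the quotient of two such evaluations with class-specific nominal moments and radii.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def optimistic_nonparam_score(x, mu_hat, Sigma_hat, rho, gamma_max=1e6):
    """Optimistic nonparametric score sup_{Q in B_rho(P_hat)} Q({x}) via Theorem 3.1.

    Solves the univariate convex problem (8d) for gamma*, assembles (mu*, Sigma*)
    from (8c) and returns the score (8b). Assumes rho > 0.
    """
    diff = x - mu_hat
    s = diff @ np.linalg.solve(Sigma_hat, diff)       # (x - mu_hat)^T Sigma_hat^{-1} (x - mu_hat)

    def objective(gamma):                              # objective of (8d)
        return gamma * rho - gamma * np.log1p(s / (1.0 + gamma))

    gamma = minimize_scalar(objective, bounds=(0.0, gamma_max), method="bounded").x
    mu_star = (x + gamma * mu_hat) / (1.0 + gamma)     # (8c)
    Sigma_star = Sigma_hat + np.outer(diff, diff) / (1.0 + gamma)
    q = (mu_star - x) @ np.linalg.solve(Sigma_star, mu_star - x)
    return 1.0 / (1.0 + q)                             # (8b)
```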
A nonparametric classifier C_nonparam can be formed by utilizing the optimistic nonparametric score ratio

    R_nonparam(x) ≜ sup_{Q ∈ B_{ρ_1}(P̂_1)} Q({x}) / sup_{Q ∈ B_{ρ_0}(P̂_0)} Q({x}),        (9)

where each nominal class-conditional distribution P̂_c has mean vector µ̂_c ∈ R^d and covariance matrix Σ̂_c ∈ S^d_{++}, and each ambiguity set B_{ρ_c}(P̂_c) is defined as in (5). The results of Theorem 3.1 can be used to compute the numerator and denominator of (9), thus the classification decision C_nonparam(x) can be efficiently evaluated. In particular, by substituting the expression (8a) into (9), we also find

    R_nonparam(x) = max_{(µ,Σ) ∈ U_{ρ_1}(µ̂_1,Σ̂_1)} [1 + (µ − x)^⊤ Σ^{-1} (µ − x)]^{-1} / max_{(µ,Σ) ∈ U_{ρ_0}(µ̂_0,Σ̂_0)} [1 + (µ − x)^⊤ Σ^{-1} (µ − x)]^{-1}
                  = [1 + min_{(µ,Σ) ∈ U_{ρ_0}(µ̂_0,Σ̂_0)} (µ − x)^⊤ Σ^{-1} (µ − x)] / [1 + min_{(µ,Σ) ∈ U_{ρ_1}(µ̂_1,Σ̂_1)} (µ − x)^⊤ Σ^{-1} (µ − x)].

Suppose that ρ_0 = ρ_1 = 0 and τ(x) = 1; then the nonparametric classifier assigns C_nonparam(x) = 1 whenever

    (µ̂_1 − x)^⊤ Σ̂_1^{-1} (µ̂_1 − x) ≤ (µ̂_0 − x)^⊤ Σ̂_0^{-1} (µ̂_0 − x),

and C_nonparam(x) = 0 otherwise. In this case, the classifier coincides with the class-specific Mahalanobis distance classifier (MDC), where Σ̂_0 and Σ̂_1 denote the intra-class nominal covariance matrices. If in addition the nominal covariance matrices are homogeneous, that is, Σ̂_0 = Σ̂_1, then this classifier coincides with Linear Discriminant Analysis (LDA) (Murphy, 2012, Section 4.2.2). The Bayesian version of LDA can be equivalently obtained from C_nonparam by setting a proper value of τ(x). This important observation reveals an intimate link between our proposed classifier C_nonparam using the optimistic nonparametric score ratio and the popular classifiers MDC and LDA. On the one hand, C_nonparam can now be regarded as a generalization of MDC and LDA which takes into account the statistical imprecision of the estimated moments and/or a potential shift in the moment statistics of the test data relative to the training data distribution. On the other hand, both MDC and LDA now admit a nonparametric, generative interpretation in which the class-conditional distribution is chosen from the set of all distributions with the same first and second moments as the nominal class-conditional measure P̂_c. This novel interpretation goes beyond the classical Gaussian model, and it potentially explains the versatile performance of MDC and LDA when the conditional distributions are not normally distributed, as empirically observed in (Lee et al., 2018).
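To spell out the ρ_0 = ρ_1 = 0 special case, here is a small illustrative Python sketch (ours, with hypothetical names) that reduces C_nonparam to the Mahalanobis distance classifier; with a shared covariance matrix Σ̂_0 = Σ̂_1 it acts as the corresponding LDA rule.

```python
import numpy as np

def mahalanobis_classify(x, mu0, Sigma0, mu1, Sigma1):
    """C_nonparam with rho_0 = rho_1 = 0 and tau(x) = 1: assign class 1 iff the
    Mahalanobis distance of x to class 1 is no larger than that to class 0."""
    d1 = (x - mu1) @ np.linalg.solve(Sigma1, x - mu1)
    d0 = (x - mu0) @ np.linalg.solve(Sigma0, x - mu0)
    return 1 if d1 <= d0 else 0
```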
4. Optimistic Gaussian Score

We now consider the optimistic score evaluation problem under a parametric setting. For simplicity, we assume that the true class-conditional distributions of the feature belong to the family of Gaussian distributions. Thus, a natural choice of the score value ℓ(x, Q) in this case is the Gaussian likelihood of an observation x when Q is a Gaussian distribution with mean µ and covariance matrix Σ, that is,

    ℓ(x, Q) = 1/√((2π)^d det Σ) · exp(−(x − µ)^⊤ Σ^{-1} (x − µ)/2).

It is also suitable to set P in problem (2) to the (sub)space of Gaussian distributions N and consider the following optimistic Gaussian score evaluation problem

    sup_{Q ∈ B_ρ(P̂) ∩ N}  ℓ(x, Q).        (10)

One can verify that the maximizer of problem (10) coincides with the maximizer of

    sup_{Q ∈ B_ρ(P̂) ∩ N}  L(x, Q),

where L is the translated Gaussian log-likelihood defined as

    L(x, Q) = 2 (log ℓ(x, Q) + (d/2) log(2π)) = −(µ − x)^⊤ Σ^{-1} (µ − x) − log det Σ.

Theorem 4.1 is a counterpart to the optimistic nonparametric likelihood presented in Theorem 3.1.

Theorem 4.1 (Optimistic Gaussian log-likelihood). Suppose that P̂ ∼ N(µ̂, Σ̂) for some µ̂ ∈ R^d and Σ̂ ∈ S^d_{++}. For any ρ ∈ R_+, we have

    sup_{Q ∈ B_ρ(P̂) ∩ N} L(x, Q) = max_{(µ,Σ) ∈ U_ρ(µ̂,Σ̂)}  −(µ − x)^⊤ Σ^{-1} (µ − x) − log det Σ        (11a)
                                  = −(µ* − x)^⊤ (Σ*)^{-1} (µ* − x) − log det Σ*,        (11b)

where (µ*, Σ*) ∈ R^d × S^d_{++} satisfies

    µ* = (x + γ* µ̂)/(1 + γ*),        Σ* = (γ*/(1 + γ*)) Σ̂ + (γ*/(1 + γ*)²) (x − µ̂)(x − µ̂)^⊤,        (11c)

and γ* ∈ R_+ solves the univariate convex optimization problem

    min_{γ ≥ 0}  { γρ + d(γ + 1) log(1 + 1/γ) − (1 + γ) log(1 + (x − µ̂)^⊤ Σ̂^{-1} (x − µ̂)/(1 + γ)) }.        (11d)

Notice that we impose the condition P̂ ∼ N(µ̂, Σ̂) in Theorem 4.1 to conform with the belief that the true data-generating distribution is Gaussian. This condition, in fact, can be removed without affecting the result presented in Theorem 4.1. Indeed, for any radius ρ ≥ 0, the ambiguity set B_ρ(P̂) by definition contains a Gaussian distribution with the same mean vector and covariance matrix as the nominal distribution P̂, and thus the feasible set of (10) is always non-empty and the value of the optimistic Gaussian log-likelihood is always finite. In Appendix E, we provide the first- and second-order derivatives of the objective function of (11d), which can be exploited to derive an efficient algorithm to solve the convex program (11d).
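As with the nonparametric case, the Gaussian counterpart admits a short illustrative Python sketch (ours, not the released implementation; the function name and the bounded scalar search standing in for the Newton scheme of Appendix E are ours). The Gaussian score ratio R_N(x) of (12) below is then exp((L_1 − L_0)/2) for the two class-specific optimistic log-likelihoods.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def optimistic_gaussian_loglik(x, mu_hat, Sigma_hat, rho, gamma_max=1e6):
    """Optimistic Gaussian log-likelihood sup_{Q in B_rho(P_hat) ∩ N} L(x, Q)
    via Theorem 4.1: solve (11d) for gamma*, build (mu*, Sigma*) from (11c),
    and evaluate (11b). Assumes rho > 0."""
    d = x.shape[0]
    diff = x - mu_hat
    s = diff @ np.linalg.solve(Sigma_hat, diff)

    def objective(gamma):                               # objective of (11d)
        return (gamma * rho + d * (gamma + 1.0) * np.log1p(1.0 / gamma)
                - (1.0 + gamma) * np.log1p(s / (1.0 + gamma)))

    gamma = minimize_scalar(objective, bounds=(1e-12, gamma_max), method="bounded").x
    mu_star = (x + gamma * mu_hat) / (1.0 + gamma)      # (11c)
    Sigma_star = (gamma / (1.0 + gamma)) * Sigma_hat \
                 + (gamma / (1.0 + gamma) ** 2) * np.outer(diff, diff)
    q = (mu_star - x) @ np.linalg.solve(Sigma_star, mu_star - x)
    return -q - np.linalg.slogdet(Sigma_star)[1]        # (11b)
```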
Returning to the construction of the classifier, one can now construct the classifier C_N(x) using the optimistic Gaussian score ratio R_N(x) expressed by

    R_N(x) ≜ sup_{Q ∈ B_{ρ_1}(P̂_1) ∩ N} ℓ(x, Q) / sup_{Q ∈ B_{ρ_0}(P̂_0) ∩ N} ℓ(x, Q)
           = exp((1/2) sup_{Q ∈ B_{ρ_1}(P̂_1) ∩ N} L(x, Q)) / exp((1/2) sup_{Q ∈ B_{ρ_0}(P̂_0) ∩ N} L(x, Q)),        (12)

where each nominal distribution P̂_c is a Gaussian distribution with mean vector µ̂_c ∈ R^d and covariance matrix Σ̂_c ∈ S^d_{++}, and each ambiguity set B_{ρ_c}(P̂_c) is defined as in (5).
Theorem 4.1 can be readily applied to evaluate the value R_N(x) and to classify x using C_N(x). Furthermore, suppose that ρ_0 = ρ_1 = 0 and τ(x) = 1; then the resulting classifier recovers Quadratic Discriminant Analysis (Murphy, 2012, Section 4.2.1). The Bayesian version of QDA can be equivalently obtained from C_N by setting a proper value for τ(x).

It is instructive to elaborate on the improvement of Theorem 4.1 over the result reported in Nguyen et al. (2019a, Section 3). While both results concern the evaluation of the optimistic Gaussian log-likelihood, Nguyen et al. (2019a, Theorem 3.2) restricts the mean vector to its nominal value and optimizes only over the covariance matrix. Theorem 4.1 of this paper, in contrast, optimizes over both the mean vector and the covariance matrix, and thus provides full flexibility to choose the optimal values of all sufficient statistics of the family of Gaussian distributions. From a technical standpoint, the non-convexity in Nguyen et al. (2019a, Theorem 3.2) is overcome through a simple change of variables, whereas the proof of Theorem 4.1 demands an additional layer of duality arguments to disentangle the multiplicative terms between µ and Σ in both the objective function L and the divergence D. By inspecting the expressions in (11c), one can further notice that the optimal solution µ* is in general distinct from the nominal mean µ̂; this observation suggests that, from a theoretical perspective, optimizing jointly over (µ, Σ) is indeed more powerful than optimizing over Σ alone.
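To make the degenerate case above concrete, the following minimal sketch (in Python with NumPy; the function names are ours and are not taken from the paper's released code) evaluates the score ratio (12) when ρ_0 = ρ_1 = 0. In this case both suprema are attained at the nominal Gaussians, and with τ(x) = 1 the rule reduces to QDA as noted above.

```python
import numpy as np

def translated_gaussian_loglik(x, mu, Sigma):
    """L(x, N(mu, Sigma)) = -(mu - x)' Sigma^{-1} (mu - x) - log det Sigma."""
    diff = mu - x
    _, logdet = np.linalg.slogdet(Sigma)
    return -diff @ np.linalg.solve(Sigma, diff) - logdet

def nominal_score_ratio(x, mu0, Sigma0, mu1, Sigma1):
    """Score ratio (12) with rho_0 = rho_1 = 0: both suprema collapse to the
    nominal Gaussians, so R(x) = exp((L_1 - L_0) / 2)."""
    L1 = translated_gaussian_loglik(x, mu1, Sigma1)
    L0 = translated_gaussian_loglik(x, mu0, Sigma0)
    return np.exp(0.5 * (L1 - L0))

def classify(x, mu0, Sigma0, mu1, Sigma1, tau=1.0):
    """Assign class 1 whenever R(x) >= tau; tau = 1 recovers QDA."""
    return int(nominal_score_ratio(x, mu0, Sigma0, mu1, Sigma1) >= tau)
```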
5. Numerical Results

All experiments are run on a standard laptop with a 1.4 GHz Intel Core i5 processor and 8 GB of memory; the code and datasets are available at https://github.com/nian-si/bsc.

5.1. Decision Boundaries

In this section, we visualize the classification decision boundaries generated by the classifiers C_nonparam proposed in Section 3 and C_N proposed in Section 4 using synthetic data. To ease the exposition, we consider a two-dimensional feature space (d = 2) and class-conditional distributions that are Gaussian of the form
$$X \mid Y=0 \sim \mathcal{N}\!\left(\begin{bmatrix}0\\1\end{bmatrix},\, I_d\right), \qquad X \mid Y=1 \sim \mathcal{N}\!\left(\begin{bmatrix}1\\0\end{bmatrix},\, \begin{bmatrix}1 & 0.5\\0.5 & 1\end{bmatrix}\right).$$
We sample i.i.d. data points {x̂_{c,i}}_{i=1}^{n_c} for each class c ∈ {0, 1} with n_0 = n_1 = 1000 as the training set, and then estimate the nominal mean µ̂_c and the nominal covariance matrix Σ̂_c for each class c ∈ {0, 1} using the sample average formula (7).
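As a concrete illustration of this setup, the short sketch below (Python with NumPy; the seed and variable names are ours, and we read formula (7) as the empirical mean and covariance) draws the two training samples and forms the nominal moments.

```python
import numpy as np

rng = np.random.default_rng(0)
n0 = n1 = 1000

# Class-conditional distributions of the synthetic experiment.
X0 = rng.multivariate_normal(mean=[0.0, 1.0], cov=np.eye(2), size=n0)
X1 = rng.multivariate_normal(mean=[1.0, 0.0],
                             cov=[[1.0, 0.5], [0.5, 1.0]], size=n1)

# Nominal moments per class (empirical mean and covariance;
# np.cov uses ddof=1 by default, which may differ from (7) by a factor n/(n-1)).
mu_hat = {c: X.mean(axis=0) for c, X in {0: X0, 1: X1}.items()}
Sigma_hat = {c: np.cov(X, rowvar=False) for c, X in {0: X0, 1: X1}.items()}
```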
We first consider the case in which the ambiguity sets share the same radius, i.e., ρ_0 = ρ_1 = ρ̂, and we fix the threshold τ(x) = 1 for every x. Figure 2 shows the optimistic Gaussian and nonparametric decision boundaries for ρ̂ ∈ {0.5, 0.7}. We find that for the optimistic Gaussian decision rule, the decision boundaries look similar across different radii, while in the nonparametric case, the decision boundaries exhibit different shapes.

We then consider the case with distinct radii by setting (ρ_0, ρ_1) = (0.1, 1.0) and (ρ_0, ρ_1) = (1.0, 0.1). Further, we fix the threshold to a constant τ(x) = τ* for a scalar τ* ∈ R_+ that solves
$$\max_{\tau \ge 0} \; \sum_{i=1}^{n_0} \mathbb{1}\{R(\widehat{x}_{0,i}) < \tau\} + \sum_{i=1}^{n_1} \mathbb{1}\{R(\widehat{x}_{1,i}) \ge \tau\}, \qquad (13)$$
where 1{·} is the indicator function. The decision boundaries are plotted in Figure 3. We find that the decision boundaries have different shapes for different decision rules and for different choices of radii.
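Problem (13) can be solved exactly by a one-dimensional scan: the objective is piecewise constant in τ and changes only at the observed ratio values. A minimal sketch is given below (Python with NumPy; the helper name and its arguments are ours, and the ratio values are assumed to have been computed already).

```python
import numpy as np

def choose_threshold(r0, r1):
    """Solve (13): maximize  sum_i 1{r0_i < tau} + sum_i 1{r1_i >= tau}  over tau >= 0.
    r0, r1: ratio values R(x) evaluated on the class-0 and class-1 training points.
    The objective is piecewise constant, so scanning the observed values
    (plus the two extremes) is sufficient."""
    r0, r1 = np.asarray(r0), np.asarray(r1)
    candidates = np.concatenate((
        [0.0],
        np.unique(np.concatenate((r0, r1))),
        [max(r0.max(), r1.max()) + 1.0],
    ))
    scores = [(r0 < t).sum() + (r1 >= t).sum() for t in candidates]
    return candidates[int(np.argmax(scores))]
```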
[Figure 2: (a) Gaussian, ρ̂ = 0.5; (b) Nonparametric, ρ̂ = 0.5; (c) Gaussian, ρ̂ = 0.7; (d) Nonparametric, ρ̂ = 0.7.]

Figure 2. Decision boundaries for different ρ̂. Red/blue regions indicate the class partitions, black dots locate the means, and white dashed ellipsoids draw the class-conditional density contours.

5.2. Real Data Experiments

In our experiments, we first compute the nominal mean by the empirical average, and we use the Ledoit-Wolf covariance estimator (Ledoit & Wolf, 2004) to compute a well-conditioned nominal covariance matrix. We experiment with two methods for tuning the radii ρ of the ambiguity sets: cross-validation on the training data, or the quantile of the limiting distribution in Corollary 2.3. Specifically, for the second criterion, we choose
$$\rho_c = n_c^{-1}\, \chi^2_{\alpha}\big(d(d+3)/2\big) \qquad \forall c \in \{0, 1\},$$
where n_c is the number of training samples in class c and χ²_α(d(d+3)/2) is the α-quantile of the chi-square distribution with d(d+3)/2 degrees of freedom. Notice that for large degrees of freedom, the chi-square distribution concentrates around its mean, because a chi-square random variable with k degrees of freedom is the sum of k i.i.d. χ²(1) random variables. The optimal asymptotic value of the radius ρ_c is therefore insensitive to the choice of α, so we numerically set α = 0.5 in our experiments. We tune the threshold to maximize the training accuracy following (13) after computing the ratio value for each training sample. The whole procedure is summarized in Algorithm 1. In particular, this algorithm trains the parameters using only one pass over the training samples, which makes it significantly faster than the cross-validation approach. We observe empirically that, in most cases, the performance of classifying using Algorithm 1 is comparable in terms of test accuracy to classifying with cross-validation of the tuning parameters.
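As a sketch of this radius selection (Python with SciPy; the sample size and dimension below are illustrative and not taken from the experiments), one can compute ρ_c from the chi-square quantile and check how mildly it varies with α:

```python
from scipy.stats import chi2

def clt_radius(n_c, d, alpha=0.5):
    """rho_c = (1/n_c) * chi-square quantile at level alpha with d(d+3)/2 degrees of freedom."""
    return chi2.ppf(alpha, df=d * (d + 3) / 2) / n_c

# The relative spread of the quantile across alpha shrinks as the
# degrees of freedom d(d+3)/2 grow, so the radius barely depends on alpha.
for alpha in (0.25, 0.50, 0.75):
    print(alpha, clt_radius(n_c=500, d=20, alpha=alpha))
```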
Table 1. Correct classification rates on the benchmark data sets. Bold numbers correspond to the best performance in each dataset.

    DATASET                           GQDA, CV          GQDA, CLT           NPQDA, CV     NPQDA, CLT        KQDA      RQDA      SQDA
    AUSTRALIAN                          85.38                84.91            85.84           85.03         85.38      85.61     85.37
    BANKNOTE                            99.83                99.33             99.3           99.83          99.8      99.77     99.83
    CLIMATE MODEL                       94.81                89.33            93.92           90.37         92.59      94.07     93.92
    CYLINDER                            71.04                70.81            71.11           70.89         71.11      71.11     70.67
    DIABETIC                            75.52                73.49            76.09           76.30         75.47      75.00     74.32
    FOURCLASS                           77.82                79.26            80.28           78.98         78.38      79.07     78.84
    HABERMAN                            74.93                75.33            75.45           75.45         74.94      74.80     74.16
    HEART                               81.76                83.09            83.09           81.91         81.76      81.91     83.68
    HOUSING                             91.66                90.55            91.81           91.50         91.66      91.66     91.97
    ILPD                                68.84                69.52            69.25           68.15         69.18      67.94     69.52
    MAMMOGRAPHIC MASS                   80.39                80.00            79.90           79.61         79.95      80.05     80.24
Algorithm 1 Optimistic score ratio classification
 1: Input: datasets {x̂_{c,i}}_{i=1}^{n_c} for c ∈ {0, 1}; a test point x.
 2: Compute the nominal mean µ̂_c and the nominal covariance matrix Σ̂_c for each class.
 3: Compute the radii ρ_c ← n_c^{-1} χ²_{0.5}(d(d+3)/2).
 4: Compute the optimistic ratio R(x̂_{c,i}) for every training sample x̂_{c,i}.
 5: Compute the threshold τ* that solves (13).
 6: Output: classification label 1{R(x) ≥ τ*}.
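The following is a minimal end-to-end sketch of Algorithm 1 (Python with NumPy, SciPy, and scikit-learn). The callable `optimistic_ratio` is a placeholder for the evaluation of the optimistic score ratio via the optimization problems of Sections 3-4, which are not reproduced here; all helper names are ours rather than those of the released code.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import LedoitWolf

def fit_optimistic_classifier(X0, X1, optimistic_ratio, alpha=0.5):
    """One-pass training following Algorithm 1.
    X0, X1: (n_c x d) training samples of classes 0 and 1.
    optimistic_ratio: callable (x, moments, radii) -> R(x), a placeholder for
    the optimistic score ratio of Sections 3-4."""
    d = X0.shape[1]
    # Step 2: nominal mean and well-conditioned nominal covariance per class.
    moments = {}
    for c, X in {0: X0, 1: X1}.items():
        lw = LedoitWolf().fit(X)
        moments[c] = (X.mean(axis=0), lw.covariance_)
    # Step 3: CLT-based radii with the fixed quantile level alpha.
    radii = {c: chi2.ppf(alpha, df=d * (d + 3) / 2) / len(X)
             for c, X in {0: X0, 1: X1}.items()}
    # Step 4: ratios on the training points.
    r0 = np.array([optimistic_ratio(x, moments, radii) for x in X0])
    r1 = np.array([optimistic_ratio(x, moments, radii) for x in X1])
    # Step 5: threshold from (13), scanning the observed ratio values.
    candidates = np.unique(np.concatenate(([0.0], r0, r1)))
    scores = [(r0 < t).sum() + (r1 >= t).sum() for t in candidates]
    tau = candidates[int(np.argmax(scores))]
    # Step 6: the trained classifier outputs 1{R(x) >= tau*}.
    return lambda x: int(optimistic_ratio(x, moments, radii) >= tau)
```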
[Figure 3: (a) Gaussian, (ρ_0, ρ_1) = (0.1, 1.0); (b) Nonparametric, (ρ_0, ρ_1) = (0.1, 1.0); (c) Gaussian, (ρ_0, ρ_1) = (1.0, 0.1); (d) Nonparametric, (ρ_0, ρ_1) = (1.0, 0.1).]

Figure 3. Decision boundaries with distinct radii. The legend is as in Figure 2.

We test the performance of our classification rules on various datasets from the UCI repository (Dua & Graff, 2017). Specifically, we compare the following methods:

• Gaussian QDA (GQDA) and Nonparametric QDA (NPQDA): Our classifiers C_N and C_nonparam;

• Kullback-Leibler QDA (KQDA): The classifier based on KL ambiguity sets with a fixed mean (Nguyen et al., 2019a);

• Regularized QDA (RQDA): The regularized QDA based on the linear shrinkage covariance estimator Σ̂_c + ρ_c I_d;

• Sparse QDA (SQDA): The sparse QDA based on the graphical lasso covariance estimator (Friedman et al., 2008) with parameter ρ_c.

For GQDA and NPQDA, we also compare the performance of different strategies for choosing the radii ρ: cross-validation (CV) and selection based on Theorem 2.2 (CLT). For all methods that need cross-validation, we randomly select 75% of the data for training and the remaining 25% for testing. The sizes of the ambiguity sets and the regularization parameters are selected using stratified 5-fold cross-validation. Furthermore, to promote a fair comparison, we tune the threshold for every method using (13). The performance of the classifiers is measured by the average correct classification rate (CCR) on the validation set. The average CCR scores over 10 trials are reported in Table 1.
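For completeness, the two baseline covariance estimates can be formed as in the following sketch (Python with NumPy and scikit-learn; the function names are ours, and we read the RQDA estimator literally as the sample covariance plus ρ_c I_d and the SQDA estimator as a graphical-lasso fit with penalty ρ_c).

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

def rqda_covariance(X, rho):
    """RQDA baseline: linear shrinkage Sigma_hat + rho * I."""
    d = X.shape[1]
    return np.cov(X, rowvar=False) + rho * np.eye(d)

def sqda_covariance(X, rho):
    """SQDA baseline: graphical-lasso covariance estimate with penalty rho."""
    return GraphicalLasso(alpha=rho).fit(X).covariance_
```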
Acknowledgements

We gratefully acknowledge support from NSF grants 1915967, 1820942, and 1838676, as well as from the China Merchants Bank.

References

Ben-Tal, A., den Hertog, D., De Waegenaere, A., Melenberg, B., and Rennen, G. Robust solutions of optimization problems affected by uncertain probabilities. Management Science, 59(2):341–357, 2013.

Bhagat, S., Cormode, G., and Muthukrishnan, S. Node classification in social networks. In Social Network Data Analytics, pp. 115–148. Springer, 2011.

Bickel, S., Brückner, M., and Scheffer, T. Discriminative learning under covariate shift. Journal of Machine Learning Research, 10(Sep):2137–2155, 2009.

Bissiri, P. G., Holmes, C. C., and Walker, S. G. A general framework for updating belief distributions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5):1103–1130, 2016.

Blanchet, J., Kang, Y., and Murthy, K. Robust Wasserstein profile inference and applications to machine learning. Journal of Applied Probability, 56(3):830–857, 2019.

Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.

Bodendorf, F. and Kaiser, C. Detecting opinion leaders and trends in online social networks. In Proceedings of the 2nd ACM Workshop on Social Web Search and Mining, pp. 65–68, 2009.

Chebbi, Z. and Moakher, M. Means of Hermitian positive-definite matrices based on the log-determinant α-divergence function. Linear Algebra and its Applications, 436(7):1872–1889, 2012.

Chérief-Abdellatif, B.-E. and Alquier, P. MMD-Bayes: Robust Bayesian estimation via maximum mean discrepancy. arXiv preprint arXiv:1909.13339, 2019.

Cox, D. R. Tests of separate families of hypotheses. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, pp. 105–123, 1961.

Cox, D. R. A return to an old paper: ‘Tests of separate families of hypotheses’. Journal of the Royal Statistical Society: Series B, 75(2):207–215, 2013.

Csilléry, K., Blum, M. G., Gaggiotti, O. E., and François, O. Approximate Bayesian Computation (ABC) in practice. Trends in Ecology & Evolution, 25(7):410–418, 2010.

Delage, E. and Ye, Y. Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations Research, 58(3):595–612, 2010.

Dua, D. and Graff, C. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

Friedman, J., Hastie, T., and Tibshirani, R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.

Gao, R. and Kleywegt, A. J. Distributionally robust stochastic optimization with Wasserstein distance. arXiv preprint arXiv:1604.02199, 2016.

Ghosh, A. and Basu, A. Robust Bayes estimation using the density power divergence. Annals of the Institute of Statistical Mathematics, 68(2):413–437, 2016.

Gretton, A., Smola, A., Huang, J., Schmittfull, M., Borgwardt, K., and Schölkopf, B. Covariate shift and local learning by distribution matching, pp. 131–160. MIT Press, 2009.

Grünwald, P. The safe Bayesian. In International Conference on Algorithmic Learning Theory, pp. 169–183. Springer, 2012.

Holmes, C. and Walker, S. Assigning a value to a power likelihood in a general Bayesian model. Biometrika, 104(2):497–503, 2017.

Iyer, A., Nath, S., and Sarawagi, S. Maximum mean discrepancy for class ratio estimation: Convergence bounds and kernel selection. In Proceedings of the 31st International Conference on Machine Learning, volume 32, pp. 530–538, 2014.

Jewson, J., Smith, J. Q., and Holmes, C. Principles of Bayesian inference using general divergence criteria. Entropy, 20(6):442, 2018.

Knoblauch, J. Robust deep Gaussian processes. arXiv preprint arXiv:1904.02303, 2019.

Knoblauch, J., Jewson, J., and Damoulas, T. Generalized variational inference. arXiv preprint arXiv:1904.02063, 2019.

Kuhn, D., Esfahani, P. M., Nguyen, V. A., and Shafieezadeh-Abadeh, S. Wasserstein distributionally robust optimization: Theory and applications in machine learning. INFORMS TutORials in Operations Research, pp. 130–166, 2019.

Ledoit, O. and Wolf, M. A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88(2):365–411, 2004.

Lee, K., Lee, K., Lee, H., and Shin, J. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In Advances in Neural Information Processing Systems 31, pp. 7167–7177, 2018.

Miller, J. W. and Dunson, D. B. Robust Bayesian inference via coarsening. Journal of the American Statistical Association, 114(527):1113–1125, 2019.

Mohajerin Esfahani, P. and Kuhn, D. Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming, 171(1-2):115–166, 2018.

Moreno-Torres, J., Raeder, T., Alaiz-Rodríguez, R., Chawla, N., and Herrera, F. A unifying view on dataset shift in classification. Pattern Recognition, 45(1):521–530, 2012.

Murphy, K. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.

Namkoong, H. and Duchi, J. C. Stochastic gradient methods for distributionally robust optimization with f-divergences. In Advances in Neural Information Processing Systems 29, pp. 2208–2216, 2016.

Naraei, P., Abhari, A., and Sadeghian, A. Application of multilayer perceptron neural networks and support vector machines in classification of healthcare data. In 2016 Future Technologies Conference, pp. 848–852. IEEE, 2016.

Nguyen, V. A., Shafieezadeh-Abadeh, S., Yue, M.-C., Kuhn, D., and Wiesemann, W. Calculating optimistic likelihoods using (geodesically) convex optimization. In Advances in Neural Information Processing Systems 32, pp. 13942–13953, 2019a.

Nguyen, V. A., Shafieezadeh-Abadeh, S., Yue, M.-C., Kuhn, D., and Wiesemann, W. Optimistic distributionally robust optimization for nonparametric likelihood approximation. In Advances in Neural Information Processing Systems 32, pp. 15872–15882, 2019b.

Ong, V. M. H., Nott, D. J., Tran, M.-N., Sisson, S. A., and Drovandi, C. C. Variational Bayes with synthetic likelihood. Statistics and Computing, 28(4):971–988, 2018.

Price, L. F., Drovandi, C. C., Lee, A., and Nott, D. J. Bayesian synthetic likelihood. Journal of Computational and Graphical Statistics, 27(1):1–11, 2018.

Rippl, T., Munk, A., and Sturm, A. Limit laws of the empirical Wasserstein distance: Gaussian distributions. Journal of Multivariate Analysis, 151:90–109, 2016.

Schervish, M. J. Theory of Statistics. Springer, 1995.

Staib, M. and Jegelka, S. Distributionally robust optimization and generalization in kernel methods. In Advances in Neural Information Processing Systems 32, pp. 9134–9144, 2019.

Tomar, D. and Agarwal, S. A survey on data mining approaches for healthcare. International Journal of Bio-Science and Bio-Technology, 5(5):241–266, 2013.

Toni, T., Welch, D., Strelkowa, N., Ipsen, A., and Stumpf, M. P. Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. Journal of The Royal Society Interface, 6(31):187–202, 2009.

Walker, S. G. Bayesian inference with misspecified models. Journal of Statistical Planning and Inference, 143(10):1621–1633, 2013.

Wiesemann, W., Kuhn, D., and Sim, M. Distributionally robust convex optimization. Operations Research, 62(6):1358–1376, 2014.

Wood, S. N. Statistical inference for noisy nonlinear ecological dynamic systems. Nature, 466:1102–1104, 2010.

Zhao, C. and Guan, Y. Data-driven risk-averse stochastic optimization with Wasserstein metric. Operations Research Letters, 46(2):262–267, 2018.

Zhu, W., Miao, J., Hu, J., and Qing, L. Vehicle detection in driving simulation using extreme learning machine. Neurocomputing, 128:160–165, 2014.