Optimal Quantisation of Probability Measures Using Maximum Mean Discrepancy

Onur Teymur (Newcastle University; Alan Turing Institute)
Jackson Gorham (Whisper.ai, Inc.)
Marina Riabiz (King's College London; Alan Turing Institute)
Chris. J. Oates (Newcastle University; Alan Turing Institute)

Abstract

Several researchers have proposed minimisation of maximum mean discrepancy (MMD) as a method to quantise probability measures, i.e., to approximate a distribution by a representative point set. We consider sequential algorithms that greedily minimise MMD over a discrete candidate set. We propose a novel non-myopic algorithm and, in order to both improve statistical efficiency and reduce computational cost, we investigate a variant that applies this technique to a mini-batch of the candidate set at each iteration. When the candidate points are sampled from the target, the consistency of these new algorithms—and their mini-batch variants—is established. We demonstrate the algorithms on a range of important computational problems, including optimisation of nodes in Bayesian cubature and the thinning of Markov chain output.

1   Introduction

This paper considers the approximation of a probability distribution µ, defined on a set X, by a discrete distribution ν = (1/n) Σ_{i=1}^n δ(x_i), for some representative points x_i, where δ(x) denotes a point mass located at x ∈ X. This quantisation task arises in many areas including numerical cubature (Karvonen, 2019), experimental design (Chaloner and Verdinelli, 1995) and Bayesian computation (Riabiz et al., 2020). To solve the quantisation task one first identifies an optimality criterion, typically a notion of discrepancy between µ and ν, and then develops an algorithm to approximately minimise it. Classical optimal quantisation picks the x_i to minimise a Wasserstein distance between ν and µ, which leads to an elegant connection with Voronoi partitions whose centres are the x_i (Graf and Luschgy, 2007). Several other discrepancies exist but are less well-studied for the quantisation task. In this paper we study quantisation with maximum mean discrepancy (MMD), as well as a specific version called kernel Stein discrepancy (KSD), each of which admits a closed-form expression for a wide class of target distributions µ (e.g. Rustamov, 2019).

Despite several interesting results, optimal quantisation with MMD remains largely unsolved. Quasi Monte Carlo (QMC) provides representative point sets that asymptotically minimise MMD (Hickernell, 1998; Dick and Pillichshammer, 2010); however, these results are typically limited to specific instances of µ and MMD.¹ The use of greedy sequential algorithms, in which x_n is selected conditional on the x_1, ..., x_{n−1} already chosen, has received some attention in the MMD context—see Santin and Haasdonk (2017) and the surveys in Oettershagen (2017) and Pronzato and Zhigljavsky (2018). Greedy sequential algorithms have also been proposed for KSD (Chen et al., 2018, 2019), as well as a non-greedy sequential algorithm for minimising MMD, called kernel herding (Chen et al., 2010). In certain situations,² the greedy and herding algorithms produce the same sequence of points, with the latter theoretically understood due to its interpretation as a Frank–Wolfe algorithm (Bach et al., 2012; Lacoste-Julien et al., 2015). Outside the translation-invariant context, empirical studies have shown that greedy algorithms tend to outperform kernel herding (Chen et al., 2018). Information-theoretic lower bounds on MMD have been derived in the literature on information-based complexity (Novak and Woźniakowski, 2008) and in Mak et al. (2018), who studied representative points that minimise an energy distance; the relationship between energy distances and MMD is clarified in Sejdinovic et al. (2013).

¹ In Section 2.1 we explain how MMD is parametrised by a kernel; the QMC literature typically focuses on µ uniform on [0,1]^d, and d-dimensional tensor products of kernels over [0,1].
² Specifically, the algorithms coincide when the kernel on which they are based is translation-invariant.
The aforementioned sequential algorithms require that, to select the next point x_n, one has to search over the whole set X. This is often impractical, since X will typically be an infinite set and may not have useful structure (e.g. a vector space) that can be exploited by a numerical optimisation method. Extended notions of greedy optimisation, where at each step one seeks to add a point that is merely 'sufficiently close' to optimal, were studied for KSD in Chen et al. (2018). Chen et al. (2019) proposed a stochastic optimisation approach for this purpose. However, the global non-convex optimisation problem that must be solved to find the next point x_n becomes increasingly difficult as more points are selected. This manifests, for example, in the increasing number of iterations required in the approach of Chen et al. (2019).

This paper studies sequential minimisation of MMD over a finite candidate set, instead of over the whole of X. This obviates the need to use a numerical optimisation routine, requiring only that a suitable candidate set can be produced. Such an approach was recently described in Riabiz et al. (2020), where an algorithm termed Stein thinning was proposed for greedy minimisation of KSD. Discrete candidate sets in the context of kernel herding were discussed in Chen et al. (2010) and Lacoste-Julien et al. (2015), and in Paige et al. (2016) for the case that the subset is chosen from the support of a discrete target. Mak et al. (2018) proposed sequential selection from a discrete candidate set to approximately minimise energy distance, but theoretical analysis of this algorithm was not attempted.

The novel contributions of this paper are as follows:

• We study greedy algorithms for sequential minimisation of MMD, including novel non-myopic algorithms in which multiple points are selected simultaneously. These algorithms are also extended to allow for mini-batching of the candidate set. Consistency is established and a finite-sample-size error bound is provided.

• We show how non-myopic algorithms can be cast as integer quadratic programmes (IQP) that can be exactly solved using standard libraries.

• A detailed empirical assessment is presented, including a study varying the extent of non-myopic selection, up to and including the limiting case in which all points are selected simultaneously. Such non-sequential algorithms require high computational expenditure, and so a semi-definite relaxation of the IQP is considered in the supplement.

The remainder of the paper is structured thus. In Section 2 we provide background on MMD and KSD. In Section 3 our novel methods for optimal quantisation are presented. Our empirical assessment, including comparisons with existing methods, is in Section 4 and our theoretical assessment is in Section 5. The paper concludes with a discussion in Section 6.

2   Background

Let X be a measurable space and let P(X) denote the set of probability distributions on X. First we introduce a notion of discrepancy between two measures µ, ν ∈ P(X), and then specialise this definition to MMD (Section 2.1) and KSD (Section 2.2).

For any µ, ν ∈ P(X) and set F consisting of real-valued measurable functions on X, we define a discrepancy to be a quantity of the form

    D_F(µ, ν) = sup_{f∈F} | ∫ f dµ − ∫ f dν |,                                               (1)

assuming F was chosen so that all integrals in (1) exist. The set F is called measure-determining if D_F(µ, ν) = 0 implies µ = ν, and in this case D_F is called an integral probability metric (Müller, 1997). An example is the Wasserstein metric—induced by choosing F as the set of 1-Lipschitz functions defined on X—that is used in classical quantisation (Dudley, 2018, Thm. 11.8.2). Next we describe how MMD and KSD are induced from specific choices of F.

2.1   Maximum Mean Discrepancy

Consider a symmetric and positive-definite function k : X × X → R, which we call a kernel. A kernel reproduces a Hilbert space of functions H from X → R if (i) for all x ∈ X we have k(·, x) ∈ H, and (ii) for all x ∈ X and f ∈ H we have 〈k(·, x), f〉_H = f(x), where 〈·, ·〉_H denotes the inner product in H. By the Moore–Aronszajn theorem (Aronszajn, 1950), there is a one-to-one mapping between the kernel k and the reproducing kernel Hilbert space (RKHS) H, which we make explicit by writing H(k). A prototypical example of a kernel on X ⊆ R^d is the squared-exponential kernel k(x, y; ℓ) = exp(−‖x − y‖² / (2ℓ²)), where ‖·‖ in this paper denotes the Euclidean norm and ℓ > 0 is a positive scaling constant.

Choosing the set F in (1) to be the unit ball B(k) := {f ∈ H(k) : 〈f, f〉_{H(k)} ≤ 1} of the RKHS H(k) enables the supremum in (1) to be written in closed form and defines the MMD (Song, 2008):

    MMD_{µ,k}(ν)² := D_{B(k)}(µ, ν)²
                   = ∫∫ k(x, y) dν(x) dν(y) − 2 ∫∫ k(x, y) dν(x) dµ(y) + ∫∫ k(x, y) dµ(x) dµ(y).      (2)
Our notation emphasises ν as the variable of interest, since in this paper we aim to minimise MMD over possible ν for a fixed kernel k and a fixed target µ. Under suitable conditions on k and X it can be shown that MMD is a metric on P(X) (in which case the kernel is called characteristic); for sufficient conditions see Section 3 of Sriperumbudur et al. (2010). Furthermore, under stronger conditions on k, MMD metrises the weak topology on P(X) (Sriperumbudur et al., 2010, Thms. 23, 24). This provides theoretical justification for minimisation of MMD: if MMD_{µ,k}(ν) → 0 then ν ⇒ µ, where ⇒ denotes weak convergence in P(X).

Evaluation of MMD requires that µ and ν are either explicit or can be easily approximated (e.g. by sampling), so as to compute the integrals appearing in (2). This is the case in many applications and MMD has been widely used (Arbel et al., 2019; Briol et al., 2019a; Chérief-Abdellatif and Alquier, 2020). In cases where µ is not explicit, such as when it arises as an intractable posterior in a Bayesian context, KSD can be a useful specialisation of MMD that circumvents integration with respect to µ. We describe this next.

2.2   Kernel Stein Discrepancy

While originally proposed as a means of proving distributional convergence, Stein's method (Stein, 1972) can be used to circumvent the integration against µ required in (2) to calculate the MMD. Suppose we have an operator A_µ defined on a set of functions G such that ∫ A_µ g dµ = 0 holds for all g ∈ G. Choosing F = A_µ G := {A_µ g : g ∈ G} in (1), we would then have D_{A_µ G}(µ, ν) = sup_{g∈G} | ∫ A_µ g dν |, an expression which no longer involves integrals with respect to µ. Appropriate choices for A_µ and G were studied in Gorham and Mackey (2015), who termed D_{A_µ G} the Stein discrepancy, and these will now be described.

Assume µ admits a positive and continuously differentiable density p_µ on X = R^d; let ∇ and ∇· denote the gradient and divergence operators respectively; and let k : R^d × R^d → R be a kernel that is continuously differentiable in each argument. Then take

    A_µ g := ∇ · g + u_µ · g,      u_µ := ∇ log p_µ,
    G := { g : R^d → R^d : Σ_{i=1}^d 〈g_i, g_i〉_{H(k)} ≤ 1 }.

Note that G is the unit ball in the d-dimensional tensor product of H(k). Under mild conditions on k and µ (Gorham and Mackey, 2017, Prop. 1), it holds that ∫ A_µ g dµ = 0 for all g ∈ G. The set A_µ G can then be shown (Oates et al., 2017) to be the unit ball B(k_µ) in a different RKHS H(k_µ) with reproducing kernel

    k_µ(x, y) := ∇_x · ∇_y k(x, y) + ∇_x k(x, y) · u_µ(y) + ∇_y k(x, y) · u_µ(x) + k(x, y) u_µ(x) · u_µ(y),      (3)

where subscripts are used to denote the argument upon which a differential operator acts. Since k_µ(x, ·) ∈ H(k_µ), it follows that ∫ k_µ(x, ·) dµ(x) = 0 and from (2) we arrive at the kernel Stein discrepancy (KSD)

    MMD_{µ,k_µ}(ν)² = ∫∫ k_µ(x, y) dν(x) dν(y).

Under stronger conditions on µ and k it can be shown that KSD controls weak convergence to µ in P(R^d), meaning that if MMD_{µ,k_µ}(ν) → 0 then ν ⇒ µ (Gorham and Mackey, 2017, Thm. 8). The description of KSD here is limited to R^d, but constructions also exist for discrete spaces (Yang et al., 2018) and more general Riemannian manifolds (Barp et al., 2018; Xu and Matsuda, 2020; Le et al., 2020). Extensions that use other operators A_µ (Gorham et al., 2019; Barp et al., 2019) have also been studied.

3   Methods

In this section we propose novel algorithms for minimisation of MMD over a finite candidate set. The simplest algorithm is described in Section 3.1, and from this we generalise to consider both non-myopic selection of representative points and mini-batching in Section 3.2. A discussion of non-sequential algorithms, as the limit of non-myopic algorithms where all points are simultaneously selected, is given in Section 3.3.

3.1   A Simple Algorithm for Quantisation

In what follows we assume that we are provided with a finite candidate set {x_i}_{i=1}^n ⊂ X from which representative points are to be selected. Ideally, these candidates should be in regions where µ is supported, but we defer making any assumptions on this set until the theoretical analysis in Section 5. The simplest algorithm that we consider greedily minimises MMD over the candidate set; for each i, pick

    π(i) ∈ argmin_{j∈{1,...,n}} MMD_{µ,k}( (1/i) Σ_{i'=1}^{i−1} δ(x_{π(i')}) + (1/i) δ(x_j) ),

to obtain, after m steps, an index sequence π ∈ {1, ..., n}^m and associated empirical distribution ν = (1/m) Σ_{i=1}^m δ(x_{π(i)}). (The convention Σ_{i=1}^0 = 0 is used.) Explicit formulae are contained in Algorithm 1. The computational complexity of selecting m points in this manner is O(m²n), provided that the integrals appearing in Algorithm 1 can be evaluated in O(1). Note that candidate points can be selected more than once.

Algorithm 1: Myopic minimisation of MMD
    Data: A set {x_i}_{i=1}^n, a distribution µ, a kernel k and a number m ∈ N of output points
    Result: An index sequence π ∈ {1, ..., n}^m
    for i = 1, ..., m do
        π(i) ∈ argmin_{j∈{1,...,n}} { (1/2) k(x_j, x_j) + Σ_{i'=1}^{i−1} k(x_{π(i')}, x_j) − i ∫ k(x, x_j) dµ(x) }
    end

Theorems 1, 2 and 4 in Section 5 provide novel finite-sample-size error bounds for Algorithm 1 (as a special case of Algorithm 2). The two main shortcomings of Algorithm 1 are that (i) the myopic nature of the optimisation may be statistically inefficient, and (ii) the requirement to scan through a large candidate set during each iteration may lead to unacceptable computational cost. In Section 3.2 we propose non-myopic and mini-batch extensions to address these issues.
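The greedy rule above only requires the kernel matrix over the candidate set and the vector of integrals ∫ k(x, x_j) dµ(x). A minimal sketch of this loop is given below; it follows the objective displayed in Algorithm 1, with the µ-integrals estimated by Monte Carlo, which is one simple option rather than necessarily the implementation used by the authors.

import numpy as np

def gaussian_kernel(X, Y, ell=1.0):
    sq = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq / ell ** 2)

def myopic_mmd_selection(candidates, mu_samples, m, ell=1.0):
    """Greedy (myopic) selection of m indices, following Algorithm 1: at step i choose j
    minimising 0.5*k(x_j,x_j) + sum_{i'<i} k(x_{pi(i')},x_j) - i * int k(x,x_j) dmu(x)."""
    K = gaussian_kernel(candidates, candidates, ell)                   # n x n kernel matrix
    kbar = gaussian_kernel(candidates, mu_samples, ell).mean(axis=1)   # MC estimate of the mu-integrals
    n = candidates.shape[0]
    selected = []
    running = np.zeros(n)          # sum_{i' < i} k(x_{pi(i')}, x_j) for every candidate j
    for i in range(1, m + 1):
        objective = 0.5 * np.diag(K) + running - i * kbar
        j = int(np.argmin(objective))
        selected.append(j)
        running += K[j, :]
    return selected

rng = np.random.default_rng(1)
candidates = rng.standard_normal((500, 2))   # candidate set, here sampled from the target itself
idx = myopic_mmd_selection(candidates, candidates, m=12, ell=0.5)
print(idx)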
3.2   Generalised Sequential Algorithms

In Section 3.2.1 we describe a non-myopic extension of Algorithm 1, where multiple points are simultaneously selected at each step. The use of non-myopic optimisation is impractical when a large candidate set is used, and therefore we explain how mini-batches from the candidate set can be employed in Section 3.2.2.

3.2.1   Non-Myopic Minimisation

Now we consider the simultaneous selection of s > 1 representative points from the candidate set at each step. This leads to the non-myopic algorithm

    π(i, ·) ∈ argmin_{S∈{1,...,n}^s} MMD_{µ,k}( (1/(is)) Σ_{i'=1}^{i−1} Σ_{j=1}^s δ(x_{π(i',j)}) + (1/(is)) Σ_{j∈S} δ(x_j) ),

whose output is a bivariate index π ∈ {1, ..., n}^{m×s}, together with the associated empirical distribution ν = (1/(ms)) Σ_{i=1}^m Σ_{j=1}^s δ(x_{π(i,j)}). Explicit formulae are contained in Algorithm 2. The computational complexity of selecting ms points in this manner is O(m²s²n^s), which is larger than Algorithm 1 when s > 1. Theorems 1, 2 and 4 in Section 5 provide novel finite-sample-size error bounds for Algorithm 2.

Algorithm 2: Non-myopic minimisation of MMD
    Data: A set {x_i}_{i=1}^n, a distribution µ, a kernel k, a number of points to select per iteration s ∈ N and a total number of iterations m ∈ N
    Result: An index sequence π ∈ {1, ..., n}^{m×s}
    for i = 1, ..., m do
        π(i, ·) ∈ argmin_{S∈{1,...,n}^s} { (1/2) Σ_{j,j'∈S} k(x_j, x_{j'}) + Σ_{i'=1}^{i−1} Σ_{j=1}^s Σ_{j'∈S} k(x_{π(i',j)}, x_{j'}) − is Σ_{j∈S} ∫ k(x, x_j) dµ(x) }
    end

Despite its daunting computational complexity, we have found that it is practical to exactly implement Algorithm 2 for moderate values of s and n by casting each iteration of the algorithm as an instance of a constrained integer quadratic programme (IQP) (e.g. Wolsey, 2020), so that state-of-the-art discrete optimisation methods can be employed. To this end, we represent the indices S ∈ {1, ..., n}^s of the s points to be selected at iteration i as a vector v ∈ {0, ..., s}^n whose jth element indicates the number of copies of x_j that are selected, and where v is constrained to satisfy Σ_{j=1}^n v_j = s. It is then an algebraic exercise to recast an optimal subset π(i, ·) as the solution to a constrained IQP:

    argmin_{v∈N₀^n} (1/2) v⊤Kv + c_i⊤v    s.t.  1⊤v = s,                                     (4)

where K_{j,j'} := k(x_j, x_{j'}), 1_j := 1 for j = 1, ..., n, and c_{ij} := Σ_{i'=1}^{i−1} Σ_{j'=1}^s k(x_{π(i',j')}, x_j) − is ∫ k(x, x_j) dµ(x).

Remark 1. If one further imposes the constraint v_i ∈ {0, 1} for all i, so that each candidate may be selected at most once, then the resulting binary quadratic programme (BQP) is equivalent to the cardinality constrained k-partition problem from discrete optimisation, which is known to be NP-hard (Rendl, 2016). (The results we present do not impose this constraint.)

3.2.2   Mini-Batching

The exact solution of (4) is practical only for moderate values of s and n. This motivates the idea of considering only a subset of the n candidates at each iteration, a procedure we call mini-batching, inspired by the similar idea from stochastic optimisation. There are several ways that mini-batching can be performed, but here we simply state that candidates denoted {x_j^i}_{j=1}^b are considered during the ith iteration, with the mini-batch size denoted by b ∈ N. The non-myopic algorithm for minimisation of MMD with mini-batching is then

    π(i, ·) ∈ argmin_{S∈{1,...,b}^s} MMD_{µ,k}( (1/(is)) Σ_{i'=1}^{i−1} Σ_{j=1}^s δ(x_{π(i',j)}^{i'}) + (1/(is)) Σ_{j∈S} δ(x_j^i) ).

Explicit formulae are contained in Algorithm 3. The complexity of selecting ms points in this manner is O(m²s²b^s), which is smaller than Algorithm 2 when b < n. As with Algorithm 2, an exact IQP formulation can be employed. Theorem 3 provides a novel finite-sample-size error bound for Algorithm 3.

Algorithm 3: Non-myopic minimisation of MMD with mini-batching
    Data: A set {{x_j^i}_{j=1}^b}_{i=1}^m of mini-batches, each of size b ∈ N, a distribution µ, a kernel k, a number of points to select per iteration s ∈ N, and a number of iterations m ∈ N
    Result: An index sequence π ∈ {1, ..., b}^{m×s}
    for i = 1, ..., m do
        π(i, ·) ∈ argmin_{S∈{1,...,b}^s} { (1/2) Σ_{j,j'∈S} k(x_j^i, x_{j'}^i) + Σ_{i'=1}^{i−1} Σ_{j=1}^s Σ_{j'∈S} k(x_{π(i',j)}^{i'}, x_{j'}^i) − is Σ_{j∈S} ∫ k(x, x_j^i) dµ(x) }
    end

3.3   Non-Sequential Algorithms

Finally we consider the limit of the non-myopic Algorithm 2, in which all m representative points are simultaneously selected in a single step:

    π ∈ argmin_{S∈{1,...,n}^m} MMD_{µ,k}( (1/m) Σ_{i∈S} δ(x_i) ).                             (5)

The index set π can again be recast as the solution to an IQP and the associated empirical measure ν = (1/m) Σ_{i=1}^m δ(x_{π(i)}) provides, by definition, a value for MMD_{µ,k}(ν) that is at least as small as any of the methods so far described (thus satisfying the same error bounds derived in Theorems 1–4).

However, it is only practical to exactly solve (5) for small m and thus, to arrive at a practical algorithm, we consider approximation of (5). There are at least two natural ways to do this. Firstly, one could run a numerical solver for the IQP formulation of (5) and terminate after a fixed computational limit is reached; the solver will return a feasible, but not necessarily optimal, solution to the IQP. An advantage of this approach is that no further methodological work is required. Alternatively, one could employ a convex relaxation of the intractable IQP, which introduces an approximation error that is hard to quantify but leads to a convex problem that may be exactly soluble at reasonable cost. We expand on the latter approach in Appendix B, with preliminary empirical comparisons.
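For small n and s, the selection problems above can also be solved exactly by enumeration, which is a useful baseline and a correctness check when experimenting with IQP solvers. The sketch below enumerates multisets of size s to solve one iteration of Algorithm 2 (equivalently, the objective encoded by (4)); it is an illustration under our own naming, not the solver-based implementation described in the text.

import itertools
import numpy as np

def nonmyopic_step(K, kbar, selected, s, i):
    """Exhaustive solution of one iteration of Algorithm 2: choose s candidate indices
    (repetition allowed) minimising
    0.5 * sum_{j,j' in S} K[j,j'] + sum_{prev selected} sum_{j' in S} K[prev,j'] - i*s * sum_{j in S} kbar[j]."""
    n = K.shape[0]
    running = K[selected].sum(axis=0) if selected else np.zeros(n)
    best, best_val = None, np.inf
    for S in itertools.combinations_with_replacement(range(n), s):
        S = list(S)
        val = (0.5 * K[np.ix_(S, S)].sum()
               + running[S].sum()
               - i * s * kbar[S].sum())
        if val < best_val:
            best, best_val = S, val
    return best

# Example with a small candidate set (n = 30, s = 3, m = 4 iterations).
rng = np.random.default_rng(2)
X = rng.standard_normal((30, 2))
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-0.5 * sq / 0.5 ** 2)
kbar = K.mean(axis=1)      # estimate of int k(x_j, x) dmu(x), using the candidates as samples from mu
selected = []
for i in range(1, 5):
    selected += nonmyopic_step(K, kbar, selected, s=3, i=i)
print(selected)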
Figure 1: Quantisation of a Gaussian mixture model using MMD. A candidate set of 1000 independent samples (left), from which 12 representative points were selected using: the myopic method (centre-left); non-myopic selection, picking 4 points at a time (centre-right); and simultaneous selection of all 12 points (right). Simulations were conducted using Algorithms 1 and 2, with a Gaussian kernel whose length-scale was ℓ = 0.25.

4   Empirical Assessment

This section presents an empirical assessment³ of Algorithms 1–3. Two regimes are considered, corresponding to high compression (small sm/n; Section 4.1) and low compression (large sm/n; Section 4.2) of the target. These occur, respectively, in applications to Bayesian cubature and thinning of Markov chain output. In Section 4.3, we compare our method to a variety of others based on optimisation in continuous spaces, augmenting a study in Chen et al. (2019). For details of the kernels used, and a sensitivity analysis for the kernel parameters, see Appendix C.

³ Our code is written in Python and is available at https://github.com/oteym/OptQuantMMD

Figure 1 illustrates how a non-myopic algorithm may outperform a myopic one. A candidate set was constructed using 1000 independent samples from a test measure, and 12 representative points selected using the myopic (Alg. 1 with m = 12), non-myopic (Alg. 2 with m = 3 and s = 4), and non-sequential (Alg. 2 with m = 1 and s = 12) approaches. After choosing the first three samples close to the three modes, the myopic method then selects points that temporarily worsen the overall approximation; note in particular the placement of the fourth point. The non-myopic methods do not suffer to the same extent: choosing 4 points together gives better approximations after each of 4, 8 and 12 samples have been chosen (s = 4 was chosen deliberately so as to be co-prime to the number of mixture components, 3). Choosing all 12 points at once gives an even better approximation.
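The Gaussian mixture targets used for illustration in this section are one setting where the integral ∫ k(x, x_j) dµ(x) required by Algorithms 1–3 is available in closed form: for the squared-exponential kernel, the kernel mean of a Gaussian mixture is a weighted sum of Gaussian-shaped terms. The sketch below records this standard identity; the mixture parameters shown are placeholders and the exact implementation used for the figures may differ.

import numpy as np

def gaussian_mixture_kernel_mean(Y, weights, means, covs, ell=0.25):
    """Closed-form evaluation of int k(x, y) dmu(x) for each row y of Y, where
    k(x, y) = exp(-||x - y||^2 / (2 ell^2)) and mu = sum_c weights[c] * N(means[c], covs[c])."""
    d = Y.shape[1]
    out = np.zeros(Y.shape[0])
    for w, m, S in zip(weights, means, covs):
        A = S + (ell ** 2) * np.eye(d)                       # covariance of the smoothed component
        diff = Y - m
        quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(A), diff)
        norm = np.sqrt(np.linalg.det(np.eye(d) + S / ell ** 2))
        out += w * np.exp(-0.5 * quad) / norm
    return out

# A three-component mixture in d = 2, loosely mirroring the Figure 1 illustration
# (the component locations and weights here are assumptions, not the paper's values).
weights = np.array([1/3, 1/3, 1/3])
means = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 2.0]])
covs = np.array([0.3 * np.eye(2)] * 3)
Y = np.array([[0.0, 0.0], [1.0, 1.0]])
print(gaussian_mixture_kernel_mean(Y, weights, means, covs, ell=0.25))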
Figure 2: Synthetic data model formed of a mixture of 20 bivariate Gaussians (top). Effect of varying the number s of simultaneously-chosen points on MMD when choosing 60 from 1000 independently sampled points (bottom).

Figure 3: KSD vs. wall-clock time, and KSD × time vs. number of selected samples, shown for the 38-dim calcium signalling model (top two panes) and 4-dim Lotka–Volterra model (bottom two panes). The kernel length-scale in each case was set using the median heuristic (Garreau et al., 2017), and estimated in practice using a uniform subsample of 1000 points for each model. The myopic algorithm of Riabiz et al. (2020) is included in the Lotka–Volterra plots—see main text for details.

4.1   Bayesian Cubature

Larkin (1972) and subsequent authors proposed to cast numerical cubature in the Bayesian framework, such that an integrand f is a priori modelled as a Gaussian process with covariance k, then conditioned on data D = {f(x_i)}_{i=1}^m as the integrand is evaluated. The posterior standard deviation is (Briol et al., 2019b)

    Std[ ∫ f dµ | D ] = min_{w_1,...,w_m ∈ R} MMD_{µ,k}( Σ_{i=1}^m w_i δ(x_i) ).             (6)

The selection of x_i to minimise (6) is impractical since evaluation of (6) has complexity O(m³). Huszár and Duvenaud (2012) and Briol et al. (2015) noted that (6) can be bounded above by fixing w_i = 1/m, and that quantisation methods give a practical means to minimise this bound. All results in Section 5 can therefore be applied and, moreover, the bound is expected to be quite tight—w_i ≪ 1/m implies that x_i was not optimally placed, thus for optimal x_i we anticipate w_i ≈ 1/m.

Bayesian cubature is most often used when evaluation of f has a high computational cost, and one is prepared to expend resources in the optimisation of the point set {x_i}_{i=1}^m. Our focus in this section is therefore on the quality of the point set obtained, irrespective of computational cost.
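For a fixed point set, the minimisation over weights in (6) is a linear solve: with K the kernel matrix over the nodes and z the vector of kernel means z_i = ∫ k(x_i, x) dµ(x), the minimiser is w = K⁻¹z and the minimum value is (∫∫ k dµ dµ − z⊤K⁻¹z)^{1/2}. A minimal sketch follows, with integrals against µ estimated from samples; this is our own illustration of (6) rather than the authors' code.

import numpy as np

def gaussian_kernel(X, Y, ell):
    sq = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq / ell ** 2)

def bq_posterior_std(nodes, mu_samples, ell):
    """Right-hand side of (6): the minimum over weights w of MMD_{mu,k}(sum_i w_i delta(x_i)),
    attained at w = K^{-1} z. Integrals against mu are estimated using `mu_samples`."""
    K = gaussian_kernel(nodes, nodes, ell)
    z = gaussian_kernel(nodes, mu_samples, ell).mean(axis=1)     # z_i ~ int k(x_i, x) dmu(x)
    c = gaussian_kernel(mu_samples, mu_samples, ell).mean()      # ~ int int k dmu dmu
    w = np.linalg.solve(K, z)                                    # optimal (Bayesian cubature) weights
    return np.sqrt(max(c - z @ w, 0.0)), w

def uniform_weight_mmd(nodes, mu_samples, ell):
    """The upper bound obtained by fixing w_i = 1/m, i.e. the MMD of the equal-weight point set."""
    K = gaussian_kernel(nodes, nodes, ell)
    z = gaussian_kernel(nodes, mu_samples, ell).mean(axis=1)
    c = gaussian_kernel(mu_samples, mu_samples, ell).mean()
    return np.sqrt(max(K.mean() - 2.0 * z.mean() + c, 0.0))

rng = np.random.default_rng(4)
mu_samples = rng.standard_normal((5000, 2))
nodes = rng.standard_normal((12, 2))
std, w = bq_posterior_std(nodes, mu_samples, ell=0.5)
print(std, uniform_weight_mmd(nodes, mu_samples, ell=0.5))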
Figure 1 suggested that the approximation quality of non-myopic methods depends on s. Figure 2 compares the selection of 60 from 1000 independently sampled points from a mixture of 20 Gaussians, varying s. This gives a set of step functions, one for each value of s. Less myopic selections are seen to outperform more myopic ones. Note in particular that the MMD of the myopic method (s = 1) is observed to decrease non-monotonically. This is a manifestation of the phenomenon also seen in Figure 1, where a particular selection may temporarily worsen the quality of the overall approximation.

Next we consider applications in Bayesian statistics, where both approximation quality and computation time are important. In what follows the density p_µ will be available only up to an unknown normalisation constant and thus KSD—which requires only that u_µ = ∇ log p_µ can be evaluated—will be used.

4.2   Thinning of Markov Chain Output

The use of quantisation to 'thin' Markov chain output was proposed in Riabiz et al. (2020), who studied greedy myopic algorithms based on KSD. We revisit the applications from that work to determine whether our methods offer a performance improvement. Unlike Section 4.1, the cost of our algorithms must now be assessed, since their runtime may be comparable to the time required to produce Markov chain output itself.
The datasets⁴ consist of (i) 4 × 10⁶ samples from the 38-parameter intracellular calcium signalling model of Hinch et al. (2004), and (ii) 2 × 10⁶ samples from a 4-parameter Lotka–Volterra predator-prey model. The MCMC chains are both highly auto-correlated and start far from any mode. The greedy KSD approach of Riabiz et al. (2020) was found to slow down dramatically after selecting 10³ samples, due to the need to compute the kernel between selected points and all points in the candidate set at each iteration. We employ mini-batching (b < n) to ameliorate this, and also investigate the effectiveness of non-myopic selection.

⁴ Available at https://doi.org/10.7910/DVN/MDKNWM.

Figure 3 plots KSD against time, with the number of collected samples m fixed at 1000, as well as time-adjusted KSD against m for both models. This acknowledges that both approximation quality and computational time are important. In both cases, larger mini-batches were able to perform better provided that s was large enough to realise their potential. Non-myopic selection shows a significant improvement over batch-myopic in the Lotka–Volterra model, and a less significant (though still visible) improvement in the calcium signalling model. The practical upper limit on b for non-myopic methods (due to the requirement to optimise over all b points) may make performance for larger s poorer relatively; in a larger and more complex model, there may be fewer than s 'good' samples to choose from given moderate b. This suggests that control of the ratio s/b may be important; the best results we observed occurred when s/b = 10⁻¹.

A comparison to the original myopic algorithm (i.e. b = n) of Riabiz et al. (2020) is included for the Lotka–Volterra model. This is implemented using the same code and machine as the other simulations. The cyan line shown in the bottom two panes of Figure 3 represents only 50 points (not 1000); collecting just these took 42 minutes. This algorithm is slower still for the calcium signalling model, so it was omitted.

4.3   Comparison with Previous Approaches

Here we compare against approaches based on continuous optimisation, reproducing a 10-dimensional ODE inference task due to Chen et al. (2019). The aim is to minimise KSD whilst controlling the number of evaluations n_eval of either the (un-normalised) target µ or its log-gradient u_µ. Figure 4 reports results for random walk Metropolis (RWM), the Metropolis-adjusted Langevin algorithm (MALA), Stein variational gradient descent (SVGD), minimum energy designs (MED), Stein points (SP), and Stein point MCMC (four flavours, denoted SP-∗, described in Chen et al., 2019). The method from Algorithm 3, shown as a black solid line (OPT MB 100-10; b = 100, s = 10) and dashed line (OPT MB 10-1; b = 10, s = 1), was applied to select 100 states from the first mb states visited in the RWM sample path, with m increasing and b fixed. The resulting quantisations are competitive with those produced by existing methods, at comparable computational cost. We additionally include a comparison with batch-uniform selection (B-UNIF 100-10; b = 100, s = 10, drawn uniformly from the RWM output) in light grey.

Figure 4: Comparison of quality of approximation for various methods, adjusted by the number of evaluations n_eval of µ or u_µ. For details of line labels, refer to the main text, and to Fig. 4 of Chen et al. (2019).

5   Theoretical Assessment

This section presents a theoretical assessment of Algorithms 1–3. Once stated, a standing assumption is understood to hold for the remainder of the main text.

Standing Assumption 1. Let X be a measurable space equipped with a probability measure µ. Let k : X × X → R be symmetric positive definite and satisfy C²_{µ,k} := ∫∫ k(x, y) dµ(x) dµ(y) < ∞.

For the kernels k_µ described in Section 2.2, C_{µ,k_µ} = 0 and the assumption is trivially satisfied. Our first result is a finite-sample-size error bound for non-myopic algorithms when the candidate set is fixed:

Theorem 1. Let {x_i}_{i=1}^n ⊂ X be fixed and let C²_{n,k} := max_{i=1,...,n} k(x_i, x_i). Consider an index sequence π of length m and with selection size s produced by Algorithm 2. Then for all m ≥ 1 it holds that

    MMD_{µ,k}( (1/(ms)) Σ_{i=1}^m Σ_{j=1}^s δ(x_{π(i,j)}) )²
        ≤ min_{1⊤w=1, w_i≥0} MMD_{µ,k}( Σ_{i=1}^n w_i δ(x_i) )² + C² (1 + log m)/m,

with C := C_{µ,k} + C_{n,k} an m-independent constant.
The proof is provided in Appendix A.1. Aside from providing an explicit error bound, we see that the output of Algorithm 2 converges in MMD to the optimal (weighted) quantisation of µ that is achievable using the candidate point set. Interestingly, all bounds we present are independent of s.

Remark 2. Theorems 1–4 are stated for general MMD and apply in particular to KSD, for which we set k = k_µ. These results extend the work of Chen et al. (2018, 2019) and Riabiz et al. (2020), who considered only myopic algorithms (i.e., s = 1).

Remark 3. The theoretical bounds being independent of s do not necessarily imply that s = 1 is optimal in practice; indeed our empirical results in Section 4 suggest that it is not.

Our remaining results explore the cases where the candidate points are randomly sampled. Independent and dependent sampling is considered and, in each case, the following moment bound will be assumed:

Standing Assumption 2. For some γ > 0, C₁ := sup_{i∈N} E[ e^{γ k(x_i, x_i)} ] < ∞, where the expectation is taken over x₁, ..., x_n.

In the first randomised setting, the x_i are independently sampled from µ, as would typically be possible when µ is explicit:

Theorem 2. Let {x_i}_{i=1}^n ⊂ X be independently sampled from µ. Consider an index sequence π of length m produced by Algorithm 2. Then for all s ∈ N and all m, n ≥ 1 it holds that

    E[ MMD_{µ,k}( (1/(ms)) Σ_{i=1}^m Σ_{j=1}^s δ(x_{π(i,j)}) )² ]
        ≤ log(C₁)/(nγ) + 2 ( C²_{µ,k} + log(nC₁)/γ ) (1 + log m)/m.

The proof is provided in Appendix A.2. It is seen that n must be asymptotically increased with m in order for the approximation provided by Algorithm 2 to be consistent. No smoothness assumptions were placed on k; such assumptions can be used to improve the O(n⁻¹) term (as in Thm. 1 of Ehler et al., 2019), but we did not consider this useful since the O(m⁻¹) term is the bottleneck in the bound.

An analogous (but more technically involved) argument leads to a finite-sample-size error bound when mini-batches are used:

Theorem 3. Let each mini-batch {x_j^i}_{j=1}^b ⊂ X be independently sampled from µ. Consider an index sequence π of length m produced by Algorithm 3. Then for all m, b ≥ 1

    E[ MMD_{µ,k}( (1/(ms)) Σ_{i=1}^m Σ_{j=1}^s δ(x_{π(i,j)}^i) )² ]
        ≤ log(C₁)/(bγ) + 2 ( C²_{µ,k} + log(bC₁)/γ ) (1 + log m)/m.

The proof is provided in Appendix A.3. The mini-batch size b plays an analogous role to n in Theorem 2 and must be asymptotically increased with m in order for Algorithm 3 to be consistent.

In our second randomised setting the candidate set arises as a Markov chain sample path. Let V be a function V : X → [1, ∞) and, for a function f : X → R and a measure µ on X, let

    ‖f‖_V := sup_{x∈X} |f(x)|/V(x),      ‖µ‖_V := sup_{‖f‖_V ≤ 1} | ∫ f dµ |.

A ψ-irreducible and aperiodic Markov chain with nth step transition kernel P^n is V-uniformly ergodic if and only if there exist R ∈ [0, ∞) and ρ ∈ (0, 1) such that

    ‖P^n(x, ·) − µ‖_V ≤ R V(x) ρ^n                                                           (7)

for all initial states x ∈ X and all n ∈ N (see Thm. 16.0.1 of Meyn and Tweedie, 2012).

Theorem 4. Assume that ∫ k(x, ·) dµ(x) = 0 for all x ∈ X. Consider a µ-invariant, time-homogeneous, reversible Markov chain {x_i}_{i∈N} ⊂ X generated using a V-uniformly ergodic transition kernel, such that (7) is satisfied with V(x) ≥ √k(x, x) for all x ∈ X. Suppose that C₂ := sup_{i∈N} E[ √k(x_i, x_i) V(x_i) ] < ∞. Consider an index sequence π of length m and selection subset size s produced by Algorithm 2. Then, with C₃ = 2Rρ/(1 − ρ), we have that

    E[ MMD_{µ,k}( (1/(ms)) Σ_{i=1}^m Σ_{j=1}^s δ(x_{π(i,j)}) )² ]
        ≤ log(C₁)/(nγ) + C₂C₃/n + 2 ( C²_{µ,k} + log(nC₁)/γ ) (1 + log m)/m.

The proof is provided in Appendix A.4. Analysis of mini-batching in the dependent sampling context appears to be more challenging and was not attempted.

6   Discussion

This paper focused on quantisation using MMD, proposing and analysing novel algorithms for this task, but other integral probability metrics could be considered. More generally, if one is interested in compression by means other than quantisation then other approaches may be useful, such as Gaussian mixture models and related approaches from the literature on density estimation (Silverman, 1986).

Some avenues for further research include: (i) extending symmetric structure in µ to the set of representative points (Karvonen et al., 2019); (ii) characterising an optimal sampling distribution from which elements of the candidate set can be obtained (Bach, 2017); (iii) further applications of our method, for example to Bayesian neural networks, where quantisation of the posterior provides a promising route to reduce the cost of predicting each label in the test dataset.
Acknowledgments

The authors are grateful to Lester Mackey and Luc Pronzato for helpful feedback on an earlier draft, and to Wilson Chen for sharing the code used to produce Figure 4. This work was supported by the Lloyd's Register Foundation programme on data-centric engineering at the Alan Turing Institute, UK. MR was supported by the British Heart Foundation–Alan Turing Institute cardiovascular data science award (BHF; SP/18/6/33805).

References

Arbel, M., Korba, A., Salim, A., and Gretton, A. (2019). Maximum mean discrepancy gradient flow. NeurIPS 32, pages 6484–6494.

Aronszajn, N. (1950). Theory of reproducing kernels. Trans. Am. Math. Soc., 68(3):337–404.

Bach, F. (2017). On the equivalence between kernel quadrature rules and random feature expansions. J. Mach. Learn. Res., 18(1):714–751.

Bach, F., Lacoste-Julien, S., and Obozinski, G. (2012). On the equivalence between herding and conditional gradient algorithms. ICML 29.

Barp, A., Briol, F.-X., Duncan, A., Girolami, M., and Mackey, L. (2019). Minimum Stein discrepancy estimators. NeurIPS 32, pages 12964–12976.

Barp, A., Oates, C., Porcu, E., Girolami, M., et al. (2018). A Riemannian-Stein kernel method. arXiv:1810.04946.

Briol, F.-X., Barp, A., Duncan, A. B., and Girolami, M. (2019a). Statistical inference for generative models with maximum mean discrepancy. arXiv:1906.05944.

Briol, F.-X., Oates, C., Girolami, M., and Osborne, M. A. (2015). Frank-Wolfe Bayesian quadrature: Probabilistic integration with theoretical guarantees. NeurIPS 28, pages 1162–1170.

Briol, F.-X., Oates, C. J., Girolami, M., Osborne, M. A., Sejdinovic, D., et al. (2019b). Probabilistic integration: A role in statistical computation? Stat. Sci., 34(1):1–22.

Chaloner, K. and Verdinelli, I. (1995). Bayesian experimental design: a review. Stat. Sci., 10(3):273–304.

Chen, W. Y., Barp, A., Briol, F.-X., Gorham, J., Girolami, M., Mackey, L., and Oates, C. J. (2019). Stein point Markov chain Monte Carlo. ICML, 36.

Chen, W. Y., Mackey, L., Gorham, J., Briol, F.-X., and Oates, C. J. (2018). Stein points. ICML, 35.

Chen, Y., Welling, M., and Smola, A. (2010). Super-samples from kernel herding. UAI, 26:109–116.

Chérief-Abdellatif, B.-E. and Alquier, P. (2020). MMD-Bayes: robust Bayesian estimation via maximum mean discrepancy. Proc. Mach. Learn. Res., 18:1–21.

Dick, J. and Pillichshammer, F. (2010). Digital nets and sequences: discrepancy theory and quasi–Monte Carlo integration. Cambridge University Press.

Dudley, R. M. (2018). Real analysis and probability. CRC Press.

Ehler, M., Gräf, M., and Oates, C. J. (2019). Optimal Monte Carlo integration on closed manifolds. Stat. Comput., 29(6):1203–1214.

Garreau, D., Jitkrittum, W., and Kanagawa, M. (2017). Large sample analysis of the median heuristic. arXiv:1707.07269.

Goemans, M. X. and Williamson, D. P. (1995). Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. J. ACM, 42(6):1115–1145.

Gorham, J., Duncan, A., Mackey, L., and Vollmer, S. (2019). Measuring sample quality with diffusions. Ann. Appl. Probab., 29(5):2884–2928.

Gorham, J. and Mackey, L. (2015). Measuring sample quality with Stein's method. NeurIPS 28, pages 226–234.

Gorham, J. and Mackey, L. (2017). Measuring sample quality with kernels. ICML 34, 70:1292–1301.

Graf, S. and Luschgy, H. (2007). Foundations of quantization for probability distributions. Springer.

Gurobi Optimization, LLC (2020). Gurobi Optimizer Reference Manual. http://www.gurobi.com.

Hickernell, F. (1998). A generalized discrepancy and quadrature error bound. Math. Comput., 67(221):299–322.

Hinch, R., Greenstein, J., Tanskanen, A., Xu, L., and Winslow, R. (2004). A simplified local control model of calcium-induced calcium release in cardiac ventricular myocytes. Biophys. J., 87(6):3723–3736.

Huszár, F. and Duvenaud, D. (2012). Optimally-weighted herding is Bayesian quadrature. UAI, 28:377–386.

Karvonen, T. (2019). Kernel-based and Bayesian methods for numerical integration. PhD thesis, Aalto University.

Karvonen, T., Särkkä, S., and Oates, C. (2019). Symmetry exploits for Bayesian cubature methods. Stat. Comput., 29:1231–1248.

Lacoste-Julien, S., Lindsten, F., and Bach, F. (2015). Sequential kernel herding: Frank-Wolfe optimization for particle filtering. Proc. Mach. Learn. Res., 38:544–552.

Larkin, F. M. (1972). Gaussian measure in Hilbert space and applications in numerical analysis. Rocky Mt. J. Math., 3:379–421.

Le, H., Lewis, A., Bharath, K., and Fallaize, C. (2020). A diffusion approach to Stein's method on Riemannian manifolds. arXiv:2003.11497.

Mak, S., Joseph, V. R., et al. (2018). Support points. Ann. Stat., 46(6A):2562–2592.

Meyn, S. P. and Tweedie, R. L. (2012). Markov chains and stochastic stability. Springer.

MOSEK ApS (2020). The MOSEK Optimizer API for Python 9.2.26. http://www.mosek.com.

Müller, A. (1997). Integral probability metrics and their generating classes of functions. Adv. App. Prob., 29(2):429–443.

Novak, E. and Woźniakowski, H. (2008). Tractability of Multivariate Problems: Standard information for functionals. European Mathematical Society.

Oates, C. J., Girolami, M., and Chopin, N. (2017). Control functionals for Monte Carlo integration. J. R. Statist. Soc. B, 79(3):695–718.

Oettershagen, J. (2017). Construction of optimal cubature algorithms with applications to econometrics and uncertainty quantification. PhD thesis, University of Bonn.

Paige, B., Sejdinovic, D., and Wood, F. (2016). Super-sampling with a reservoir. UAI, 32:567–576.

Pronzato, L. and Zhigljavsky, A. (2018). Bayesian quadrature, energy minimization, and space-filling design. SIAM-ASA J. Uncert. Quant., 8(3):959–1011.

Rendl, F. (2016). Semidefinite relaxations for partitioning, assignment and ordering problems. Ann. Oper. Res., 240:119–140.

Riabiz, M., Chen, W., Cockayne, J., Swietach, P., Niederer, S. A., Mackey, L., and Oates, C. (2020). Optimal thinning of MCMC output. arXiv:2005.03952.

Rustamov, R. M. (2019). Closed-form expressions for maximum mean discrepancy with applications to Wasserstein auto-encoders. arXiv:1901.03227.

Santin, G. and Haasdonk, B. (2017). Convergence rate of the data-independent P-greedy algorithm in kernel-based approximation. Dolomites Research Notes on Approximation, 10:68–78.

Sejdinovic, D., Sriperumbudur, B., Gretton, A., and Fukumizu, K. (2013). Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Ann. Stat., 41(5):2263–2291.

Silverman, B. W. (1986). Density estimation for statistics and data analysis. CRC Press.

Song, L. (2008). Learning via Hilbert space embedding of distributions. PhD thesis, University of Sydney.

Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Schölkopf, B., and Lanckriet, G. R. (2010). Hilbert space embeddings and metrics on probability measures. J. Mach. Learn. Res., 11:1517–1561.

Stein, C. (1972). A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proc. Sixth Berkeley Symp. on Math. Statist. and Prob., Vol. 2, pages 583–602.

Wolsey, L. A. (2020). Integer Programming: 2nd Edition. John Wiley & Sons, Ltd.

Xu, W. and Matsuda, T. (2020). A Stein goodness-of-fit test for directional distributions. Proc. Mach. Learn. Res., 108:320–330.

Yang, J., Liu, Q., Rao, V., and Neville, J. (2018). Goodness-of-fit testing for discrete distributions via Stein discrepancy. Proc. Mach. Learn. Res., 80:5561–5570.