Foundations of Bayesian Learning from Synthetic Data

Harrison Wilde (Department of Statistics, University of Warwick), Jack Jewson (Barcelona GSE, Universitat Pompeu Fabra), Sebastian Vollmer (Department of Statistics, Mathematics Institute, University of Warwick), Chris Holmes (Department of Statistics, University of Oxford; The Alan Turing Institute)

Abstract

There is significant growth and interest in the use of synthetic data as an enabler for machine learning in environments where the release of real data is restricted due to privacy or availability constraints. Despite a large number of methods for synthetic data generation, there are comparatively few results on the statistical properties of models learnt on synthetic data, and fewer still for situations where a researcher wishes to augment real data with another party's synthesised data. We use a Bayesian paradigm to characterise the updating of model parameters when learning in these settings, demonstrating that caution should be taken when applying conventional learning algorithms without appropriate consideration of the synthetic data generating process and the learning task at hand. Recent results from general Bayesian updating support a novel and robust approach to Bayesian synthetic learning founded on decision theory that outperforms standard approaches across repeated experiments on supervised learning and inference problems.

Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS) 2021, San Diego, California, USA. PMLR: Volume 130. Copyright 2021 by the author(s).

1   Introduction

Privacy enhancing technologies comprise an area of rapid growth (The Royal Society, 2019). An important aspect of this field concerns publishing privatised versions of datasets for learning; it is known that simply anonymising the data is not sufficient to guarantee individual privacy (e.g. Rocher et al., 2019). We instead adopt the Differential Privacy (DP) framework (Dwork et al., 2006) to define working bounds on the probability that an adversary may identify whether a particular observation is present in a dataset, given that they have access to all other observations in the dataset. DP's formulation is context-dependent across the literature; here we amalgamate definitions regarding adjacent datasets from Dwork et al. (2014) and Dwork and Lei (2009):

Definition 1 ((ε, δ)-differential privacy) A randomised function or algorithm K is said to be (ε, δ)-differentially private if, for all pairs of adjacent, equally sized datasets D and D′ that differ in one observation, and all S ⊆ Range(K),

    Pr[K(D) ∈ S] ≤ e^ε Pr[K(D′) ∈ S] + δ.    (1)

Current state-of-the-art approaches involve the privatisation of generative modelling architectures such as Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs) or Bayesian Networks. This is achieved through adjustments to their learning processes such that their outputs fulfil a DP guarantee specified at the point of training (e.g. Zhang et al., 2017; Xie et al., 2018; Jordon et al., 2018; Rosenblatt et al., 2020). Despite these contributions, a fundamental question remains regarding how, from a statistical perspective, one should learn from privatised synthetic data. Progress has been made for simple exponential family and regression models (Bernstein and Sheldon, 2018, 2019), but these model classes are of limited use in modern machine learning applications.

We characterise this problem for the first time via an adoption of the M-open world viewpoint (Bernardo and Smith, 2001) associated with model misspecification, unifying the privacy and synthetic data generation literature alongside recent results in generalised Bayesian updating (Bissiri et al., 2016) and minimum divergence inference (Jewson et al., 2018), to ask what it means to learn from synthetic data, and how we can improve upon our inferences and predictions given that we acknowledge its privatised synthetic nature.
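Definition 1 can be made concrete with the Laplace mechanism referenced later in the paper: adding Laplace noise scaled to a query's L1-sensitivity satisfies (ε, 0)-DP. The sketch below is our own illustration, not the paper's; the data, sensitivity and ε values are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_mechanism(query_value, sensitivity, epsilon):
    """Release a scalar query answer with an (epsilon, 0)-DP guarantee.

    Adding Laplace(0, sensitivity / epsilon) noise to a query with the
    given L1-sensitivity satisfies Definition 1 with delta = 0.
    """
    return query_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Private data held by the keeper K: n = 100 observations in [0, 1].
x = rng.uniform(size=100)

# The mean of 100 values bounded in [0, 1] changes by at most 1/100 when
# one observation is swapped, so its L1-sensitivity is 1/100.
released = laplace_mechanism(x.mean(), sensitivity=1 / 100, epsilon=1.0)
```

Smaller ε forces a larger noise scale, trading accuracy of the released statistic for a tighter bound in Eq. (1).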

This characterisation results in generative models that are 'misspecified by design', owing to the constraints imposed upon them by requiring the fulfilment of a DP guarantee. This inevitably leads to a discrepancy between the learner's final model and the one that they would otherwise have formulated if not for this DP restriction. In real-world, finite-data contexts where synthesis methods are often 'black-box' in nature, it is difficult for a learner to fully capture and understand the inherent differences in the underlying distributions of the real and synthetic data that they have access to.

There are two key insights that we explore in this paper following the characterisation above. Firstly, when left unchecked, the Bayesian inference machine learns model parameters minimising the Kullback-Leibler divergence (KLD) to the synthetic data generating process (S-DGP) (Berk et al., 1966; Walker, 2013) rather than the true data generating process (DGP). Secondly, robust inference methods offer improved performance by acknowledging this misspecification; in some cases synthetic data can otherwise significantly hinder learning rather than helping it.

In order to investigate these behaviours, we experiment with models based on a mix of simulated-private and real-world data to offer empirical insights on the learning procedure when a varying amount of real data is available, and explore the optimalities in the amounts of synthetic data with which to augment this real data. The contributions of our work are summarised below:

1. Learning from synthetic data can lead to unpredictable and negative outcomes, due to varying levels of model misspecification introduced by its generation and associated privacy constraints.
2. Robust Bayesian inference offers improvements over classical Bayes when learning from synthetic data.
3. Real and synthetic data can be used in tandem to achieve practical effectiveness through the discovery of desirable stopping points for learning, and optimal model configurations.
4. Consideration of the preferred properties of the inference procedure is critical; the specific task at hand can determine how best to use synthetic data.

We adopt a Bayesian standpoint throughout this paper, but note that many of the results also hold in the frequentist setting.

2   Problem Formulation

We outline the inference problem as follows.

• Let x1:n denote a training set of n exchangeable observations from Nature's true DGP, F0(x), with density f0(x) with respect to the Lebesgue measure, such that x1:n ∼ F0(x); we suppose xi ∈ R^d. These observations are held privately by a data keeper K.
• K uses data x1:n to produce an (ε, δ)-differentially private synthetic data generating mechanism (S-DGP). With a slight abuse of notation we use Gε,δ(x1:n) to denote the S-DGP, noting that Gε,δ could be a fully generative model, or a private release mechanism that acts directly on the finite data x1:n (see the discussion on the details of the S-DGP below). We denote the density of this S-DGP as gε,δ.
• Let fθ(x) denote a learner L's model likelihood for F0(x), parameterised by θ with prior π̃(θ), and marginal (predictive) likelihood p(x) = ∫ fθ(x) π̃(θ) dθ.
• L's prior may already encompass some other set of real data drawn from F0, leading to π̃(θ) = π(θ | x^L_{1:nL}), for nL ≥ 0 prior observations.

We adopt a decision-theoretic framework (Berger, 2013), assuming that L wishes to take some optimal action â in a prediction or inference task, satisfying:

    â = arg max_{a ∈ A} ∫ U(x, a) F0(dx).    (2)

This is with respect to a user-specified utility function U(x, a) that evaluates actions in the action space A, and makes precise L's desire to learn about F0 in order to accurately identify â.

Details of the synthetic data generation mechanism. In defining Gε,δ, we believe it is important to differentiate between its two possible forms:

1. Gε,δ(x1:n) = Gε,δ(z | x1:n): G is a privacy-preserving generative model fit on the real data, such as the PATE-GAN (Jordon et al., 2018), DP-GAN (Xie et al., 2018) or PrivBayes (Zhang et al., 2017). Privatised synthetic data is produced by injecting potentially heavy-tailed noise into gradient-based learning and/or through partitioned training leveraging marginal distributions, aggregations and subsets of the data. The S-DGP provides conditional independence between z1:m and x1:n and therefore no longer queries the real data after training.
2. Gε,δ = ∫ Kε,δ(x, dz) F0(dx): a special case of this integral comprises the convolution of F0 with some noise distribution H, such that Gε,δ = F0 ⋆ Hε,δ. The sampling distribution is therefore not a function of the private data x1:n. In this case, the number of samples that we can draw is limited to m ≤ n, as drawing one data item requires using one sample of K's data. Examples of this formulation include the Laplace mechanism (Dwork et al., 2014) and transformation-based privatisation (Aggarwal and Yu, 2004).

The fundamental problem of synthetic learning. L wants to learn about F0 but only has access to their prior π̃(θ) and to z1:m ∼ Gε,δ, where Gε,δ ≢ F0. That is, the S-DGP Gε,δ(·) is 'misspecified by design'. This claim is supported by a number of observations:

• L specifies a model p(x) using beliefs about the target F0, to be built using real data x1:n; they are then constrained by a subsequently imposed requirement of guaranteeing DP, which instead requires consideration of the resulting S-DGP Gε,δ. This leads to an inevitable change in their beliefs, such that the resulting model is misspecified relative to the original 'best' model for the true DGP F0.
• Correctly modelling a privacy-preserving mechanism as part of an S-DGP, such as a 'black-box' GAN or a complex noise convolution, is often intractable.
• There is an inherent finiteness to real-world sensitive data contexts that makes it difficult for a generative model to capture the true DGP of even the non-private data it seeks to emulate. Considering sufficiently large quantities of data, and sufficiently flexible models, would in theory allow for the true DGP to be learnt, but relying on asymptotic behaviour in this setting is at odds with the definition of DP, given that the identifiability of individuals would also naturally diminish as data availability increases. Moreover, for non-trivial high-dimensional models, the amount of data required to properly capture the DGP often becomes infeasible, regardless of the magnitude of the DP constraints.

Therefore, the posterior predictive converges to a different distribution under the real and synthetic data generating processes, such that p(x | z1:m→∞) ≢ p(x | x1:n→∞). Learning from synthetic data is an intricate example of learning under model misspecification, where the misspecification is by K's design. It is important, as shown below, that this is recognised in the learning of models. Fortunately, we can adapt recent advances in Bayesian inference under model misspecification to help optimise learning with respect to L's task.

2.1   Bayesian Inference under model misspecification

Bayesian inference under model misspecification has recently been formalised (Walker, 2013; Bissiri et al., 2016) and represents a growing area of research; see Watson and Holmes (2016); Jewson et al. (2018); Miller and Dunson (2018); Lyddon et al. (2018); Grünwald et al. (2017); Knoblauch et al. (2019) to name but a few. Traditional Bayes' rule updating in this context can be seen as an approach that learns about the parameters of the model that minimises the logarithmic score, or equivalently, the Kullback-Leibler divergence (KLD) of the model from the DGP of the data (Berk et al., 1966; Walker, 2013; Bissiri et al., 2016), where KLD(gε,δ ∥ f) = ∫ log(gε,δ / f) dGε,δ. As a result, if L updates their model fθ(x) using synthetic data z1:m ∼ Gε,δ(x1:n), then as m → ∞ they will be learning about the limiting parameter that minimises the KLD to the S-DGP:

    θ^KLD_{Gε,δ} = arg min_θ KLD(gε,δ(·) ∥ fθ(·)),    (3)

and under regularity conditions the posterior distribution concentrates around that point, π(θ | z1:m) → 1_{θ^KLD_{Gε,δ}} as m → ∞.

Furthermore, this posterior will concentrate away from the model that is closest to F0 in KLD, corresponding to the limiting model that would be learnt given an infinite real sample x1:∞ from F0:

    θ^KLD_{Gε,δ} ≢ θ^KLD_{F0} = arg min_θ KLD(f0(·) ∥ fθ(·)).    (4)

This problem is exacerbated by the fact that S-DGPs can be prone to generating 'outliers' relative to the private data, as a result of the injection of additional noise into their training or generation to ensure (ε, δ)-DP, combined with the fact that θ^KLD_{Gε,δ} is known to be non-robust to outliers (e.g. Basu et al., 1998). Therefore, given that as we collect more synthetic data our inference is no longer minimising the KLD towards F0, we must carefully consider and investigate whether our inference is still 'useful' for learning about F0 at all.

2.2   The approximation to F0

Before we proceed any further, we must consider what it means for data from Gε,δ to be 'useful' for learning about F0. We can do so using the concepts of scoring rules and statistical divergences.

Definition 2 (Proper Scoring Rule) The function s : X × P → R is a strictly proper scoring rule provided its difference function D satisfies

    D(f0 ∥ f) = E_{x∼f0}[s(x, f(·))] − E_{x∼f0}[s(x, f0(·))],
    D(f1 ∥ f2) ≥ 0 and D(f ∥ f) = 0 for all f, f1, f2 ∈ P(x),
    P(x) := { f(x) : f(x) ≥ 0 ∀x ∈ X, ∫_X f(x) dx = 1 },

where the function D measures a distance between two probability distributions; D(f0 ∥ f) arises as the divergence associated with the scoring rule s and is minimised when f = f0 (Gneiting and Raftery, 2007; Dawid, 2007). A further advantage of this representation is that it allows for the minimisation of D(f0 ∥ ·) using only samples from F0:

    arg min_{f∈F} D(f0 ∥ f) = arg min_{f∈F} E_{x∼f0}[s(x, f(·))]
                            ≈ arg min_{f∈F} (1/n) Σ_{i=1}^n s(xi, f(·)),  xi ∼ F0,    (5)

with the empirical average converging to the expectation as n → ∞.

Henceforth, we define any concept of closeness (or 'usefulness') in terms of a chosen divergence D and associated scoring rule s, where the approximating density f is given by the predictive inferences resulting from synthetic z1:m ∼ Gε,δ. Given that inference using Gε,δ is no longer able to exactly capture f0, L can use this notion of closeness to define which aspects of F0 they are most concerned with capturing. The importance of this specification is illustrated in Section 4.

3   Improved learning from the S-DGP

The classical assumptions underlying statistics are that minimising the KLD is the optimal way to learn about the DGP, and that more observations provide more information about this underlying DGP; such logic does not necessarily apply here. L wishes to learn about the private DGP, F0, but must rely on observations from the S-DGP Gε,δ to do so. In this section we acknowledge this setting to propose a framework for improved learning from synthetic data. In doing so we pose the following questions and detail our solutions in turn. Given the scoring criterion D, is θ^KLD_{Gε,δ} the best the learner can do?

1. Can the robustness of the learning procedure be improved to better approximate F0, by acknowledging the misspecification and outlier-prone nature of z1:m?
2. Starting from the prior predictive, p(x), for a given learning method, when does learning using z ∼ Gε,δ stop improving inference for F0(x)? That is, when does

    E_z[D(f0(·) ∥ p(· | z1:j+1))] > E_z[D(f0(·) ∥ p(· | z1:j))] ?

3.1   General Bayesian Inference

In order to address these issues we adopt a general Bayesian, minimum divergence paradigm for inference (Bissiri et al., 2016; Jewson et al., 2018) inspired by model misspecification, in which L can coherently update beliefs about their model parameter θ from prior π̃(θ) to posterior π(θ | z1:m) using:

    π(θ | z1:m) = π̃(θ) exp(−Σ_{j=1}^m ℓ(zj, fθ)) / ∫ π̃(θ) exp(−Σ_{j=1}^m ℓ(zj, fθ)) dθ,    (6)

where ℓ(z, fθ) is the loss function used by L for inference. Note that in this formulation, the logarithmic score ℓ0(z, fθ) = −log fθ(z) recovers traditional Bayes' rule updating. The predictive distribution associated with such a posterior and the model fθ is:

    p^ℓ(x | z1:m) = ∫ fθ(x) π^ℓ(θ | z1:m) dθ.    (7)

3.2   Robust Bayes and dealing with outliers

In the absence of the ability to correctly model the S-DGP, robust statistics (see e.g. Berger et al., 1994) provide an alternative option to guard against artefacts of the generated synthetic data. We can gain increased robustness in our learning procedure to data z1:m by changing the loss function ℓ(z, fθ) used for inference in Eq. (6). We consider two alternative loss functions to the standard logarithmic score underpinning standard Bayesian statistics:

    ℓ^w(z, fθ) := −w log fθ(z),    (8)
    ℓ^(β)(z, fθ) := (1/(β+1)) ∫ fθ(y)^{β+1} dy − (1/β) fθ(z)^β.    (9)

Here, ℓ^w(z, fθ) introduces a learning parameter w > 0 into the Bayesian update (e.g. Lyddon et al., 2018; Grünwald et al., 2017; Miller and Dunson, 2018; Holmes and Walker, 2017). Down-weighting with w < 1 will generally produce a less confident posterior than in the case of traditional Bayes' rule, with a greater dependence on the prior. Conversely, w > 1 will have the opposite effect. The value of w can have ramifications for inference and prediction (Rossell and Rubio, 2018; Grünwald et al., 2017). However, we note that as the sample size grows, the weighted likelihood posterior will still learn about θ^KLD_{Gε,δ} if w is fixed. Choosing w = e/m instead can be seen to average the log-likelihood, with e providing a notion of effective sample size.

Alternatively, minimising ℓ^(β)(x, f(·)) in expectation over the DGP is equivalent to minimising the β-divergence (βD) (Basu et al., 1998). Therefore, analogously to the KLD and the log-score, using ℓ^(β)(x, f(·)) (Bissiri et al., 2016; Jewson et al., 2018; Ghosh and Basu, 2016) produces a Bayesian update targeting:

    θ^βD_{Gε,δ} := arg min_θ βD(gε,δ(·) ∥ fθ(·)).    (10)

As β → 0, βD → KLD, but as β increases it provides increased robustness through skepticism of new observations relative to the prior. We demonstrate the robustness properties of the βD in some simple scenarios in the Supplementary material (see A.1) and refer the reader to e.g. Knoblauch et al. (2019, 2018) for further examples. We note there are many possible divergences providing greater robustness properties than the KLD, e.g. the Wasserstein distance or Stein discrepancy (Barp et al., 2019), but for our exposition we focus on the βD for its convenience and simplicity.
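Eqs. (6) and (9) can be put to work in a few lines. The sketch below is our own illustration, not one of the paper's experiments: it uses a grid approximation to the general Bayesian posterior over a Gaussian mean (σ fixed, flat prior on the grid), an artificially contaminated sample standing in for outlier-prone synthetic data, and the closed form ∫ N(y; μ, σ²)^{β+1} dy = (2πσ²)^{−β/2} (β+1)^{−1/2}:

```python
import numpy as np

rng = np.random.default_rng(2)

def beta_loss(z, mu, sigma, beta):
    """l^(beta)(z, f_theta) of Eq. (9) for f_theta = N(mu, sigma^2)."""
    integral = (2 * np.pi * sigma**2) ** (-beta / 2) / np.sqrt(beta + 1)
    dens = np.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return integral / (beta + 1) - dens**beta / beta

def log_loss(z, mu, sigma):
    """Logarithmic score l_0(z, f_theta) = -log f_theta(z)."""
    return 0.5 * np.log(2 * np.pi * sigma**2) + 0.5 * ((z - mu) / sigma) ** 2

def grid_posterior(losses):
    """General Bayesian posterior of Eq. (6) on a grid, flat prior."""
    log_post = -(losses - losses.min())   # stabilise the exponentials
    post = np.exp(log_post)
    return post / post.sum()

# A sample of 'synthetic' data: N(0, 1) inliers plus a clump of outliers,
# mimicking the heavy-tailed artefacts an S-DGP can produce.
z = np.concatenate([rng.normal(size=95), np.full(5, 10.0)])

mu_grid = np.linspace(-2.0, 2.0, 401)
sigma, beta = 1.0, 0.5

kld_losses = np.array([log_loss(z, mu, sigma).sum() for mu in mu_grid])
bd_losses = np.array([beta_loss(z, mu, sigma, beta).sum() for mu in mu_grid])

mean_kld = (mu_grid * grid_posterior(kld_losses)).sum()
mean_bd = (mu_grid * grid_posterior(bd_losses)).sum()
# mean_kld is dragged towards the outliers (roughly 0.5 here), while
# mean_bd stays near 0: dens**beta essentially vanishes for observations
# that are unlikely under the current model, the adaptive down-weighting
# described above.
```

Replacing `log_loss` with `w * log_loss` recovers the ℓ^w update of Eq. (8), which rescales every observation equally and so shifts confidence rather than the limiting parameter.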

A key difference between the two robust loss func-            We provide a proposition that says using more data
tions considered above is that while `w (z, fθ ) down-        and approaching the limit, θGKLD
                                                                                                is not necessarily the
weights the log-likelihood of each observation equally,       optimal target to learn about according to criteria D.
`(β) (x, f (·)) does so adaptively, based on how likely
the new observation is under the current inference (Ci-       Proposition 1 (Suboptimality of the S-DGP)
chocki et al., 2011). It is this adaptive down-weighting      For S-DGP Gε,δ , model fθ (·), and divergence D, there
that allows the βD to target a different limiting param-      exists prior π̃(θ), private DGP F0 and at least one
eter to θGKLDε,δ
                 . This, in particular, allows the βD to be   value of 0 ≤ m < ∞ such that,
robust to outliers and/or heavy tailed contaminations.                                                  
As a result, we believe that                                        T`0 (m; D, f0 , gε,δ ) ≤ D F0 k fθKLD ,     (11)
             D F0 k fθβD < D F0 k fθKLD
                     Gε,δ                Gε,δ                 where θGKLD
                                                                           := arg minθ∈Θ KLD(Gε,δ k fθ ) and T`0 is the
                                                              learning trajectory of the Bayesian posterior predictive
across a wide range of S-DGPs, i.e. the βD minimising
                                                              distribution (using `0 ) based on (synthetic) data z1:m ,
approximation to Gε,δ is a better approximation of F0
                                                              see Eq. (7).
than the KLD minimising approximation. Proving
that this is the case in general proves complicated           The proof of Proposition 1 involves a simple counter
and is hampered by the intractability both of the βD          example in which the prior is a better approximation
minimisation and of many popular S-DGP’s. However,            to F0 according to divergence D than Gε,δ . While
we further justify this claim in A.1.2 where we show          trivial, this could reasonably occur if L has strong, well-
that for the prevalent Laplace DP mechanism, this             calibrated expert belief judgements, or if they have a
holds uniformly over the privacy level parameterised          considerable amount of their own data before beginning
by the scaling parameter of the Laplace distribution λ        to incorporate synthetic data. Furthermore, we argue
for D = KLD.                                                  next that by considering this learning trajectory path
A strength of the βD is that, unlike standard robust          between the prior and the S-DGP Gε,δ , in terms of the
methods using heavier tailed models or losses (Berger         number of synthetic observations m, it is possible to
et al., 1994; Huber and Ronchetti, 1981; Beaton and           get even ‘closer’ to F0 in terms of a chosen D.
Tukey, 1974), `(β) (x, f (·)) does not change the model       Changing the divergence used for inference as suggested
used for inference. In the absence of any specific knowl-     in Section 3.2 changes these trajectories by chang-
edge about the S-DGP, updating using the βD maintains         ing their limiting parameter. However, Proposition
the model L would have used to estimate F0 , but up-          1, which considered learning minimising the KLD, can
dates its parameters robustly. This also has advantages       equally be shown for learning minimising the βD (see
in the data combination scenario where L is combining         A.2.1). In the following sections we talk generally
inferences from their own private data xL  1:nL with syn-     about optimising such a trajectory for a given learn-
thetic data z1:m . They can maintain the same model           ing method, before focusing on comparisons in these
for both datasets, with the same model parameters,            methods and the resulting trajectories in Section 4.
yet update robustly about z1:m whilst still using the log-score for xL1:nL (i.e. to condition their prior π̃(θ)).

3.3   The Learning Trajectory

The concept of closeness provided by D allows us to consider how L's approximation to F0 changes as more data is collected from the S-DGP. To do so we consider the 'learning trajectory', a function in m of the expected divergence to F0 of inference using m observations from Gε,δ:

    Tℓ(m; D, f0, gε,δ) = Ez[ D( f0(·) ‖ pℓ(· | z1:m) ) ],

where pℓ(· | z1:m) is the general Bayesian posterior predictive distribution (using ℓ) based on (synthetic) data z1:m. This 'learning trajectory' traces the path taken by Bayesian inference from its prior predictive (m = 0) towards the synthetic data's DGP under increasing amounts of data (e.g. Figure 1; Section A.4).

3.4   Optimising the Learning Trajectory

The insights from the previous section raise two questions for a learner L using synthetic data:

1. Is synthetic data able to improve L's inferences about F0 according to divergence D? If so,
2. What is the optimal quantity to use for getting the learner closest to F0 according to D?

Both questions can be solved by the estimation of

    m∗ := arg min_{0≤m≤M} Tℓ(m; D, f0, gε,δ),    (12)

However, clearly the learner never has access to the data generating density. Instead we take advantage of the representation of proper scoring rules and propose using a 'test set' x′1:N ∼ F0 to estimate

    m̂ := arg min_{0≤m≤M} (1/(NB)) Σ_{b=1}^{B} Σ_{j=1}^{N} s(x′j, pℓ(· | z(b)1:m)),  with {z(b)1:m}_{b=1:B} ∼ Gε,δ.    (13)

As such we use a small amount of data from F0 to guide the synthetic data inference towards F0. We consider two procedures for doing so: firstly, by tailoring to L's specific inference problem; and, when this is not possible, by putting the onus on K to evaluate the general ability of their S-DGP to capture F0. Before doing so, the following remark briefly considers optimising the learning trajectory for a specific stream of data rather than averaging over the S-DGP.

Remark 1 We note that although previously the learning trajectory was defined to average across samples

study. Here we recommend that alongside releasing synthetic data, K optimises the learning trajectory themselves, under some default model, loss and prior setting, by repeatedly partitioning x1:n into test and training sets. For example, when releasing classification data, K could release an m̂ associated with logistic regression and BART, for the log-score, under some vaguely informative priors, providing learners an idea of what to expect in terms of utility from the synthetic data. Whilst this is less tailored to any specific inference problem, it still allows K to communicate a broad measure of the quality of its released data for learning about F0, and is advisable given our results.

3.5   Posthoc improvement through averaging

Once m̂ has been estimated, the question then remains of how to conduct inference given m̂. In particular, if more synthetic data is available (e.g. m̂
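The m̂ estimator of Eq. (13) is straightforward to implement. The following minimal Python sketch assumes, purely for illustration, a conjugate Normal model with known variance as the learner's posterior predictive and the log-score as s; the synthetic draws are deliberately biased and noised to mimic an imperfect S-DGP. All names and constants here are our own, not the paper's implementation.

```python
import math
import random

def log_score(x, mu, var):
    # The log-score s(x, p): negative log predictive density of N(mu, var) at x.
    return 0.5 * math.log(2 * math.pi * var) + (x - mu) ** 2 / (2 * var)

def predictive(z, mu0=0.0, tau0=1.0, sigma=1.0):
    # Conjugate Normal-Normal posterior predictive N(mu_n, var_n), known sigma.
    n = len(z)
    if n == 0:
        return mu0, tau0 ** 2 + sigma ** 2  # prior predictive (m = 0)
    prec = 1 / tau0 ** 2 + n / sigma ** 2
    mu_n = (mu0 / tau0 ** 2 + sum(z) / sigma ** 2) / prec
    return mu_n, 1 / prec + sigma ** 2

def estimate_m_hat(test, synth_datasets, grid):
    # Eq. (13): average the score of the predictive fitted on the first m
    # synthetic points, over B bootstrap draws and N test points.
    def risk(m):
        total = 0.0
        for z in synth_datasets:  # B draws from G_{eps,delta}
            mu_n, var_n = predictive(z[:m])
            total += sum(log_score(x, mu_n, var_n) for x in test)
        return total / (len(test) * len(synth_datasets))
    return min(grid, key=risk)

random.seed(0)
laplace = lambda scale: random.choice([-1.0, 1.0]) * random.expovariate(1.0 / scale)
test = [random.gauss(0, 1) for _ in range(50)]                       # x' ~ F0
synth = [[random.gauss(0.3, 1) + laplace(2.0) for _ in range(200)]   # z ~ G (biased, noisy)
         for _ in range(20)]
m_hat = estimate_m_hat(test, synth, grid=range(0, 201, 10))
```

In practice the grid of candidate m would extend to the full synthetic budget M, and the predictive would be whichever (generalised) posterior the learner actually intends to use.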
adjustments discussed in Section 3.2. In order to draw comparisons between these methods, we study the trajectories' dependence on values spanning a grid of data quantities nL and m, robustness parameters w and β, prior values, and the parameters of chosen DP mechanisms (see A.6.3 for full experimental specifications). The varying amounts of non-private data available to L were used to construct increasingly informative priors π̃(θ) = π(θ | xL1:nL) through the use of standard Bayesian updating, as robustness is not required when learning using data drawn from F0. Learning trajectories are then estimated utilising an unseen dataset x′1:N (mimicking either K's data or some subset of L's data not used in training).

To this end we use optimised MCMC sampling schemes (e.g. Hoffman and Gelman, 2014) to sample from all of the considered posteriors in each experiment's case, drawing comparisons across the grid laid out above and repeating experiments to mitigate any sources of noise. This results in an extensive computational task, made feasible through a mix of Julia's Turing PPL (Ge et al., 2018), MLJ (Blaom et al., 2020) and Stan (Carpenter et al., 2017).

The majority of the experiments are carried out with ε = 6, which is seen to be a realistic value with respect to practical applications (Lee and Clifton, 2011; Erlingsson et al., 2014; Tang et al., 2017; Differential Privacy Team at Apple, 2017) and upon observation of the relationship between privacy and misspecification shown in Figure A.4. We evaluate our experiments according to the divergences and scoring rules discussed in detail in A.6.2, presented across a range of the figures in the main text and supplementary material.

4.1   Simulated Gaussian Models

We first introduce a simple but illustrative simulated example in which we infer the parameters of a Gaussian model fθ = N(µ, σ²) where θ = (µ, σ²). We place conjugate priors on θ with σ² ∼ InverseGamma(αp, βp) and µ ∼ N(µp, σp × σ) respectively. We consider x1:n drawn from DGP F0 = N(0, 1²) and adopt the Laplace mechanism (Dwork et al., 2014) to define our S-DGP. This mechanism works by perturbing samples drawn from the DGP with noise drawn from the Laplace distribution of scale λ, calibrated via the sensitivity S of the DGP in order to provide (ε, 0)-DP per the Laplace mechanism's definition with ε = S/λ. To achieve finiteness of S in this case, we adjust our model to be that of a truncated Gaussian, restricting its range to ±3σ to allow for meaningful ε's to be calculated under the Laplace mechanism (see A.3 for a proof of this result).

We then compare and evaluate the empirical performance of the competing methods defined below (see A.6.1.1 for explicit formulations):

1. The standard likelihood adjusted with an additional reweighting parameter w as in Eq. (8).
2. The posterior under the βD loss as in Eq. (9).
3. The 'Noise-Aware' likelihood where the S-DGP can be tractably modelled using the Normal-Laplace convolution (Reed, 2006; Amini and Rabbani, 2017).

4.1.1   Results and Discussion

We observe that three different categories of learning trajectory occur across the models; these are illustrated in the 'branching' plots in Figure 1 (explained and analysed further in A.6.4 and A.6.5):

1. The prior π̃ is sufficiently inaccurate or uninformative (in this case due to low nL) such that the synthetic data continues to be useful across the range of m we consider. As a result the learning trajectory is a monotonically decreasing curve in the criterion of interest.
2. A turning point is observed: synthetic data initially brings us closer to F0 before the introduction of further synthetic observations moves the inference away. We see that in the majority of cases these trajectories lie under the limiting KLD and βD approximations to Gε,δ, demonstrating the efficacy of 'optimising the learning trajectory' through the existence of these optimal turning points.
3. The final scenario occurs under a sufficiently informative prior π̃ (here due to a large nL) such that synthetic data is not observably of any use at all; it can be seen to immediately cause model performance to deteriorate.

We can further quantify the turning points, which are perhaps the most interesting characteristic of these experiments. To do this we formulate bootstrapped averages of the number of 'effective real samples' that correspond to the estimated optimal quantity of synthetic data. This is done by comparing these minima with the black curves in the 'branching' plots representing the learning trajectory under an increasing nL and m = 0 (see A.6.4 for more details). These calculations are shown in Figure 2 (and discussed further in A.6.5).

In general, we observe a significant increase in performance from the βD (see Figures 1, 3), indicated by its proximity to even the noise-aware model at lower values of nL, alongside more modest improvements from reweighting methods. The βD achieves more desirable minimum-trajectory log score, KLD and Wasserstein values compared to other model types, exhibiting greater robustness to larger amounts of synthetic data where other approaches lose out significantly.
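The Laplace-mechanism S-DGP described above can be sketched as follows, assuming (as a simplification) that the sensitivity S is the width 6σ of the truncated support, so that the noise scale is λ = S/ε. All function names and constants here are illustrative rather than the paper's code.

```python
import random

def truncated_gauss(mu, sigma, bound=3.0):
    # Rejection-sample N(mu, sigma^2) truncated to mu ± bound * sigma.
    while True:
        x = random.gauss(mu, sigma)
        if abs(x - mu) <= bound * sigma:
            return x

def laplace_noise(scale):
    # Draw Laplace(0, scale) as a sign-flipped exponential.
    return random.choice([-1.0, 1.0]) * random.expovariate(1.0 / scale)

def sdgp_sample(mu, sigma, eps, m, bound=3.0):
    # Laplace-mechanism S-DGP: perturb each truncated draw with Laplace noise
    # of scale lam = S / eps, taking S = 2 * bound * sigma (the width of the
    # truncated support) as the assumed sensitivity of one released observation.
    S = 2 * bound * sigma
    lam = S / eps
    return [truncated_gauss(mu, sigma, bound) + laplace_noise(lam)
            for _ in range(m)]

random.seed(1)
z = sdgp_sample(mu=0.0, sigma=1.0, eps=6.0, m=1000)  # eps = 6 as in the experiments
```

Note that even at ε = 6 the released samples carry substantially inflated variance (the Laplace noise adds 2λ² = 2 here), which is exactly the misspecification the w and βD adjustments are designed to absorb.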
[Figure 1 appears here: two panels, w = 0.5 (left) and β = 0.5 (right); y-axis: Kullback-Leibler Distance; x-axis: Total Number of Samples (Real + Synth); curves coloured by Number of Real Samples.]

Figure 1: Shows how the KLD to F0 changes as we add more synthetic data, starting with increasing amounts of real data. This demonstrates the effectiveness of the optimal discovered βD configuration compared to the closest alternative traditional model in terms of performance with down-weighting w = 0.5; the black dashed line illustrates KLD(F0 ‖ fθ∗) for θ∗ = θG^KLD (left) and θ∗ = θG^βD (right), representing the approximation to F0 given an infinite sample from Gε,δ under the two learning methods, and exhibiting the superiority of the βD.

[Figure 2 appears here: two panels, AUROC (left) and Log Score (right); y-axis: Maximum Effective Real Sample Gain Through the Use of Synthetic Data; x-axis: Number of Real Samples; legend Model Type: w = 0.5, w = 1 (Naive), β = 0.25, β = 0.5, β = 0.75.]

Figure 2: Shows the effective number of real samples gained through optimal m̂ synthetic observations alongside varying amounts of real data usage with respect to the AUROC and log-score performance criteria. These are calculated and presented here via bootstrapped averaging under a logistic regression model learnt on the Framingham dataset. The amount of effective real samples is significantly affected by the learning task's criteria.


[Figure 3 appears here: panel (a) Simulated Gaussian, real n = 10 (y-axis: Wasserstein Distance; x-axis: Number of Synthetic Samples; models: w = 0 (No Syn), w = 0.5, w = 1 (Naive), β = 0.5, β = 0.8) and panel (b) UCI Heart Dataset, real n = 30 (10%) (x-axis: Number of Synthetic Samples; configurations: w = 0 (No Syn), w = 0.5, w = 1 (Naive), β = 0.25, β = 0.5, β = 0.75).]

Figure 3: Given a fixed amount of real data, we can compare model performances directly by focusing on one of the 'branches' in the class of diagrams shown in Figures 1 & 4, to see that the βD's performance falls between that of the noise-aware model and the other models, exhibiting robust and desirable behaviour across a range of β. Naïve and reweighting-based approaches fail to gain significantly over not using synthetic data (shown by w = 0's flat trajectory); the resampled model in the logistic case can also be seen to perform very poorly in comparison to models that leverage the synthetic data.
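For concreteness, the βD loss underpinning these comparisons admits a closed-form integral term for a Gaussian model. The sketch below follows the standard density-power-divergence loss of Basu et al. (1998) (cf. Eq. (9)); the parameter values are illustrative, not taken from the experiments.

```python
import math

def gauss_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def beta_d_loss(x, mu, sigma, beta):
    # Density-power-divergence (betaD) loss for one observation under N(mu, sigma^2):
    #   -(1/beta) f(x)^beta + (1/(1+beta)) * Integral of f^(1+beta),
    # where the Gaussian integral has the closed form
    #   (2 pi sigma^2)^(-beta/2) * (1+beta)^(-1/2).
    integral = (2 * math.pi * sigma ** 2) ** (-beta / 2) / math.sqrt(1 + beta)
    return -gauss_pdf(x, mu, sigma) ** beta / beta + integral / (1 + beta)

# An outlying observation is penalised far less than under the log-score,
# which is the source of the robustness to synthetic-data misspecification.
inlier, outlier = 0.5, 8.0
log_score = lambda x: -math.log(gauss_pdf(x, 0.0, 1.0))
bd = lambda x: beta_d_loss(x, 0.0, 1.0, beta=0.5)
```

Because f(x)^β is bounded, the βD loss saturates for extreme observations where the log-score grows quadratically, consistent with the flat, robust trajectories seen for the βD configurations above.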
[Figure 4 appears here.]

Figure 4: This plot illustrates an interesting and important observation made when varying ε for a GAN-based model: we observe that there is a privacy 'sweet-spot' around ε = 1 whereby more private data performs better than less private data (see the curves for ε = 100, which essentially represent non-private data).¹

¹This plot exhibits the effect under the βD model on the Framingham dataset with β = 0.5, but is observable across all model types in both AUROC and log score.

4.2   Logistic Regression

We now move on to a more prevalent and practical class of models that also exhibits the potentially dangerous behaviours of synthetic data in real-world contexts, via datasets concerning subjects that have legitimate potential privacy concerns. Namely, we build logistic regression models for the UCI Heart Disease dataset (Dua and Graff, 2017) and the Framingham Cohort dataset (Splansky et al., 2007). Clearly, we are now only able to access the empirical distribution Fn∗, where n∗ is the total amount of data present in each dataset. We use x1:nT to train an instance of the aforementioned PATE-GAN Gε,δ and keep back x1:(n∗−nT) for evaluation; we then draw synthetic data samples z1:m ∼ Gε,δ. As before, we investigate how the learning trajectories are affected across the experimental parameter grid.

Again, we consider learning using ℓw and ℓβ applied to the logistic regression likelihood fθ (see A.6.1.2 for exact formulations). In this case we cannot formulate a 'Noise-Aware' model due to the black-box nature of the GAN, highlighting the reality of the model misspecification setting we find ourselves in beyond simple or simulated examples. We can instead define a 'resampled' model that recycles the real data used in formulating the prior.

4.2.1   Results and Discussion

Here the learning trajectories are defined with respect to the AUROC as well as the log score; whilst not technically a divergence, this gives us a decision-theoretic criterion to quantify the closeness of our inference to F0.

Referring to Figures 2, 3 and 4, we see that the learning trajectories observed in this more realistic example mirror those observed in our simulated Gaussian experiments. There are however some cases in which the reweighted posterior outperforms the βD, and we see large discrepancies in m̂ when comparing log score to AUROC values, reinforcing the importance of carefully defining the learning task to prioritise.

Additionally, experiments using synthetic data from a GAN offer the unique observation that performance can actually improve as ε decreases. We believe this is due to potential mode collapse in the GAN learning process on imbalanced datasets, and to concentrations of realisations of Gε,δ as the injected noise increases, such that a small number of synthetic samples can actually be more representative of F0 than even the real data. This effect is short-lived as more synthetic observations are used in learning, as presumably these samples then become over-represented through the posterior distribution and performance begins to deteriorate; see Figure 4 for performance comparisons across ε.

5   Conclusions

We consider foundations of Bayesian learning from synthetic data that acknowledge the intrinsic model misspecification and the learning task at hand. Contrary to traditional statistical inference, conditioning on increasing amounts of synthetic data is not guaranteed to help the learner learn about the true data generating process or make better decisions. Down-weighting the information in synthetic data (either using a weight w or the divergence βD) provides a principled approach to robust optimal information processing and warrants further investigation. Further work could consider augmenting these general robust techniques with tailored adjustments to inference based on the specific synthetic data, acknowledging the discrepancies between its generating density and the Learner's true target.

Acknowledgements

HW is supported by the Feuer International Scholarship in Artificial Intelligence. JJ was funded by the Ayudas Fundación BBVA a Equipos de Investigación Científica 2017 and Government of Spain's Plan Nacional PGC2018-101643-B-I00 grants whilst working on this project. SJV is supported by The Alan Turing Institute (EPSRC grant EP/N510129/1) and the University of Warwick IAA funding. CH is supported by The Alan Turing Institute, Health Data Research UK, the Medical Research Council UK, the Engineering and Physical Sciences Research Council (EPSRC) through the Bayes4Health programme Grant EP/R018561/1,
and AI for Science and Government UK Research and Innovation (UKRI).

References

Charu C Aggarwal and Philip S Yu. A condensation approach to privacy preserving data mining. In Advances in Database Technology - EDBT 2004, pages 183–199. Springer Berlin Heidelberg, 2004.

Zahra Amini and Hossein Rabbani. Letter to the editor: Correction to "the normal-laplace distribution and its relatives". Communications in Statistics-Theory and Methods, 46(4):2076–2078, 2017.

Alessandro Barp, Francois-Xavier Briol, Andrew Duncan, Mark Girolami, and Lester Mackey. Minimum stein discrepancy estimators. In Advances in Neural Information Processing Systems, pages 12964–12976, 2019.

Ayanendranath Basu, Ian R Harris, Nils L Hjort, and MC Jones. Robust and efficient estimation by minimising a density power divergence. Biometrika, 85(3):549–559, 1998.

Albert E Beaton and John W Tukey. The fitting of power series, meaning polynomials, illustrated on band-spectroscopic data. Technometrics, 16(2):147–185, 1974.

James O Berger. Statistical Decision Theory and Bayesian Analysis. Springer Science & Business Media, March 2013.

James O Berger, Elías Moreno, Luis Raul Pericchi, M Jesús Bayarri, José M Bernardo, Juan A Cano, Julián De la Horra, Jacinto Martín, David Ríos-Insúa, Bruno Betrò, et al. An overview of robust bayesian analysis. Test, 3(1):5–124, 1994.

Robert H Berk et al. Limiting behavior of posterior distributions when the model is incorrect. The Annals of Mathematical Statistics, 37(1):51–58, 1966.

José M Bernardo and Adrian FM Smith. Bayesian theory, 2001.

Garrett Bernstein and Daniel R Sheldon. Differentially private bayesian inference for exponential families. In Advances in Neural Information Processing Systems, pages 2919–2929, 2018.

Garrett Bernstein and Daniel R Sheldon. Differentially private bayesian linear regression. In Advances in Neural Information Processing Systems, pages 523–533, 2019.

PG Bissiri, CC Holmes, and Stephen G Walker. A general framework for updating belief distributions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2016.

Anthony D Blaom, Franz Kiraly, Thibaut Lienart, Yiannis Simillides, Diego Arenas, and Sebastian J Vollmer. MLJ: A Julia package for composable Machine Learning. July 2020.

Bob Carpenter, Andrew Gelman, Matthew D Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. Stan: A probabilistic programming language. Journal of Statistical Software, 76(1), 2017.

Andrzej Cichocki, Sergio Cruces, and Shun-ichi Amari. Generalized alpha-beta divergences and their application to robust nonnegative matrix factorization. Entropy, 13(1):134–170, 2011.

Graham Cormode, Tejas Kulkarni, and Divesh Srivastava. Answering range queries under local differential privacy, 2019.

A Philip Dawid. The geometry of proper scoring rules. Annals of the Institute of Statistical Mathematics, 59(1):77–93, 2007.

Yves-Alexandre de Montjoye, Sébastien Gambs, Vincent Blondel, Geoffrey Canright, Nicolas de Cordes, Sébastien Deletaille, Kenth Engø-Monsen, Manuel Garcia-Herranz, Jake Kendall, Cameron Kerry, Gautier Krings, Emmanuel Letouzé, Miguel Luengo-Oroz, Nuria Oliver, Luc Rocher, Alex Rutherford, Zbigniew Smoreda, Jessica Steele, Erik Wetter, Alex Sandy Pentland, and Linus Bengtsson. On the privacy-conscientious use of mobile phone data. Sci Data, 5:180286, December 2018.

Differential Privacy Team at Apple. Learning with privacy at scale. 2017.

Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.

Cynthia Dwork and Jing Lei. Differential privacy and robust statistics. In Proceedings of the forty-first annual ACM symposium on Theory of computing, pages 371–380, 2009.

Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265–284. Springer, 2006.

Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211–407, 2014.

Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. Rappor: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, pages 1054–1067, 2014.

Hong Ge, Kai Xu, and Zoubin Ghahramani. Turing: a language for flexible probabilistic inference. In International Conference on Artificial Intelligence and Statistics, AISTATS 2018, 9-11 April 2018, Playa Blanca, Lanzarote, Canary Islands, Spain, pages 1682–1690, 2018. URL http://proceedings.mlr.press/v84/ge18b.html.

Abhik Ghosh and Ayanendranath Basu. Robust bayes estimation using the density power divergence. Annals of the Institute of Statistical Mathematics, 68(2):413–437, 2016.

Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378, 2007.

Peter Grünwald, Thijs Van Ommen, et al. Inconsistency of bayesian inference for misspecified linear models, and a proposal for repairing it. Bayesian Analysis, 12(4):1069–1103, 2017.

Matthew D Hoffman and Andrew Gelman. The no-u-turn sampler: adaptively setting path lengths in hamiltonian monte carlo. J. Mach. Learn. Res., 15(1):1593–1623, 2014.

CC Holmes and SG Walker. Assigning a value to a power likelihood in a general bayesian model. Biometrika, 104(2):497–503, 2017.

Peter J Huber and EM Ronchetti. Robust statistics, series in probability and mathematical statistics, 1981.

Jack Jewson, Jim Smith, and Chris Holmes. Principles of Bayesian inference using general divergence criteria. Entropy, 20(6):442, 2018.

James Jordon, Jinsung Yoon, and Mihaela van der Schaar. Pate-gan: Generating synthetic data with differential privacy guarantees. In International Conference on Learning Representations, 2018.

Jeremias Knoblauch, Jack Jewson, and Theodoros Damoulas. Doubly robust Bayesian inference for non-stationary streaming data using β-divergences. In Advances in Neural Information Processing Systems (NeurIPS), pages 64–75, 2018.

Jeremias Knoblauch, Jack Jewson, and Theodoros Damoulas. Generalized variational inference. arXiv preprint arXiv:1904.02063, 2019.

Jaewoo Lee and Chris Clifton. How much is enough?

William J Reed. The normal-laplace distribution and its relatives. In Advances in distribution theory, order statistics, and inference, pages 61–74. Springer, 2006.

Luc Rocher, Julien M Hendrickx, and Yves-Alexandre De Montjoye. Estimating the success of re-identifications in incomplete datasets using generative models. Nature Communications, 10(1):1–9, 2019.

Lucas Rosenblatt, Xiaoyan Liu, Samira Pouyanfar, Eduardo de Leon, Anuj Desai, and Joshua Allen. Differentially private synthetic data: Applied evaluations and enhancements. arXiv preprint arXiv:2011.05537, 2020.

David Rossell and Francisco J Rubio. Tractable bayesian variable selection: beyond normality. Journal of the American Statistical Association, 113(524):1742–1758, 2018.

Greta Lee Splansky, Diane Corey, Qiong Yang, Larry D Atwood, L Adrienne Cupples, Emelia J Benjamin, Ralph B D'Agostino Sr, Caroline S Fox, Martin G Larson, Joanne M Murabito, et al. The third generation cohort of the national heart, lung, and blood institute's framingham heart study: design, recruitment, and initial examination. American Journal of Epidemiology, 165(11):1328–1335, 2007.

Jun Tang, Aleksandra Korolova, Xiaolong Bai, Xueqiang Wang, and Xiaofeng Wang. Privacy loss in apple's implementation of differential privacy on macos 10.12. arXiv preprint arXiv:1709.02753, 2017.

The Royal Society. Privacy Enhancing Technologies: protecting privacy in practice. 2019.

UK HDR Alliance. Trusted research environments (TRE). uploads/2020/07/200723-Alliance-Board_Paper-E_TRE-Green-Paper.pdf, 2020. Accessed: 2020-10-15.

Stephen G Walker. Bayesian inference with misspecified models. Journal of Statistical Planning and Inference, 143(10):1621–1633, 2013.

James Watson and Chris Holmes. Approximate models and robust decisions. Statistical Science, 31(4):465–489, 2016.

Liyang Xie, Kaixiang Lin, Shu Wang, Fei Wang, and
  choosing ε for differential privacy. In International      Jiayu Zhou. Differentially private generative adver-
  Conference on Information Security, pages 325–340.         sarial network. arXiv preprint arXiv:1802.06739,
  Springer, 2011.                                            2018.
SP Lyddon, CC Holmes, and SG Walker. General               Jun Zhang, Graham Cormode, Cecilia M Procopiuc,
  bayesian updating and the loss-likelihood bootstrap.       Divesh Srivastava, and Xiaokui Xiao. PrivBayes:
  Biometrika, 2018.                                          Private data release via bayesian networks. ACM
Jeffrey W Miller and David B Dunson. Robust bayesian         Trans. Database Syst., 42(4):1–41, October 2017.
  inference via coarsening. Journal of the American
  Statistical Association, pages 1–13, 2018.