Growing synthetic data through differentially-private vine copulas

Page created by Christopher Brooks
 
CONTINUE READING
Growing synthetic data through differentially-private vine copulas
Proceedings on Privacy Enhancing Technologies ; 2021 (3):122–141

Sébastien Gambs, Frédéric Ladouceur, Antoine Laurent, and Alexandre Roy-Gaumond*

Growing synthetic data through
differentially-private vine copulas
Abstract: In this work, we propose a novel approach for           data portal3 and the American data portal4 , respec-
the synthetization of data based on copulas, which are            tively each with 83 460 and 209 765 datasets. The open
interpretable and robust models, extensively used in the          data movement has also led cities to open more and
actuarial domain. More precisely, our method COPULA-              more data. Concrete examples are the cities of Mon-
SHIRLEY is based on the differentially-private training           tréal5 and Toronto6 , with around 350 datasets each.
of vine copulas, which are a family of copulas allow-                  The major drawback of this augmentation in the
ing to model and generate data of arbitrary dimensions.           availability of data is that it is accompanied by the in-
The framework of COPULA-SHIRLEY is simple yet flex-               crease in privacy risks for the individuals whose records
ible, as it can be applied to many types of data while            are contained in the shared data. Such privacy risks
preserving the utility as demonstrated by experiments             have been demonstrated many times in the literature,
conducted on real datasets. We also evaluate the pro-             even in the situation in which the data was supposed
tection level of our data synthesis method through a              to be anonymized. For instance, in the context of mo-
membership inference attack recently proposed in the              bility data, researchers have shown that any individual
literature.                                                       is unique in a population with respect to his mobil-
                                                                  ity behaviour simply given three to five points of in-
Keywords: Synthetic data, Copulas, Differential privacy,
                                                                  terests (i.e., frequently visited locations) [17]. Follow-
Privacy evaluation.
                                                                  ing the same approach but in the context of financial
DOI 10.2478/popets-2021-0040                                      data, another study has demonstrated that five credit
Received 2020-11-30; revised 2021-03-15; accepted 2021-03-16.     card transactions are sufficient to uniquely pinpoint an
                                                                  individual [18]. Finally, even data that appears to be
                                                                  “harmless” at first sight such as movie ratings could
1 Introduction                                                    lead to privacy risks as illustrated by Narayanan and
                                                                  Shmatikov [48, 59].
With the advent of Big Data and the widespread de-                     To address these issues, multiple anonymization
velopment of machine learning, the sharing of datasets            techniques have been developed in the last two decades.
has become a crucial part of the knowledge discovery              Among these, there has recently been a growing interest
process. For instance, Kaggle1 , a major actor in the             in generative models and synthetic data. A key prop-
data portal platforms, supports more than 19 000 pub-             erty of generative methods is their ability to mince even
lic datasets and the University of California at Irvine           more the link between identities and the data shared in
(UCI)2 repository is another important data portal with           comparison with other anonymization techniques. This
more 550 curated datasets that can be used for ma-                property is due to the summarization of the data to
chine learning purposes. Numerous public institutions             probabilistic representations that can be sampled to
also publicly share datasets such as the Canadian open            produce new profiles that are not associated with a par-
                                                                  ticular identity. Nonetheless, generative models as well
                                                                  as the synthetic data they produce, can still leak infor-
                                                                  mation about the dataset used to train the generative
Sébastien Gambs: UQAM, E-mail:
gambs.sebastien@uqam.ca
                                                                  model and thus lead to privacy breaches. In particular,
Frédéric Ladouceur: Ericsson Montréal, E-mail:                    a recent study by Stadler, Oprisanu and Troncoso [62]
ladouceu@stanford.edu                                             have demonstrated that generative models trained in
Antoine Laurent: UQAM, E-mail: lau-
rent.antoine@courrier.uqam.ca
*Corresponding Author: Alexandre Roy-Gaumond:
UQAM, E-mail: roy-gaumond.alexandre@courrier.uqam.ca              3   https://open.canada.ca/en/open-data
                                                                  4   https://www.data.gov/
1 https://www.kaggle.com/datasets                                 5   https://donnees.ville.montreal.qc.ca/
2 https://archive.ics.uci.edu/ml/datasets.php                     6   https://open.toronto.ca/
Growing synthetic data through differentially-private vine copulas   123

a non-private manner provide little protection against       ing data in a private manner. Most of these methods
inference attacks compared to simply publishing the          rely on generative models that condense the information
original data. They also showed that generative models       in probabilistic models, thus severing the link between
trained in a differentially-private manner did not im-       identity and data through randomness and abstrac-
prove the protection.                                        tion. Common generative models include Bayesian net-
     Thus, in addition to being based on a formal model      works [32], Markov models [63] and neural networks [4].
such as differential privacy, we believe that data synthe-   It is possible to roughly distinguish between two types
sis methods should be assessed with a privacy evalua-        of synthesis methods: partial and complete.
tion based on inference attacks to further quantify the           Partial data synthesis. Partial synthesis implies
privacy protection provided by the synthesis method.         that for a given profile, some attributes are fixed while
In this paper, we propose a copula-based approach to         the rest is sampled by using the generative model. As
differentially-private data synthesis. Our approach aims     a consequence partial synthesis requires both the access
at providing an answer to the following conundrum:How        to part of the original profiles and the trained generative
to generate synthetic data that is private yet realistic     model.
with respect to the original data?                                A seminal work on partial data synthesis is done by
     The proposed generative model is called COPULA-         Bindschadler, Shokri and Gunter in [11], in which they
SHIRLEY, which stands for COPULA-based genera-               develop a framework based on plausible deniability and
tion of SyntHetIc diffeRentialL(E)Y-private data. In a       differential privacy. The generative model used is based
nutshell, COPULA-SHIRLEY constructs a differentially-        on differentially private Bayesian networks capturing
private vine copulas model that privately summarizes         the joint probability distribution of the attributes. In a
the training data and can later be used to generate syn-     nutshell, the “seed-based” synthesis of new records takes
thetic data. We believe that our method has the follow-      a record from the original dataset and “updates” a ran-
ing strengths:                                               dom subset of its attributes through the model. Their
– Simple to use and deploy - due to the nature of the        approach releases a record only if it passes the privacy
     mathematical objects upon which it is based, which      test of being similar to at least k records from the orig-
     are mainly density functions.                           inal dataset, hence the plausible deniability. A notable
– Flexible - due to the highly customizable framework        drawback of this approach is the high computational
     of copulas and their inherent ability to model even     time and the large addition of noise, which makes it dif-
     complex distributions by decomposing them into a        ficult to scale to high-dimensional datasets as pointed
     combination of bivariate copulas.                       out in [15].
– Private - due to the use of differential privacy to             In contrast, complete synthesis methods output new
     train the generative model as well as the privacy       profiles directly from the generative model. Hereafter,
     test for membership inference.                          we focus on complete data synthesis methods.
                                                                  Statistical-based generation. The PrivBayes ap-
The outline of the paper is as follows. First, we re-        proach [77] has become one of the standards in the lit-
view the literature on differentially-private data syn-      erature for complete data synthesis through Bayesian
thesis methods in Section 2 before presenting the back-      networks. PrivBayes integrates differential privacy in
ground notions of differential privacy and copulas theory    each step of the construction of the Bayesian network
in Section 3. Afterwards, we describe in detail in Sec-      while optimizing the privacy budget. More precisely, the
tion 4 the synthesis based vine copulas that we propose      construction applies the Exponential mechanism on the
COPULA-SHIRLEY, followed by an experimental evalu-           mutual information between pairs of attributes and the
ation of the utility-privacy trade-off that it can achieve   Laplace mechanism to protect the conditional distribu-
on three real datasets in Section 5 before concluding.       tions [22]. Moreover, the privacy of the method relies
                                                             solely on the use of differential privacy and no inference
                                                             attacks are provided. In addition, the greedy construc-
                                                             tion of the Bayesian network is time consuming, which
2 Related work on private data                               makes it impractical for high-dimensional data.
  synthesis                                                       Gaussian models are widely used for data mod-
                                                             elling, independently of the properties of the data it-
Although the literature on anonymization methods is          self, as the structure of the data can be accurately
huge, hereafter we focus on the methods for synthesiz-       estimated with only the mean and covariance of the
Growing synthetic data through differentially-private vine copulas   124

data. Recently Chanyaswad, Liu and Mittal have pro-                GAN-based generation. Since the influential
posed a differentially-private method to generate high-       work of Goodfellow and collaborators [28] on Genera-
dimensional private data through the use of random            tive Adversarial Networks (GANs), this approach has
orthonormal projection (RON) and Gaussian models              been leveraged on for private data generation [24, 34,
named RON-Gauss [15]. RON is a dimensionality reduc-          67, 70, 75] to name a few. For instance, the work of Jor-
tion technique that lowers the amount of noise added to       don, Yoon and van der Schaar has introduced PATE-
achieve differential privacy and achieves the Diaconis-       GAN [34], which combines two previous differentially-
Freedman-Meckes (DFM) effect stating that “under              private machine learning techniques of machine learn-
suitable conditions, most projections are approximately       ing, namely Private Aggregation of Teacher Ensembles
Gaussian”. With the DFM effect, Gaussian generative           (PATE) [50] and Differentially Private Generative Ad-
models are then well suited to model the projected data       versarial Network (DPGAN) [74]. The combination of
and are easily made differentially private. The authors       these two is done by using a “student” layer that learns
have implemented the membership inference attack de-          from the differentially-private classifier PATE as the dis-
scribed in [58] as a privacy test and shown that with a       criminator in the DPGAN framework, thus enabling the
privacy budget  < 1.5 the success of the attack is no        training of both a differentially-private generator and
better than a random guess.                                   discriminator. The work of [34] advocates the privacy
     Copula-based generation. Comparable works to             of their synthetic records because of the use of differen-
our use copula functions as generative models [6, 36]. To     tial privacy but do not use any privacy tests to validate
the best of our knowledge, the first differentially-private   it. The main disadvantages of using techniques such as
synthesis method based on copulas is by Li, Xiong and         GANs are that they need a large quantity of data, a fine-
Jiang and is called DP-COPULA [36]. The authors have          grained tuning of their hyper parameters and they have
used copulas to model the joint distributions based on        a high complexity, which makes them inappropriate for
marginal distributions. They use differentially-private       a number of real-life applications.
histograms as marginal distributions and the Gaussian              Privacy evaluation. The amount of data synthe-
copula to estimate the joint distribution. No privacy         sis methods based on generative models trained in a
tests on their differentially-private synthetic data was      differentially private manner as grown significantly in
considered by the authors and much like RON-Gauss             recent years. However, a fundamental question to solve
[15], they simplify the structure to a Gaussian model.        when deploying these models in practice is what value
However, some authors have shown that some tail de-           of  to choose to provide an efficient protection. While
pendencies are not fully capture by Gaussian models,          there is currently no consensus on the “right value” of
which can lead to an important loss of information [38].      , often researchers set  ∈ (0, 2] but it can sometimes
In contrast, as described in Section 3.2, vine copulas,       be greater than  > 105 [37, 46, 50]. Moreover, a recent
which are the basis of COPULA-SHIRLEY, split the mul-         study has shown the importance of providing a privacy
tivariate function into multiple bivariate functions en-      assessment even when the model is trained in a differen-
abling the bivariate families of copulas to successfully      tial privacy manner [62]. In particular, the authors have
capture the dependence structure.                             demonstrated that differential privacy in model train-
     Tree-based generation. A vine copula is a struc-         ing does not provide uniform privacy protection and
ture with nested trees, which can thus be viewed as           that some models fail to provide any protection at all.
a tree-based generation model (TGM). Other PGMs               Thus, while achieving differential privacy is a first step
for synthetic data generation such as Reiter’s work [55]      for data synthesis, we believe that it is also important
and [14] rely on trees to model the conditional proba-        to complement them with methods that can be used
bilities between a variable Y and some other predictor        to evaluate the remaining privacy risks of sharing the
variables Xi , in which i varies from 1 to n, the number      synthetic data.
of attributes. More precisely, each leaf corresponds to a          To the best of our knowledge, very few works on
conditional distribution of Y given ai ≤ Xi ≤ bi . Such       data synthesis achieving differential privacy have also
an approach can be classified as partial data synthesis       proposed methods to assess the remaining privacy risks
due to the use of the predictor variables to form the         related to synthetic data with the exception of [11],
imputed synthetic Y . In comparison, the nested trees of      which assesses the plausible deniability level of the gen-
vine copulas have the advantages to provide fully syn-        erated data. Privacy can be assessed in multiple ways.
thetic data as well as deeper modelling of conditional        A common approach when using differential privacy,
distributions due to a more flexible representation.          which mitigates the risk of membership disclosure risk
Growing synthetic data through differentially-private vine copulas   125

by diminishing the contribution of any individual in the       based on the notation used in the book of Dwork and
model learned, is testing via membership inference at-         Roth [22].
tacks. A recent paper [30] proposed a model-agnostic
method to assess the membership disclosure risk by             Definition 3.1 (Differential privacy (bounded)). Let
quantifying the ease of distinguishing training records        x and y two databases of equal size such that they differ
from other records drawn from a disjoint set with the          at most by one record. A randomized mechanism M is
same distribution.                                             -differentially private such that for the space of possible
    In the machine learning context, the approach              outputs S ⊆ =(M) we have:
recently developed by Meehan, Chaudhuri and Das-
                                                                         Pr[M(x) ∈ S] ≤ exp() · Pr[M(y) ∈ S].
gupta [43] tackled the data-copying issue that can arise
from generative models. Data-copying is defined as a
                                                               This property ensures that the difference in the outputs
form of overfitting in which trained models output iden-
                                                               of the randomized mechanism is indistinguishable up
tical or very similar profiles to the training samples. The
                                                               to exp() for two databases that differ in at most one
proposed framework focuses on global and local detec-
                                                               profile. Here,  is the parameter defining the level of
tion of data-copying by measuring the distribution of
                                                               privacy. In particular, the smaller is the value of , the
distances between the synthetic profiles and their near-
                                                               higher is the protection offered due to the increase in
est neighbours in the training set as well as in a test set.
                                                               indistinguishability.
The main idea here is that a smaller distance between
                                                                    The (global) sensitivity ∆ of a function f is an im-
the synthetic profiles and the training ones compared
                                                               portant notion in differential privacy formalizing the
to the synthetic to test ones is an indication that the
                                                               contribution of a particular individual on the output
data generation process has a tendency to exhibit data-
                                                               of this function. More precisely, the sensibility charac-
copying.
                                                               terizes the answer to the following question: “What is
                                                               the maximal contribution that an individual can have
                                                               on the outcome of a particular function f ?”.
3 Preliminaries                                                     For an arbitrary function f , there exists many ways
                                                               to make it -differentially private. For instance, the
In this section, we review the background notions nec-         Laplacian mechanism [22] can be used to make a numer-
essary to the understanding of our work, namely dif-           ical function returning a real or a set of reals f : D → Rk
ferential privacy as the privacy model upon which our          -differentially private by drawing randomly from the
method is built as well as copulas and vine copulas as         Laplacian distribution Lap( ∆f  ).
the underlying generative model.                                    Another important concept in differential privacy is
                                                               the notion of the privacy budget, which bounds the total
                                                               amount of information leaked about the input database
3.1 Differential privacy                                       when applying several mechanisms on this database.
                                                               The application of several mechanisms do not always
Differential Privacy (DP) is a privacy model that aims         have the same impact on the privacy budget depending
at providing strong privacy guarantees with respect to         on whether these mechanisms are applied in a sequential
the amount of information that can leak about individ-         or parallel manner. The following theorems are inherent
uals that are part of a database [22]. In a nutshell, DP       properties of differential privacy and help to adequately
ensures that the contribution of a particular profile will     manage the privacy budget.
have a limited impact on the outcome of a computation               In the following, R and R0 are arbitrary sets, which
run on the input database. This property is achieved           refer to the image of the functions defined in the theo-
by ensuring (e.g., through the addition of noise to the        rems.
output or by randomizing the computation) that the
distribution of outputs of the computation or the query        Theorem 3.1 (Closure under post-processing [22]).
is “almost the same” whether or not the profile of an          Let M : D → R be a -differentially private randomized
individual was in the input database.                          mechanism and f : R → R0 be an arbitrary function,
    For our framework, we use the bounded defini-              independent of D, then the composition f ◦ M : D → R0
tion of differential privacy. The following definitions are    is also -differentially private.
Growing synthetic data through differentially-private vine copulas                126

A subtle but important point of Theorem 3.1 of the                     Estimating the copula function with a known distribu-
closure under post-processing is that the function f has               tion family results in modelling the dependence struc-
to independent of D, which means that it should not                    ture of the data. Assuming a variable of interest of di-
have access to or use any information about D. If this                 mension n, the modelling by copula is generally done by
is not the case, the theorem does not apply anymore.                   first choosing a family of copulas and then verifying the
                                                                       “goodness-of-fit” of this modelling, in the same manner
Theorem 3.2 (Sequential composition [22]). Let M1 :                    that a known distribution can be used to estimate an
D → R and M2 : D → R0 be two randomized mecha-                         arbitrary distribution.
nisms that are respectively 1 and 2 -differentially pri-                  Figure 1 [27] illustrates simulations from five com-
vate, then the composition (M1 , M2 )(x) : D → R × R0                  mon bivariate copulas and the tail dependence they help
is (1 + 2 )-differentially private.                                  to model. We refer the interested reader to [33, 49] for
                                                                       further reading about the many existing families of cop-
Theorem 3.3 (Parallel composition [22]). Let M1 :                      ulas and their tail dependence modelling (each of these
D1 → R and M2 : D2 → R0 be two randomized mech-                        books defines and discusses more than 20 families of
anisms that are respectively 1 and 2 -differentially pri-            copulas).
vate, such that D1 , D2 ⊂ D and D1 ∩ D2 = ∅, then the
composition (M1 , M2 )(x) : D → R × R0 is max(1 , 2 )-
differentially private.

Theorems 3.2 and 3.3 can easily generalized to k sequen-
tial mechanisms or disjoint subsets.

3.2 Copulas and vine copulas

Copulas. Historically, copulas date back from 1959
with Abe Sklar that has introduced this notion in his
seminal paper [60] at which time they were mainly theo-
retical statistical tools. However, in the last decade they
have experienced a significant growth in many domains                  Fig. 1. Simulations from common copulas family.
including earth science for modelling atmospheric pre-
cipitation [8], in health for diagnostic tests [31], in fi-
nances for the study of financial time series [51], in social              Sklar’s theorem is a fundamental tool in copula the-
sciences for modelling the different usages of a technol-              ory that binds together the marginals and the copula,
ogy [35], in genetics for the study of phenotypes [29] and             which means that we only need the marginals of the
even more recently in privacy for estimating the risk of               data to model the dependence structure.
re-identification [56].
                                                                       Theorem 3.4 (Sklar’s Theorem [60]). Let                        F        be
     In a nutshell, a copula is a multivariate Cumula-
                                                                       a multivariate cumulative distribution function
tive Density Function (CDF) on the unit cube with the
                                                                       with marginals FXi with i                          ∈    {1, 2, . . . , n}.
marginal distributions (or simply marginals) on its axes
                                                                       Then, there exists a copula C such that
modelling the dependence between the said marginals.
                                                                       F (X1 , X2 , . . . , Xn ) = C(FX1 , FX2 , . . . , FXn ). If all
The formal definition of a copula is as follows:
                                                                       the marginals are continuous, then C is unique.
Definition 3.2 (Copula [49]). Let (X1 , X2 , . . . , Xn ) be           Conversely, if FXi are univariate distribution func-
a random vector such as the associated cumulative                      tions and C is a copula, then C(u1 , u2 , . . . , un ) =
                                                                            −1            −1                                       −1
density functions FXi are continuous. The copula of                    F (FX 1
                                                                               (u1 ), FX    2
                                                                                              (u2 ), . . . , FXn )−1 (un )) with FX  i
                                                                                                                                          being
(X1 , X2 , . . . , Xn ), denoted by C(U1 , U2 , . . . , Un ), is de-   the inverse CDF of FXi .
fined as the multivariate cumulative density function of
                                                                       Finally, a last fundamental theorem of the domain of
(U1 , U2 , . . . , Un ) in which Ui = FXi (Xi ) are the standard
                                                                       copulas is the Probability Integral Transform (PIT) as
uniform marginal distributions of (X1 , X2 , . . . , Xn ).
                                                                       well as the Inverse PIT.
Growing synthetic data through differentially-private vine copulas                 127

Theorem 3.5 (PIT [19]). Let X be a random variable                        With this definition, each edge e ∈ Ei is associated
with FX as its cumulative density function (CDF). The               to (the density of) a bivariate copula cje ,ke |De , in which
variable Y = FX (X) is uniform standard.                            je , ke are the conditioned variables over the condition-
                                                                    ing variables set De . Thus, the conditioned variables
Theorem 3.6 (Inverse PIT [19]). Let U be a uniform                  and the conditioning set form the conditional variables
standard random variable and F an arbitrary CDF, then               Uje |De and Uke |De . The existence of such conditional
Y = F −1 (U ) is a random variable with F as its CDF.               variables is guaranteed by the conditions on the trees
                                                                    Ti , i ∈ {1, 2, . . . , n − 1} [9, 20].
The PIT and its inverse are used to transform data into                   To help visualize what is a vine, Figure 2 [1] shows
uniform standard distribution for the copulas to model              an example of a vine on 5 variables. In this example, on
and back into the original domain after the generation              the tree T4 , the conditioned variables are 1 and 5 and the
of new observations.                                                conditioning set is {2, 3, 4} forming the variables U1|2,3,4
     When applied on experimental data, uniform stan-               and U5|2,3,4 .
dard marginals take the form of pseudo-observations
(i.e., normalized ranked data) that are computed via
the PIT. The pseudo-observations are then used to find
the best fitted copula model. Figure 1 illustrates such an
example in which data points (i.e., pseudo-observations)
are shown to best fit various copula models. The vine
copulas described hereafter offer a more refined manner
to describe the dependency structure.
     Vine copulas. Vine copulas were introduced by
Bedford and Cooke in the early 2000s to refine the mod-
elling capabilities of copulas [9, 10]. The principle be-
hind them is simple: they decompose the multivariate
functions of the copula into multiple “couples” of cop-             Fig. 2. Example of a vine on 5 variables.
ulas. A vine on n variables is a set of nested connected
trees (T1 , T2 , . . . , Tn ) such that edges of a tree Ti corre-
                                                                        From a vine V , it is possible to describe a multivari-
sponds to vertexes in the following tree Ti+1 . Regular
                                                                    ate density function as the product of bivariate copulas.
vines refer to a subset of vines constructed in a more
                                                                    The following theorem implies that the joint density of a
constrained way. As we only use regular vines in this
                                                                    multivariate random variables can always be estimated
paper, the term “vines” will refer to regular vines here-
                                                                    via vine copulas.
after.
                                                                    Theorem 3.7 (Vine density decomposition [9]). Let
Definition 3.3 (Regular vine [9]). A regular vine on n
                                                                    V be a vine on n variables, then there is a unique
variables contains n − 1 connected nested trees denoted
                                                                    density function f such as
T1 , T2 , . . . , Tn−1 . Let Ni and Ei be the set of nodes and
                                                                                                     n
edges of the tree Ti . These trees satisfy the following                                             Y
                                                                        f (X1 , X2 , . . . ,Xn ) =         fi (Xi )
properties:
                                                                                                     i=1
1. T1 is the tree that contains n nodes corresponding                                     n−1
     to the n variables.
                                                                                          Y      Y                                         
                                                                                                           cje ,ke |De Uje |De , Uke |De
2. For i ∈ {2, 3, . . . , n − 1}, the nodes set of Ti is such                             m=1 e∈Em
     as Ni = Ei−1 .
3. For i ∈ {2, 3, . . . , n−1}, {a, b} ∈ Ei with a = (a1 , a2 )     in which Uje |De        = Fje |De (Xje |De ) and Uke |De                   =
     and b = (b1 , b2 ), it must hold that |a ∩ b| = 1.             Fke |De (Xke |De ).

Condition 3. is called the proximity condition and can              Formal and thorough definitions of vine copulas are
be translated to two nodes in tree Ti are connected by              available in [9, 10, 20].
an edge if and only if these nodes, as edges, share a                   The construction of a vine is non-trivial as there are
                                                                    approximately n!      (n−2
                                                                                            2 ) different vines on n variables.
common node in tree Ti−1 .                                                           2 ·2
                                                                    The current state-of-the-art technique of construction
Growing synthetic data through differentially-private vine copulas   128

uses a top-down greedy heuristic known as the Diss-           of the Lasso [69] method and the use of Reinforcement
mann’s algorithm [20].                                        Learning (RL) with Long Short Term Memory net-
     Dissmann’s algorithm for vine copula selec-              works (LSTM) for near-optimal vine selection. Another
tion. The technique known as the Dissmann’s algorithm         approach found in the literature uses auto-encoders
[20] is based on the hypothesis that the top of the tree      on the data (here it could be used on the private
is more important in modelling the data than the lower        pseudo-observations) to mitigate the curse of dimension-
branches as they contain raw information about the cor-       ality [66]. A recent paper proposed a differentially pri-
relation between attributes. Dissmann’s algorithm can         vate graphical model for low-dimensional marginals [42]
be summarized as:                                             that could also be beneficial to our work. This work
1. Choose the optimal tree that maximizes the sum of          defines an ad hoc framework for accurately and pri-
     correlation coefficients between variables.              vately estimate marginals and improve existing genera-
2. Choose the optimal copula family for each pair of          tive models such as PrivBayes. We leave the investiga-
     attributes (imposed by the tree) that minimizes the      tion of these approaches as future works.
     information loss.
3. Construct all the following trees (from the previous
     optimal tree) that respect the regular vine defini-
     tion 3.3 and return to Step 1. or stop if no tree can
                                                              4 COPULA-SHIRLEY
     be constructed.
                                                              COPULA-SHIRLEY is a simple and flexible framework
                                                              for private data synthesis through vine copulas. In this
Step 1 is done by computing all the possible linear trees
                                                              section, we will first provide an overview of the algo-
(i.e., paths) and assigning to each edge the correlation
                                                              rithm, before describing in detail each of its subcompo-
coefficients between the two variables represented by the
                                                              nents. Finally, we present our privacy analysis as well
vertices. The path with the maximal sum of the corre-
                                                              as the membership inference attack that we have used
lation coefficients is chosen. For Step 2, each pair of
                                                              to evaluate the privacy level provided by our method.
attributes linked in the path are fitted on bivariate cop-
ula functions and the best fitted one is chosen using a
model selection criterion (such as Akaike Information
                                                              4.1 Overview
Criterion [3] or Bayesian Information Criterion [57]).
     Note that due to the sequential aspect of the algo-      Algorithm 1 outlines the framework of our method. The
rithm, it can be stopped at any level [12]. A truncated       COPULA-SHIRLEY algorithm takes as input a dataset
vine copula will have independent bivariate copulas to        D as well as a parameter , representing the privacy
model the conditional variables for all the trees below       budget. The last two input parameters of COPULA-
the threshold. We will refer to the truncation level as       SHIRLEY, nGen and EncodingM ethod correspond re-
Ψ.                                                            spectively to the number of synthetic data points to
     The complexity of this heuristic lies on the fitting     generate and which method of encoding for the categor-
of n bivariate copulas on the first tree T1 , n − 1 copulas   ical attributes to use.
on the second tree T2 and so on until the last bivariate           Note that copulas are originally mathematical ob-
copula on the last tree Tn−1 , in which n is the number       jects used for modelling continuous numerical variables.
of attributes. If e(m) is the complexity of estimating a      This means that the application of copulas to categori-
bivariate copula given m records, then the complexity of      cal attributes requires an adequate preprocessing of the
the Dissmann’s algorithm is given by O(e(m) × n × Ψ).         data, which is done by the P reprocessing function. Us-
     The critical drawback of a sequential heuristic such     ing the preprocessed data, the algorithm can tackle the
as Dissmann’s algorithm is that reaching the optimal          task of building a vine copula for modelling the data in a
solution is not guaranteed, in particular for high dimen-     differentially-private manner. Afterwards, this vine cop-
sions. Moreover, the greedy algorithm needs to compute        ula can be sampled for producing new synthetic data.
the spanning tree and to choose a copula family for ev-
ery pair of nodes at each level of the vine leading to
important computational time with the increase in di-         4.2 Preprocessing
mensionality.
     Recent advances regarding vine copulas construc-         As directly cited from [36]: “Although the data should
tion such as [45] and [65] propose respectively the use       be continuous to guarantee the continuity of margins,
Growing synthetic data through differentially-private vine copulas   129

 Algorithm 1: COPULA-SHIRLEY                                 erence attribute given the (encoded) attribute’s value.
   Input: Dataset: D, global privacy budget: ,              The GLMM can be seen as an extension of logistic re-
           number of records to generate: nGen,              gression in which the encoding of an attribute is com-
           encoding method : EncodingM ethod                 puted by the expected value of an event (i.e., encoded
   Output: Differentially-private synthetic                  attribute) given the values of the predictor.
             records: Dsyn                                        The preprocessing method shown in Algorithm 2
 1 (pseudoObs, dpCDF s) ←−                                   converts categorical values into dummy variables (if nec-
    Preprocess(D, , EncodingM ethod)                        essary), computes the differentially-private CDFs from
    (Section 4.2)                                            the noisy histograms and outputs the noisy pseudo-
 2 vineM odel ←− SelectVineCopula(pseudoObs)                 observations needed for the vine copula model construc-
    (Section 4.3)                                            tion. In the following, we describe in more details these
 3 Dsyn ←−                                                   different processes.
    GenerateData(vineM odel, nGen, dpCDF s)                       Split training. Since our framework needs two
    (Section 4.4)                                            sequential training processes: learning differentially-
 4 return Dsyn                                               private cumulative functions and learning the vine-
                                                             copula model, we rely on the parallel composition (The-
                                                             orem 3.3) of differential privacy by splitting the dataset
discrete data in a large domain can still be considered      D into two subsets instead of using the sequential com-
as approximately continuous as their cumulative density      position (Theorem 3.2) and splitting the global bud-
functions do not have jumps, which ensures the conti-        get . In this manner, we efficiently use the modelling
nuity of margins.” This implies that if treated as con-      strengths of copulas functions by sacrificing data points
tinuous, discrete data can be modelled with copulas.         to reduce the noise added via the differentially-private
Thus, copulas can be used to model a wide range of           mechanism (such as the Laplacian mechanism) while
data without the need of much preprocessing. However,        preserving much of the data’s utility. Algorithm 2, line 3
an important issue arises when copulas are applied on        shows the process of splitting the dataset into two sub-
categorical data as such data do not have an ordinal         sets.
scale and copulas mostly use rank correlation for mod-
elling. In this situation, one common trick is to view         Algorithm 2: Preprocessing
categorical data simply as discrete data in which the            Input: Dataset: D, global privacy budget: ,
order is chosen arbitrarily (e.g., by alphabetical order),               encoding method : EncodingM ethod
which is known as ordinal encoding. This trick has been          Output: Differentially-private
proven to be a downfall for vine copula modelling for                      pseudo-observations: pseudoObs,
some metrics, as it will be discussed in Section 5.                        differentially-private CDFs: dpCDF s
     Another technique to deal with categorical at-            1 D ← EncodingMethod(D)
tributes is the use of dummy variables, in which cat-          2 (dpT rainSet, modelT rainSet) ←
egorical values are transformed into binary indicator             SplitDataset(D)
variables [64] (this technique is also known as one-hot        3 dpHistograms ←
encoding). The use of dummy variables helps preserve              ComputeDPHistograms(dpT rainSet, )
the rank correlation between the attributes used by the        4 foreach hist in dpHistograms do
copulas (pseudo-observations).                                 5    cdf ← CumulSum(hist[binCounts])
                                                                              Sum(hist[binCounts])
     Two other supervised encoding techniques known
                                                               6      dpCDF s[col] ← cdf
as the Weight of Evidence (WOE) encoding [72] and
the Generalized Linear Mixed Model (GLMM) encod-               7   foreach col in modelT rainSet do
ing [13, 71] have been evaluated. Both need a reference        8      cdf ← dpCDF s[col]
attribute (called a predictor) and encode the categori-        9      pseudoObs[col] ← cdf (modelT rainSet[col])
cal attributes in a way that maximizes the correlation        10   return (pseudoObs, dpCDF S)
between the pair of encoded and reference attributes.
The WOE encoder can only be used with a binary ref-
erence attribute and is computed using the natural log           Computation of DP histograms. The estima-
of the number of 1’s over the number of 0’s of the ref-      tion of differentially-private histograms is the key pro-
Growing synthetic data through differentially-private vine copulas   130

cess of COPULA-SHIRLEY. The current implementation            use of vine copulas in such context is more complex, as
computes naively a differentially-private histogram by        for each pair of copulas, a noisy estimation of its pa-
adding Laplacian noise of mean 0 and scale ∆    with ∆ =     rameters is required, which results in an overall huge
2 to each bin count of every histogram (in the bounded        injection of noise. Such a framework would also need a
setting of DP, the global sensitivity ∆ of histogram com-     differentially-private implementation of the Dissmann’s
putation is 2). A more sophisticated method, such as          algorithm, which is a complex task.
the ones defined in [2, 73, 76], could possibly be used to         In contrast, COPULA-SHIRLEY reduces consider-
improve on the utility of the synthetic data.                 ably the noise added and the complexity of the im-
     Computations of CDFs. Cumulative and proba-              plementation by computing the dependencies on the
bility density functions are intrinsically linked together    noisy marginal densities rather than on the original
and it is almost trivial to go from one to another. With      data, thus enabling the use of vine copulas in a pri-
continuous functions, CDFs are estimated via integra-         vate manner. COPULA-SHIRLEY is generic in the sense
tion of the PDF curves. In a discrete environment like        that it can be used with any tree or copula selection
ours, as we use histograms to estimate PDFs, a sim-           criterion, any algorithm for vine construction and any
ple normalized cumulative sum over the bin counts pro-        method for differentially-private histograms or PDFs.
vides a good enough estimation of the CDFs, which is          The complexity of COPULA-SHIRLEY is mainly im-
shown on line 6 of Algorithm 2. This is similar to the        pacted by the algorithm for selecting the vine copula as
approach proposed by the Harvard University Privacy           the complexity of ComputeDPHistograms, ComputeCDFs
Tools Project in the lecture notes about DP-CDFs [44],        and ComputePseudoObs is O(nd), in which n is the num-
in which they add noise to the cumulative sum counts          ber of profiles and d is the number of attributes (i.e., di-
and then normalize. Our method always produces a              mensions) describing each profile. This latest cost is neg-
strictly monotonically increasing function, whereas their     ligible compared to the cost of Dissmann’s algorithm.
approach tends to produce non-monotonic jagged CDFs.
A strictly monotonic increasing function is desirable
                                                                Algorithm 3: SelectVineCopula
as it means that the theoretical CDF always has an
inverse. Non-monotonic jagged CDFs can produce er-                Input: Pseudo-observations: pseudoObs
ratic behaviours when transforming data to pseudo-                Output: Vine Copula Model: vineM odel
                                                                1 vineM odel ←
observations and especially when mapping back to the
natural scale with the inverse CDFs.                               DissmannVineSelection(pseudoObs)
     Computation of pseudo-observations. As cop-                   (Section 3.2)
                                                                2 return (vineM odel)
ulas can only model uniform standard marginals (i.e.,
pseudo-observations, we need to transform our data.
Easily enough, the PIT states that for a random vari-
able X and its CDF FX , we have that FX (X) is uniform            Algorithm 3 outlines the general process for the
standard. Algorithm 2 does this at lines 8-10 by first ex-    vine selection. As stated earlier, our method only needs
tracting the corresponding CDF (line 9) before applying       the differentially-private uniform standard pseudo-
the said CDF onto the data (disjoint from the set used        observations in order to select a vine. Dissmann’s al-
for computing the CDFs) (line 10).                            gorithm is then used with the noisy observations, and
                                                              only these observations, to fit the best vine copula. The
                                                              current implementation of COPULA-SHIRLEY uses the
4.3 Construction of the vine copula                           implementation of Dissmann’s algorithm from the R li-
                                                              brary rvinecopulib [47]. It offers a complete and highly
The two existing previous works at the intersection of        configurable method for vine selection as well as some
differential privacy and copulas are restricted to Gaus-      optimization techniques for reducing the computational
sian copulas [6, 36] because it makes it easier to build      time for building the vine.
them in a differentially private manner. Indeed, it is pos-
sible to represent a Gaussian copula by the combination
of a correlation matrix and the marginal densities. In        4.4 Generation of synthetic private data
such case, it is enough to introduce appropriate noise
in these two mathematical structures to obtain a cop-         The last step of our framework is the generation
ula model that is differentially-private by design. The       of synthetic data. To realize this, uniform standard
Growing synthetic data through differentially-private vine copulas   131

observations must be sampled from the vine copula            pseudoObs, we simply apply the differentially-private
model selected via the previous method. As we use the        CDFs to the held-to model training set. The resulting
rvinecopulib implementation of the vine selection al-        pseudoObs data is -differentially-private due to the ap-
gorithm, we naturally use their implementation of the        plication of a -differentially-private mechanism and by
observation sampling from vines. Line 1 of Algorithm 4       the Parallel Composition theorem.
refers to the implementation of this sampling. We re-             DissmannVineSelection only uses the -differentially-
fer the interested reader to [20] for more details about     private set of data pseudoObs for its selection, resulting
how to sample from nested conditional probabilities as       in a -differentially-private vine structure by the Closure
defined by the vine copula structure.                        Under Post-Processing property. Both, SampleFromVine
     To map the sampled observations back to their           and InverseFunction only use private data and therefore
original values, we use the Inverse Probability Integral     do not violate the Closure Under Post-Processing prop-
Transform (Inverse PIT), which only requires the in-         erty. Finally, the whole process is closed and never vi-
verse CDFs of the attributes. This process is shown on       olates the Closure Under Post-Processing property of
lines 2 to 5 of Algorithm 4. This last step concludes the    differential privacy, as the algorithm only operates in-
framework of differentially-private data generation with     dependently for each attribute and therefore makes use
COPULA-SHIRLEY.                                              of the Parallel Composition property; thus Algorithm 1
                                                             is -differentially-private.

 Algorithm 4: GenerateData
   Input: Vine Copula Model: vineM odel,
                                                             4.6 Privacy evaluation through
          Number of records to generate: nGen,
          Differentially-private CDFs: dpCDF s                   membership inference
   Output: Synthetic records: Dsyn
                                                             As stated earlier, we believe that one should not rely
 1 synthObs ←
                                                             solely on the respect of a formal model such as differ-
    SampleFromVine(vineM odel, nGen)
                                                             ential privacy but also that synthetic data should be
 2 foreach col in synthObs do
                                                             assessed with respect to a privacy test based on infer-
 3    cdf ← dpCDF s[col]
                                                             ence attacks to quantify the privacy protection provided
 4    invcdf ← InverseFunction(cdf )
                                                             by the synthesis method.
 5    Dsyn [col] ← invcdf (synthObs[col])
                                                                  In this work, we opted for the Monte Carlo mem-
 6   return Dsyn                                             bership inference attack (MCMIA) introduced by Hil-
                                                             precht, Härterich and Bernau [30] to assess the privacy
                                                             of our method. Simply put, MCMIA quantifies the risk
                                                             of pinpointing training records from other records drawn
                                                             from the same distribution given synthetic profiles. One
4.5 Differential privacy analysis                            of the benefits of MCMIA is that it is a non-parametric
                                                             and model-agnostic attack. In addition, MCMIA pro-
It should be clear by now that COPULA-SHIRLEY relies         vides high accuracy in situations of model overfitting
solely on differentially-private histograms, which makes     in generative models and outperforms previous attacks
the whole framework differentially-private by design.        based on shadow models.
                                                                  The framework of the attack is as follows. Let ST
Theorem 4.1 (DP character of COPULA-SHIRLEY).                be a subset of m records from the training set of the
Algorithm 1 is -differentially-private.                     model and SC be a subset of m control records. Control
                                                             records are defined as records from the same distribu-
Proof. The method ComputeDPHistograms provides -            tion as the training ones but never used in the training
differentially-private histograms by the Laplacian mech-     process. Let x be a record from the global data domain
anism theorem [22]. The computation of dpCDF s only          D; Ur (x) = {x0 ∈ D | d(x, x0 ) ≤ r} is defined as the
uses the previous differentially-private histograms to       neighbourhood of x with respect to the distance d (i.e.
compute the CDFs, in a parallel fashion; thus these          the set of records x0i close to x). A synthetic record g
density functions are -differentially-private as per the    from a generative model G is more likely to be similar
Closure Under Post-Processing and Parallel Composi-          to a record x as the probability P [g ∈ Ur (x)] increases.
tion properties of differential privacy. To obtain the       The probability P [g ∈ Ur (x)] is estimated via the Monte
Growing synthetic data through differentially-private vine copulas   132

                                      Pn
Carlo integration: P [g ∈ Ur (x)] ≈ n1 i=1 1gi ∈Ur (x) , in      ated by COPULA-SHIRLEY. In addition, we compare it
which gi are synthetic samples from G.                           with two -differentially private data synthesis meth-
    To further refine the attack, the authors propose            ods presented in Section 2, namely PrivBayes [77]
an alternative estimation of P [g ∈ Ur (x)] based on the         and DP-Copula [36]. In McKay’s and Snoke’s arti-
distance between x and its neighbours :                          cle about the NIST challenge on differentially-private
                            n                                    data synthesis [41], the authors ranked PrivBayes in
                       1X
    P [g ∈ U (x)] ≈      1gi ∈Ur (x) log (d(x, gi ) + η)        the top 5, supporting its usage in our comparative
                       n
                         i=1                                     work. As COPULA-SHIRLEY can be seen as a refine-
in which η is an arbitrary small value set to avoid log 0        ment of the DP-Copula model with the introduction
(we used η = 10−12 ). The function                               of vines, we wanted to compare our method to a
                        n                                        differentially-private Gaussian copula approach. Our ex-
                   1X
      fˆM C (x) :=    1gi ∈Ur (x) log (d(x, gi ) + η)            perimental setup is available as a Python script at:
                   n
                       i=1
                                                                 https://github.com/alxxrg/copula-shirley.
will be the record-scoring function. From this definition,
if the record x obtains a high score, the synthetic sam-
ples gi are very likely to be close to x and implies an          5.1 Experimental setting
overfitted model over x.
     To compute the privacy score of a synthetic dataset,        Datasets. Three datasets of various dimensions have
given SG a set of n synthetic samples from a generative          been used in our experiments. The first one is the UCI
model G, we first compute fˆM C (x) over SG for all x ∈          Adult dataset [21], which contains 32 561 profiles. Each
ST ∪ SC . Afterwards, we take the m records from the             profile is described by 14 attributes such as gender, age,
union set ST ∪ SC with the highest fˆM C scores to form          marital status and native country. The attributes are
the set I. By computing the ratio of records that are            mostly categorical (8), the rest is discrete (6).
both in ST and I, we obtain the Privacy Score of the                  The second dataset used is COMPAS [5], which con-
model over the set ST :                                          sists of records from criminal offenders in Florida during
                                    card(ST ∩ I)                 2013 and 2014. It is the smallest set of the three and con-
          P rivacyScore(ST ) :=
                                      card(ST )                  tains 10 568 profiles, each described with 13 attributes
P rivacyScore(ST ) can be interpreted as the ratio of            quite similar to Adult that are either discrete or cate-
training records successfully distinguished from control         gorical with the same proportion.
records. From this definition, a score of 1 means that the            The third dataset used is Texas Hospital [68] from
full training set has been successfully recovered, thus          which we uniformly sampled 150 000 records from a set
implying a privacy breach while a score of 0.5 means             of 636 140 records and selected 17 attributes, 11 of which
that the records from the training and control sets are          are categorical. We sample down this dataset to reduce
indistinguishable among the synthetic profiles.                  the burden of the computational task, mainly for the
     In our experiments, we found out that using a               PrivBayes algorithm.
coarser distance like the Hamming distance provide                    Parameters for data synthesis. To evaluate the data
higher membership inference scores than the Euclidean            synthesis, we have run a k-fold cross-validation tech-
distance. To set the value of r, the size of the neigh-          nique [25] with k = 5. For each fold, all generative mod-
bourhood, we used the median heuristic as defined by             els are learned on the training set of this fold and then
the authors as it was the one providing the highest ac-          synthesized the same number of records. All evaluation
curacy:                                                          metrics are measured by using the fold’s left-out test set
                                                               and the newly generated synthetic data. For the privacy
          r = median1≤i≤2m min d(xi , gj )                       budget, we have tried various values for this parame-
                                 1≤j≤n
                                                                 ter in the range  ∈ [0.0, 8.0, 1.0, 0.1, 0.01] (here  = 0
in which xi ∈ ST ∪ SC and gj ∈ SG .
                                                                 means that the DP is deactivated), similar to the pa-
                                                                 rameters used in the comparative study of the NIST
                                                                 competition for evaluating differentially-private synthe-
5 Experiments                                                    sis method [41].
                                                                      For the categorical encoders, we have used the
In this section, we investigate experimentally the               Python library category_encoders [40]. For the super-
privacy-utility trade-off of the synthetic data gener-           vised encoders (WOE & GLMM), to avoid leaking in-
Growing synthetic data through differentially-private vine copulas   133

formation about the training set, we train the encoders       5.2 Utility testing
on a disjoint set that is the fold’s left-out test set. By
default, we used the WOE encoder as it was shown to           Statistical tests. Statistical tests can be used to quantify
provide slightly better results (see Figure 4). For the       the similarity between the original dataset and the syn-
membership inference attack implementation discussed          thetic one. We use the Kolmogorov-Smirnov (KS) dis-
in Section 4.6, the control set used is the fold’s left-out   tance to estimate the fidelity of the distributions of the
set.                                                          synthetic data compared to the original data. The KS
     Implementation details. As previously mentioned,         distance is computed per attributes, the reported scores
COPULA-SHIRLEY uses the implementation of Diss-               in the results section representing the average score over
mann’s algorithm from the R library rvinecopulib [47].        all attributes. We also evaluate the variation in the cor-
In our tests, we used all the default parameters of the       relation matrices between the raw and synthetic data
vine selection method, which can be found in the refer-       using the Spearman’s ρ rank correlation. The score rep-
ence section [47], except for two parameters: the trunca-     resent the mean absolute difference of the correlation
tion level Ψ and the measure of correlation for the opti-     coefficients between the two matrices. In the following
mal tree selection. We opted to stop at the second level      section, we refer to this score as the correlation delta
of the tree (Ψ = 2). The truncation level has been thor-      metric. See Appendix A for more details on these tests.
oughly studied and we show that a deeper vine model                Classification tasks. Classification tasks are comple-
does not drastically improve the model as shown in Ap-        mentary to statistical tests in the sense that they sim-
pendix C. We also opted for the Spearman’s ρ rank cor-        ulate a specific use case, which is the prediction of a
relation as it is the preferred statistic when ties occur     specific attribute, thus helping to evaluate if the cor-
in the data [53].                                             relations between attributes are preserved with respect
     As stated earlier, we have split the training set into   to the classification task considered. In this setting, the
two disjoint subsets with a ratio of 50/50, respectively      utility is measured by training two classifiers. The first
for the differentially-private histograms and the pseudo-     classifier is learned on a training set (i.e., the same train-
observations. The impact of different ratios is illustrated   ing set used by the generative models), drawn from the
in Figure 3. Finally, for a more refined representation,      original data while the second one is trained on the syn-
we choose to use as many bins for our histograms as           thetic data produced by a generative model. Afterwards,
there are unique values in the input data. By default,        the classifiers are tested on the same test set, which is
we use the Laplacian mechanism for computing the DP-          drawn from the original data but disjoint from the train-
histograms (see, however, Appendix B for the impact on        ing set. Ideally, a classifier trained on the synthetic data
the utility of other DP mechanisms).                          would have a classification performance similar to the
     PrivBayes. We use the implementation of PrivBayes        one trained on the original data. The comparison be-
referenced by [41] called DataSynthesizer [52]. Apart         tween classifiers is done through the use of the Matthews
from the privacy budget , the implementation of              Correlation Coefficient (MCC) [39], which is a measure
PrivBayes we have run has only one parameter, which           of the quality of a classification (see Appendix A). Note
is the maximal number of parents nodes in the Bayesian        that the MCC of a random classifier is close to ≈ 0.0
network. We use the default parameter which is 3.             while the score of perfect classifier would be 1.
     DP-Copula. While we did not find an official imple-           In our experiments, we evaluated the synthetic data
mentation of Li, Xiong and Jiang [36], we discovered          on two classification tasks: on a binary classification
and used an open-source implementation available on           problem and a multi-class classification problem. We
GitHub [54]. In addition to the privacy budget, the only      opted for gradient boosting [23] classifier as this algo-
parameter of DP-Copula is used to tune how this bud-          rithm is known for its robustness to overfitting and its
get is divided between the computation of the marginal        performance that is often close to the state-of-the-art
densities and the correlation matrix in a differentially-     methods, such as deep neural networks, on many classi-
private manner. We choose to set the value of this pa-        fication tasks. We use the XGBoost [16] implementation
rameter so that half of the privacy budget is dedicated       of the gradient boosting algorithm.
to the computation of the marginal densities and half              To further deepen our analysis, we also evaluated
to the computation of the correlation matrix.                 the synthetic data over a simple linear regression task.
                                                              To evaluate its success, we computed the root mean
Growing synthetic data through differentially-private vine copulas                                               134

square error (RMSE):                                                                    0.6                                                                           0.5

                                                                                        0.5
                                                                                                                                                                      0.4

                                                                     Binary Class MCC
                            r Pn

                                                                                                                                                                            Multi Class MCC
                                 i=0 (y˜i   − yi )2                                     0.4
           RMSE(Y, Ỹ ) =                                                                                                                                             0.3
                                      n                                                 0.3
                                                                                                                                                                      0.2
                                                                                        0.2
in which Y are the true values and Ỹ are the linear                                                                                                                  0.1
                                                                                        0.1
model’s outputs.
                                                                                        0.0                                                                           0.0
    For the Adult dataset, the binary classification                                          COMPAS   Texas   Adult                  COMPAS    Texas         Adult
                                                                                                                                                                      0.14
problem is over the attribute salary, the multi-class
                                                                                        1.0                                                                           0.12
classification problem over the attribute relationship
                                                                                                                                                                      0.10

                                                                                                                                                                                 Mean Corr Delta
                                                                                        0.8

                                                                     LinReg RMSE
and the linear regression over age. On COMPAS,                                                                                                                        0.08
                                                                                        0.6
the binary classification task is over the attribute                                                                                                                  0.06
                                                                                        0.4
is_violent_recid, the multi-class problem on race and                                                                                                                 0.04
                                                                                        0.2
the linear regression on the attribute decile_score.                                                                                                                  0.02

                                                                                        0.0                                                                           0.00
For the classification and regression tasks on Texas,                                         COMPAS   Texas   Adult                  COMPAS    Texas         Adult

the attributes ETHNICITY, TYPE_OF_ADMISSION and                                  0.10
                                                                                                                                                                      1.0
TOTAL_CHARGES are used respectively for the binary,                              0.08
                                                                                                                                                                      0.8
multi-class and regression problems.

                                                                                                                                                                            Execution time
                                                                   KS Dist
                                                                                 0.06
                                                                                                                                                                      0.6

                                                                                 0.04
                                                                                                                                                                      0.4

5.3 Results                                                                      0.02                                                                                 0.2

                                                                                 0.00                                                                                 0.0
                                                                                              COMPAS   Texas   Adult                  COMPAS    Texas         Adult
Splitting ratios for the vine copula model. Recall that                                                                Splitting ratios
in our preprocessing step (Algorithm 2, the training                                                     0.1     0.3            0.5       0.7           0.9

dataset is split in two sets, one for learning the DP-CDFs
                                                                  Fig. 3. Impact of the splitting ratios for the vine copula model
and the other for computing pseudo-observations. We
                                                                  with  = 1.0. The linear regression RMSE and the execution
first evaluate the splitting ratios between the two sets.         time are normalized. For the MCC, higher is better, while the
As shown in Figure 3, most metrics are stable across              other metrics, lower is better. Measures are averaged over a 5-
the different ratios. One exception is the KS distance,           fold cross-validation run.
which exhibits an increase when the ratio given for the
pseudo-observations is higher, giving the importance of
                                                                  that PrivBayes provides better score than COPULA-
the DP-CDFs. As there is no clear consensus in the data,
                                                                  SHIRLEY for most of the other classification tasks. In
we opted for a ratio of 50/50 for the other experiments.
                                                                  addition, DP-Copula and DP-Histograms failed com-
                                                                  pletely at the two classification tasks. Our approach
     Encoding of the categorical attributes. The proper
                                                                  and PrivBayes performed well on the regression task
encoding of the categorical attributes is crucial for our
                                                                  compared to DP-Copula. PrivBayes generated the data
approach. As shown in Figure 4, both classification tasks
                                                                  that with the smallest correlation coefficients difference
display lower performances with the ordinal encoding.
                                                                  from the original datasets except for the Texas dataset
In addition, the one-hot encoding could not be tested
                                                                  in which COPULA-SHIRLEY provides the best results
on Texas as the number of attributes augment from 17
                                                                  overall. In addition, COPULA-SHIRLEY is always the
to 5375, due to the increase in the computational bur-
                                                                  preferred one for generating faithful distributions. The
den as our approach scales linearly with the number
                                                                  privacy evaluation demonstrates that a smaller privacy
of attributes. The one-hot encoding also perform badly
                                                                  budget does not necessarily mean a lower risk of mem-
for the KS distance and the regression task in addition
                                                                  bership inference. Furthermore, all models do not pro-
of considerably increasing the execution time. We opted
                                                                  vide a consistent protection over all the datasets. While
for the WOE encoder as it performed best for both clas-
                                                                  PrivBayes offers the best protection on the smallest
sification tasks compared to the GLMM encoder.
                                                                  dataset (i.e., COMPAS), COPULA-SHIRLEY is the best
     Comparative study of the models. Figure 5 illus-
                                                                  on the biggest dataset (i.e., Texas).
trates the scores for each privacy budget over the three
                                                                      Figure 6 exhibits the global scores for each gener-
datasets and four models separately while Figure 6 dis-
                                                                  ative model’s synthetic data. PrivBayes produced the
plays the score over all the values of the privacy bud-
                                                                  best results overall for the classification tasks and some
get combined. One of the trends that we observed is
You can also read