Differential Privacy at Risk: Bridging Randomness and Privacy Budget

Proceedings on Privacy Enhancing Technologies; 2021 (1):64–84

Ashish Dandekar*, Debabrota Basu*, and Stéphane Bressan

Abstract: The calibration of noise for a privacy-preserving mechanism depends on the sensitivity of the query and the prescribed privacy level. A data steward must make the non-trivial choice of a privacy level that balances the requirements of users and the monetary constraints of the business entity.

Firstly, we analyse the roles of the sources of randomness, namely the explicit randomness induced by the noise distribution and the implicit randomness induced by the data-generation distribution, that are involved in the design of a privacy-preserving mechanism. The finer analysis enables us to provide stronger privacy guarantees with quantifiable risks. Thus, we propose privacy at risk, a probabilistic calibration of privacy-preserving mechanisms. We provide a composition theorem that leverages privacy at risk. We instantiate the probabilistic calibration for the Laplace mechanism by providing analytical results.

Secondly, we propose a cost model that bridges the gap between the privacy level and the compensation budget estimated by a GDPR-compliant business entity. The convexity of the proposed cost model leads to a unique fine-tuning of the privacy level that minimises the compensation budget. We show its effectiveness by illustrating a realistic scenario that avoids overestimation of the compensation budget by using privacy at risk for the Laplace mechanism. We quantitatively show that composition using the cost-optimal privacy at risk provides a stronger privacy guarantee than the classical advanced composition. Although the illustration is specific to the chosen cost model, it naturally extends to any convex cost model. We also provide realistic illustrations of how a data steward uses privacy at risk to balance the trade-off between utility and privacy.

Keywords: Differential privacy, cost model, Laplace mechanism

DOI 10.2478/popets-2021-0005
Received 2020-05-31; revised 2020-09-15; accepted 2020-09-16.

*Corresponding Author: Ashish Dandekar: DI ENS, ENS, CNRS, PSL University & Inria, Paris, France, E-mail: adandekar@ens.fr
*Corresponding Author: Debabrota Basu: Dept. of Computer Sci. and Engg., Chalmers University of Technology, Göteborg, Sweden, E-mail: basud@chalmers.se
Stéphane Bressan: National University of Singapore, Singapore, E-mail: steph@nus.edu.sg

1 Introduction

Dwork et al. [12] quantify the privacy level ε in ε-differential privacy (or ε-DP) as an upper bound on the worst-case privacy loss incurred by a privacy-preserving mechanism. Generally, a privacy-preserving mechanism perturbs the results by adding a calibrated amount of random noise to them. The calibration of noise depends on the sensitivity of the query and the specified privacy level. In a real-world setting, a data steward must specify a privacy level that balances the requirements of the users and the monetary constraints of the business entity. For example, Garfinkel et al. [14] report on issues encountered when deploying differential privacy as the privacy definition by the US Census Bureau. They highlight the lack of analytical methods to choose the privacy level. They also report empirical studies that show the loss in utility due to the application of privacy-preserving mechanisms.

We address the dilemma of a data steward in two ways. Firstly, we propose a probabilistic quantification of privacy levels. Probabilistic quantification of privacy levels provides a data steward with a way to take quantified risks under the desired utility of the data. We refer to this probabilistic quantification as privacy at risk. We also derive a composition theorem that leverages privacy at risk. Secondly, we propose a cost model that links the privacy level to a monetary budget. This cost model helps the data steward to choose the privacy level constrained on the estimated budget and vice versa. Convexity of the proposed cost model ensures the existence of a unique privacy at risk that minimises the budget. We show that the composition with an optimal privacy at risk provides stronger privacy guarantees than the traditional advanced composition [12].
In the end, we illustrate a realistic scenario that exemplifies how the data steward can avoid overestimation of the budget by using the proposed cost model together with privacy at risk.

The probabilistic quantification of privacy levels depends on two sources of randomness: the explicit randomness induced by the noise distribution and the implicit randomness induced by the data-generation distribution. Often, these two sources are coupled with each other. We require analytical forms of both sources of randomness as well as an analytical representation of the query to derive a privacy guarantee. Computing the probabilistic quantification of different sources of randomness is generally a challenging task. Although we find multiple probabilistic privacy definitions in the literature [16, 27]¹, we miss an analytical quantification bridging the randomness and the privacy level of a privacy-preserving mechanism. We propose a probabilistic quantification, namely privacy at risk, that further leads to an analytical relation between privacy and randomness. We derive a composition theorem with privacy at risk for mechanisms with the same as well as varying privacy levels. It is an extension of the advanced composition theorem [12] that deals with a sequential and adaptive use of privacy-preserving mechanisms. We also prove that privacy at risk satisfies convexity over privacy levels and a weak relaxation of the post-processing property. To the best of our knowledge, we are the first to analytically derive the proposed probabilistic quantification for the widely used Laplace mechanism [10].

¹ A widely-used (ε, δ)-differential privacy is not a probabilistic relaxation of differential privacy [29].

The privacy level proposed by the differential privacy framework is too abstract a quantity to be integrated in a business setting. We propose a cost model that maps the privacy level to a monetary budget. The proposed model is a convex function of the privacy level, which further leads to a convex cost model for privacy at risk. Hence, it has a unique probabilistic privacy level that minimises the cost. We illustrate this using a realistic scenario in a GDPR-compliant business entity that needs an estimation of the compensation budget that it needs to pay to stakeholders in the unfortunate event of a personal data breach. The illustration, which uses the proposed convex cost model, shows that the use of probabilistic privacy levels avoids overestimation of the compensation budget without sacrificing utility. The illustration naturally extends to any convex cost model.

In this work, we comparatively evaluate the privacy guarantees using privacy at risk of the Laplace mechanism. We quantitatively compare the composition under the optimal privacy at risk, which is estimated using the cost model, with traditional composition mechanisms – the basic and advanced mechanisms [12]. We observe that it gives stronger privacy guarantees than the ones obtained by the advanced composition without sacrificing the utility of the mechanism.

In conclusion, the benefits of the probabilistic quantification, i.e., of privacy at risk, are twofold. It not only quantifies the privacy level for a given privacy-preserving mechanism but also facilitates decision-making in problems that focus on the privacy-utility trade-off and the compensation budget minimisation.

2 Background

We consider a universe of datasets D. We explicitly mention when we consider that the datasets are sampled from a data-generation distribution G with support D. Two datasets of equal cardinality x and y are said to be neighbouring datasets if they differ in one data point. A pair of neighbouring datasets is denoted by x ∼ y. In this work, we focus on a specific class of queries called numeric queries. A numeric query f is a function that maps a dataset into a real-valued vector, i.e. f : D → R^k. For instance, a sum query returns the sum of the values in a dataset.

In order to achieve a privacy guarantee, researchers use a privacy-preserving mechanism, or mechanism in short, which is a randomised algorithm that adds noise to the query from a given family of distributions. Thus, a privacy-preserving mechanism of a given family, M(f, Θ), for the query f and the set of parameters Θ of the given noise distribution, is a function M(f, Θ) : D → R. In the case of numeric queries, R is R^k. We denote a privacy-preserving mechanism as M when the query and the parameters are clear from the context.

Definition 1 (Differential Privacy [12]). A privacy-preserving mechanism M, equipped with a query f and with parameters Θ, is (ε, δ)-differentially private if for all Z ⊆ Range(M) and x, y ∈ D such that x ∼ y:

    P(M(f, Θ)(x) ∈ Z) ≤ e^ε · P(M(f, Θ)(y) ∈ Z) + δ.

An (ε, 0)-differentially private mechanism is also simply said to be ε-differentially private. Often, ε-differential privacy is referred to as pure differential privacy, whereas (ε, δ)-differential privacy is referred to as approximate differential privacy.

A privacy-preserving mechanism provides perfect privacy if it yields indistinguishable outputs for all neighbouring input datasets. The privacy level ε quantifies the privacy guarantee provided by ε-differential privacy. For a given query, the smaller the value of ε, the qualitatively higher the privacy. A randomised algorithm that is ε-differentially private is also ε′-differentially private for any ε′ > ε.

In order to satisfy ε-differential privacy, the parameters of a privacy-preserving mechanism require a calculated calibration. The amount of noise required to achieve a specified privacy level depends on the query. If the output of the query does not change drastically for two neighbouring datasets, then a small amount of noise is required to achieve a given privacy level. The measure of such fluctuations is called the sensitivity of the query. The parameters of a privacy-preserving mechanism are calibrated using the sensitivity of the query, which quantifies the smoothness of a numeric query.

Definition 2 (Sensitivity). The sensitivity of a query f : D → R^k is defined as

    Δf ≜ max_{x,y∈D, x∼y} ‖f(x) − f(y)‖_1.

The Laplace mechanism is a privacy-preserving mechanism that adds scaled noise sampled from a calibrated Laplace distribution to the numeric query.

Definition 3 ([35]). The Laplace distribution with mean zero and scale b > 0 is a probability distribution with probability density function

    Lap(b) ≜ (1/(2b)) exp(−|x|/b),

where x ∈ R. We write Lap(b) to denote a random variable X ∼ Lap(b).

Definition 4 (Laplace Mechanism [10]). Given any function f : D → R^k and any x ∈ D, the Laplace mechanism is defined as

    L_ε^{Δf}(x) ≜ M(f, Δf/ε)(x) = f(x) + (L_1, ..., L_k),

where each L_i is drawn from Lap(Δf/ε) and added to the i-th component of f(x).

Theorem 1 ([10]). The Laplace mechanism, L_{ε0}^{Δf}, is ε0-differentially private.
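
For concreteness, the following is a minimal Python sketch of Definitions 2–4; the clipped-sum query, the dataset, and all constants are arbitrary illustrative choices.

```python
import numpy as np

def laplace_mechanism(f, dataset, sensitivity, eps, rng=None):
    """Minimal sketch of the Laplace mechanism of Definition 4.

    f           -- numeric query returning a scalar or a length-k vector
    sensitivity -- the L1 sensitivity Delta_f of Definition 2
    eps         -- the privacy level epsilon
    """
    rng = rng or np.random.default_rng()
    value = np.atleast_1d(np.asarray(f(dataset), dtype=float))
    noise = rng.laplace(loc=0.0, scale=sensitivity / eps, size=value.shape)
    return value + noise

# Example: a sum query over records clipped to [0, 1] has sensitivity 1.
data = np.clip(np.random.default_rng(1).normal(0.5, 0.2, size=100), 0.0, 1.0)
noisy_sum = laplace_mechanism(lambda d: d.sum(), data, sensitivity=1.0, eps=0.5)
```
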
3 Privacy at Risk: A Probabilistic Quantification of Randomness

The parameters of a privacy-preserving mechanism are calibrated using the privacy level and the sensitivity of the query. A data steward needs to choose an appropriate privacy level for practical implementation. Lee et al. [25] show that the choice of an actual privacy level by a data steward in regard to her business requirements is a non-trivial task. Recall that the privacy level in the definition of differential privacy corresponds to the worst-case privacy loss. Business users are, however, used to taking and managing risks, if the risks can be quantified. For instance, Jorion [21] defines Value at Risk, which is used by risk analysts to quantify the loss in investments for a given portfolio and an acceptable confidence bound. Motivated by the formulation of Value at Risk, we propose the use of a probabilistic privacy level. It provides us with a finer tuning of an ε0-differentially private privacy-preserving mechanism for a specified risk γ.

Definition 5 (Privacy at Risk). For a given data-generation distribution G, a privacy-preserving mechanism M, equipped with a query f and with parameters Θ, satisfies ε-differential privacy with a privacy at risk 0 ≤ γ ≤ 1 if, for all Z ⊆ Range(M) and x, y sampled from G such that x ∼ y:

    P[ ln( P(M(f, Θ)(x) ∈ Z) / P(M(f, Θ)(y) ∈ Z) ) > ε ] ≤ γ,    (1)

where the outer probability is calculated with respect to the probability space Range(M ◦ G) obtained by applying the privacy-preserving mechanism M on the data-generation distribution G.

If a privacy-preserving mechanism is ε0-differentially private for a given query f and parameters Θ, then for any privacy level ε ≥ ε0 the privacy at risk is 0. We are interested in quantifying the risk γ with which an ε0-differentially private privacy-preserving mechanism also satisfies a stronger ε-differential privacy, i.e., with ε < ε0.
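
The left-hand side of Inequality (1) can be estimated by straightforward Monte Carlo simulation. The sketch below does so for a one-dimensional Laplace mechanism on a fixed pair of neighbouring datasets, i.e. it only exercises the explicit randomness of the noise; the loss sampler and all numeric values are illustrative choices.

```python
import numpy as np

def estimate_risk(loss_sampler, eps, n_trials=200_000, seed=0):
    """Monte Carlo estimate of the fraction of draws whose privacy loss
    exceeds eps, i.e. the smallest gamma for which Inequality (1) holds."""
    rng = np.random.default_rng(seed)
    losses = np.fromiter((loss_sampler(rng) for _ in range(n_trials)),
                         dtype=float, count=n_trials)
    return float(np.mean(losses > eps))

def laplace_loss_sampler(fx, fy, delta_f, eps0):
    """Privacy loss of a one-dimensional Laplace mechanism with sensitivity
    delta_f and level eps0, for fixed neighbouring query values fx and fy."""
    b = delta_f / eps0
    def sample(rng):
        r = fx + rng.laplace(scale=b)            # output on dataset x
        return (abs(r - fy) - abs(r - fx)) / b   # ln p_x(r) / p_y(r)
    return sample

# Example: eps0 = 1, sensitivity 1, and a stronger target level eps = 0.5.
print(estimate_risk(laplace_loss_sampler(0.0, 1.0, 1.0, 1.0), eps=0.5))
```
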
Unifying Probabilistic and Random DP

Interestingly, Equation (1) unifies the notions of probabilistic differential privacy and random differential privacy by accounting for both sources of randomness in a privacy-preserving mechanism.
Machanavajjhala et al. [27] define probabilistic differential privacy, which incorporates the explicit randomness of the noise distribution of the privacy-preserving mechanism, whereas Hall et al. [16] define random differential privacy, which incorporates the implicit randomness of the data-generation distribution. In probabilistic differential privacy, the outer probability is computed over the sample space of Range(M) and all datasets are equally probable.

Connection with Approximate DP

Despite a resemblance with probabilistic relaxations of differential privacy [13, 16, 27] due to the added parameter δ, (ε, δ)-differential privacy (Definition 1) is a non-probabilistic variant [29] of regular ε-differential privacy. Indeed, unlike the auxiliary parameters in probabilistic relaxations, such as γ in privacy at risk (ref. Definition 5), the parameter δ of approximate differential privacy is an absolute slack that is independent of the sources of randomness. For a specified choice of ε and δ, one can analytically compute a matching value of δ for a new value of ε.² Therefore, as other probabilistic relaxations, privacy at risk cannot be directly related to approximate differential privacy. An alternative is to find out a privacy at risk level γ for a given privacy level (ε, δ) while the original noise satisfies (ε0, δ).

² For any 0 < ε′ ≤ ε, any (ε, δ)-differentially private mechanism also satisfies (ε′, e^ε − e^{ε′} + δ)-differential privacy.

Theorem 2. If a privacy-preserving mechanism satisfies (ε, γ)-privacy at risk, it also satisfies (ε, γ)-approximate differential privacy.

We obtain this reduction as the probability measure induced by the privacy-preserving mechanism and the data-generation distribution on any output set Z ⊆ Range(M) is additive.³ The proof of the theorem is in Appendix A.

³ The converse is not true, as explained before.

3.1 Composition Theorem

The application of ε-differential privacy to many real-world problems suffers from the degradation of the privacy guarantee, i.e., the privacy level, over the composition. The basic composition theorem [12] dictates that the privacy guarantee degrades linearly in the number of evaluations of the mechanism. The advanced composition theorem [12] provides a finer analysis of the privacy loss over multiple evaluations with a square-root dependence on the number of evaluations. In this section, we provide the composition theorem for privacy at risk.

Definition 6 (Privacy loss random variable). For a privacy-preserving mechanism M : D → R, any two neighbouring datasets x, y ∈ D and an output r ∈ R, the value of the privacy loss random variable C is defined as:

    C(r) ≜ ln( P(M(x) = r) / P(M(y) = r) ).

Lemma 1. If a privacy-preserving mechanism M satisfies ε0-differential privacy, then P[|C| ≤ ε0] = 1.

Theorem 3. For all ε0, ε, γ, δ > 0, the class of ε0-differentially private mechanisms, which satisfy (ε, γ)-privacy at risk under a uniform data-generation distribution, is (ε′, δ)-differentially private under n-fold composition, where

    ε′ = ε0 √(2n ln(1/δ)) + nμ

and μ = ½ [γε² + (1 − γ)ε0²].

Proof. Let M_{1...n} : D → R_1 × R_2 × ... × R_n denote the n-fold composition of privacy-preserving mechanisms {M_i : D → R_i}, i = 1, ..., n. Each ε0-differentially private M_i also satisfies (ε, γ)-privacy at risk for some ε ≤ ε0 and an appropriately computed γ. Consider any two neighbouring datasets x, y ∈ D. Let

    B = { (r_1, ..., r_n) : ⋀_{i=1}^{n} P(M_i(x) = r_i) / P(M_i(y) = r_i) > e^ε }.

Using the technique in [12, Theorem 3.20], it suffices to show that P(M_{1...n}(x) ∈ B) ≤ δ. Consider

    ln[ P(M_{1...n}(x) = (r_1, ..., r_n)) / P(M_{1...n}(y) = (r_1, ..., r_n)) ]
      = ln ∏_{i=1}^{n} P(M_i(x) = r_i) / P(M_i(y) = r_i)
      = Σ_{i=1}^{n} ln[ P(M_i(x) = r_i) / P(M_i(y) = r_i) ] ≜ Σ_{i=1}^{n} C^i,    (2)

where C^i in the last line denotes the privacy loss random variable related to M_i.

Consider an ε-differentially private mechanism M_ε and an ε0-differentially private mechanism M_{ε0}.
Let M_{ε0} satisfy (ε, γ)-privacy at risk for ε ≤ ε0 and an appropriately computed γ. Each M_i can be simulated as the mechanism M_ε with probability γ and the mechanism M_{ε0} otherwise. Therefore, the privacy loss random variable for each mechanism M_i can be written as

    C^i = γ C^i_ε + (1 − γ) C^i_{ε0},

where C^i_ε denotes the privacy loss random variable associated with the mechanism M_ε and C^i_{ε0} denotes the privacy loss random variable associated with the mechanism M_{ε0}. Using [5, Remark 3.4], we can bound the mean of every privacy loss random variable as:

    μ ≜ E[C^i] ≤ ½ [γε² + (1 − γ)ε0²].

We have a collection of n independent privacy loss random variables C^i such that P[|C^i| ≤ ε0] = 1. Using Hoeffding's bound [18] on the sample mean, for any β > 0,

    P[ (1/n) Σ_i C^i ≥ E[C^i] + β ] ≤ exp( −nβ² / (2ε0²) ).

Rearranging the inequality by renaming the upper bound on the probability as δ, we get:

    P[ Σ_i C^i ≥ nμ + ε0 √(2n ln(1/δ)) ] ≤ δ.

Theorem 3 is an analogue, in the privacy at risk setting, of the advanced composition of differential privacy [12, Theorem 3.20] under a constraint of independent evaluations. Note that if one takes γ = 0, then we obtain the exact same formula as in [12, Theorem 3.20]. It provides a sanity check for the consistency of composition using privacy at risk.
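
As a worked illustration of Theorem 3, the bound can be evaluated directly; the numbers below are arbitrary illustrative choices, and setting γ = 0 reproduces the classical advanced composition formula mentioned above.

```python
import math

def composed_epsilon(eps0, eps, gamma, n, delta):
    """Privacy level after n-fold composition under Theorem 3 (a sketch)."""
    mu = 0.5 * (gamma * eps ** 2 + (1 - gamma) * eps0 ** 2)
    return eps0 * math.sqrt(2 * n * math.log(1 / delta)) + n * mu

print(composed_epsilon(1.0, 0.5, 0.0, n=20, delta=1e-5))  # gamma = 0: advanced composition
print(composed_epsilon(1.0, 0.5, 0.4, n=20, delta=1e-5))  # privacy at risk tightens the n*mu term
```
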

Corollary 1 (Heterogeneous Composition). For all ε_l, ε, γ_l, δ > 0 and l ∈ {1, ..., n}, the composition of {ε_l}_{l=1}^{n}-differentially private mechanisms, which satisfy (ε, γ_l)-privacy at risk under a uniform data-generation distribution, also satisfies (ε′, δ)-differential privacy, where

    ε′ = 2 √( (Σ_{l=1}^{n} ε_l²) ln(1/δ) ) + μ

and μ = ½ [ ε² Σ_{l=1}^{n} γ_l + Σ_{l=1}^{n} (1 − γ_l) ε_l² ].

Proof. The proof follows from the same argument as that of Theorem 3: bounding the loss random variable at step l using γ_l C^l_ε + (1 − γ_l) C^l_{ε_l} and then applying the concentration inequality.

A detailed discussion and analysis of proving such heterogeneous composition theorems is available in [22, Section 3.3].
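
In the same spirit as the sketch above, the bound of Corollary 1 for mechanisms with heterogeneous privacy levels can be evaluated numerically; the lists of levels and risks below are illustrative.

```python
import math

def heterogeneous_epsilon(eps_levels, gammas, eps, delta):
    """Composed privacy level of Corollary 1 for eps_l-DP mechanisms that
    each satisfy (eps, gamma_l)-privacy at risk (a sketch)."""
    mu = 0.5 * (eps ** 2 * sum(gammas)
                + sum((1 - g) * e ** 2 for g, e in zip(gammas, eps_levels)))
    return 2 * math.sqrt(sum(e ** 2 for e in eps_levels) * math.log(1 / delta)) + mu

print(heterogeneous_epsilon([1.0, 0.8, 1.2], [0.3, 0.5, 0.2], eps=0.5, delta=1e-5))
```
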
In fact, if we consider both sources of randomness, the expected value of the loss function must be computed by using the law of total expectation:

    E[C] = E_{x,y∼G}[ E[C | x, y] ].

Therefore, the exact computation of privacy guarantees after the composition requires access to the data-generation distribution. We assume a uniform data-generation distribution while proving Theorem 3. We can obtain better and finer privacy guarantees by accounting for the data-generation distribution, which we keep as future work.

3.2 Convexity and Post-Processing

We show that privacy at risk satisfies the convexity property and does not satisfy the post-processing property.

Lemma 2 (Convexity). For a given ε0-differentially private privacy-preserving mechanism, privacy at risk satisfies the convexity property.

Proof. Let M be a mechanism that satisfies ε0-differential privacy. By the definition of privacy at risk, it also satisfies (ε1, γ1)-privacy at risk as well as (ε2, γ2)-privacy at risk for some ε1, ε2 ≤ ε0 and appropriately computed values of γ1 and γ2. Let M1 and M2 denote the hypothetical mechanisms that satisfy (ε1, γ1)-privacy at risk and (ε2, γ2)-privacy at risk, respectively. We can write the privacy loss random variables as follows:

    C^1 ≤ γ1 ε1 + (1 − γ1) ε0,
    C^2 ≤ γ2 ε2 + (1 − γ2) ε0,

where C^1 and C^2 denote the privacy loss random variables for M1 and M2.

Let us consider a privacy-preserving mechanism M that uses M1 with a probability p and M2 with a probability (1 − p) for some p ∈ [0, 1]. By using the techniques in the proof of Theorem 3, the privacy loss random variable C for M can be written as:

    C = p C^1 + (1 − p) C^2 ≤ γ′ ε′ + (1 − γ′) ε0,

where

    ε′ = ( p γ1 ε1 + (1 − p) γ2 ε2 ) / ( p γ1 + (1 − p) γ2 )
and γ′ = 1 − p γ1 − (1 − p) γ2.

Thus, M satisfies (ε′, γ′)-privacy at risk. This proves that privacy at risk satisfies convexity [23, Axiom 2.1.2].

Meiser [29] proved that a relaxation of differential privacy that provides probabilistic bounds on the privacy loss random variable does not satisfy the post-processing property of differential privacy. Privacy at risk is indeed such a probabilistic relaxation.

Corollary 2 (Post-processing). Privacy at risk does not satisfy the post-processing property for every possible mapping of the output.

Though privacy at risk is not preserved after post-processing, it yields a weaker guarantee in terms of approximate differential privacy after post-processing. The proof involves a reduction of privacy at risk to approximate differential privacy and the preservation of approximate differential privacy under post-processing.

Lemma 3 (Weak Post-processing). Let M : D → R ⊆ R^k be a mechanism that satisfies (ε, γ)-privacy at risk and f : R → R′ be any arbitrary data-independent mapping. Then f ◦ M : D → R′ also satisfies (ε, γ)-approximate differential privacy.

Proof. Let us fix a pair of neighbouring datasets x and y, and also an event Z′ ⊆ R′. Let us define the pre-image of Z′ as Z ≜ {r ∈ R : f(r) ∈ Z′}. Now, we get

    P(f ◦ M(x) ∈ Z′) = P(M(x) ∈ Z)
      ≤ e^ε P(M(y) ∈ Z) + γ    (a)
      = e^ε P(f ◦ M(y) ∈ Z′) + γ,

where step (a) is a direct consequence of Theorem 2.

4 Privacy at Risk for Laplace Mechanism

The Laplace and Gaussian mechanisms are widely used privacy-preserving mechanisms in the literature. The Laplace mechanism satisfies pure ε-differential privacy whereas the Gaussian mechanism satisfies approximate (ε, δ)-differential privacy. As previously discussed, it is not straightforward to establish a connection between the non-probabilistic parameter δ of approximate differential privacy and the probabilistic bound γ of privacy at risk. Therefore, we keep privacy at risk for the Gaussian mechanism as future work.

In this section, we instantiate privacy at risk for the Laplace mechanism in three cases: two cases involving the two sources of randomness and a third case involving their coupled effect. These three cases correspond to three different interpretations of the confidence level, represented by the parameter γ, corresponding to three interpretations of the support of the outer probability in Definition 5. In order to highlight this nuance, we denote the confidence levels corresponding to the three cases and their three sources of randomness as γ1, γ2, and γ3, respectively.

4.1 The Case of Explicit Randomness

In this section, we study the effect of the explicit randomness induced by the noise sampled from the Laplace distribution. We provide a probabilistic quantification for fine-tuning the Laplace mechanism. We fine-tune the privacy level for a specified risk by assuming that the sensitivity of the query is known a priori.

For a Laplace mechanism L_{ε0}^{Δf} calibrated with sensitivity Δf and privacy level ε0, we present the analytical formula relating the privacy level ε and the risk γ1 in Theorem 4. The proof is available in Appendix B.

Theorem 4. The risk γ1 ∈ [0, 1] with which a Laplace mechanism L_{ε0}^{Δf}, for a numeric query f : D → R^k, satisfies a privacy level ε ≥ 0 is given by

    γ1 = P(T ≤ ε) / P(T ≤ ε0),    (3)

where T is a random variable that follows a distribution with the density function

    P_T(t) = ( 2^{1−k} t^{k−1/2} K_{k−1/2}(t) ε0 ) / ( √(2π) Γ(k) Δf ),

where K_{k−1/2} is the Bessel function of the second kind.

Figure 1a shows the plot of the privacy level against the risk for different values of k and for a Laplace mechanism L_{1.0}^{1.0}. As the value of k increases, the amount of noise added to the output of the numeric query increases. Therefore, for a specified privacy level, the privacy at risk level increases with the value of k.

The analytical formula representing γ1 as a function of ε is bijective. We need to invert it to obtain the privacy level ε for a privacy at risk γ1.
However, the analytical closed form for such an inverse function is not explicit. We use a numerical approach to compute the privacy level for a given privacy at risk from the analytical formula of Theorem 4.

Result for a Real-valued Query. For the case k = 1, the analytical derivation is fairly straightforward. In this case, we obtain an invertible closed form of the privacy level for a specified risk. It is presented in Equation 4:

    ε = ln( 1 / (1 − γ1 (1 − e^{−ε0})) ).    (4)
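
For k = 1, Equation 4 and its inverse are straightforward to evaluate numerically; the following sketch does so with illustrative values.

```python
import math

def eps_for_risk(gamma1, eps0):
    """Privacy level for a specified risk gamma1 when k = 1 (Equation 4)."""
    return math.log(1.0 / (1.0 - gamma1 * (1.0 - math.exp(-eps0))))

def risk_for_eps(eps, eps0):
    """Inverse of Equation 4: the gamma1 corresponding to a privacy level eps."""
    return (1.0 - math.exp(-eps)) / (1.0 - math.exp(-eps0))

print(eps_for_risk(0.5, 1.0))                       # fine-tuned privacy level
print(risk_for_eps(eps_for_risk(0.5, 1.0), 1.0))    # recovers gamma1 = 0.5
```
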
Remarks on ε0. For k = 1, Figure 1b shows the plot of the privacy level ε versus the privacy at risk γ1 for the Laplace mechanism L_{ε0}^{1.0}. As the value of ε0 increases, the probability of the Laplace mechanism generating a higher value of noise reduces. Therefore, for a fixed privacy level, the privacy at risk increases with the value of ε0. The same observation is made for k > 1.

4.2 The Case of Implicit Randomness

In this section, we study the effect of the implicit randomness induced by the data-generation distribution to provide a fine-tuning for the Laplace mechanism. We fine-tune the risk for a specified privacy level without assuming that the sensitivity of the query is known a priori.

If one takes into account the randomness induced by the data-generation distribution, all pairs of neighbouring datasets are not equally probable. This leads to the estimation of the sensitivity of a query for a specified data-generation distribution. If we had access to an analytical form of the data-generation distribution and to the query, we could analytically derive the sensitivity distribution for the query. In general, we have access to the datasets, but not to the data-generation distribution that generates them. We, therefore, statistically estimate the sensitivity by constructing an empirical distribution. We call the sensitivity value obtained for a specified risk from the empirical cumulative distribution of the sensitivity the sampled sensitivity (Definition 7). However, the value of the sampled sensitivity is simply an estimate of the sensitivity for a specified risk. In order to capture the additional uncertainty introduced by estimating from the empirical sensitivity distribution rather than the true unknown distribution, we compute a lower bound on the accuracy of this estimation. This lower bound yields a probabilistic lower bound on the specified risk. We refer to it as the empirical risk. For a specified absolute risk γ2, we denote by γ̂2 the corresponding empirical risk.

For the Laplace mechanism L_ε^{Δ_f^S} calibrated with sampled sensitivity Δ_f^S and privacy level ε, we evaluate the empirical risk γ̂2. We present the result in Theorem 5. The proof is available in Appendix C.

Theorem 5. The analytical bound on the empirical risk γ̂2 for a Laplace mechanism L_ε^{Δ_f^S} with privacy level ε and sampled sensitivity Δ_f^S for a query f : D → R^k is

    γ̂2 ≥ γ2 (1 − 2e^{−2ρ²n}),    (5)

where n is the number of samples used for the estimation of the sampled sensitivity and ρ is the accuracy parameter. γ2 denotes the specified absolute risk.

The error parameter ρ controls the closeness between the empirical cumulative distribution of the sensitivity and the true cumulative distribution of the sensitivity. The lower the value of the error, the closer the empirical cumulative distribution is to the true cumulative distribution. Mathematically,

    ρ ≥ sup_Δ | F_S^n(Δ) − F_S(Δ) |,

where F_S^n is the empirical cumulative distribution of the sensitivity after n samples and F_S is the actual cumulative distribution of the sensitivity.
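
The factor (1 − 2e^{−2ρ²n}) in Theorem 5 suggests a back-of-the-envelope way of sizing the sample: for the factor to be at least 1 − α, one needs n ≥ ln(2/α)/(2ρ²). The sketch below computes this; the slack α and the values of ρ are illustrative.

```python
import math

def samples_needed(rho, alpha=0.01):
    """Smallest n with 1 - 2*exp(-2*rho**2*n) >= 1 - alpha (factor in Theorem 5)."""
    return math.ceil(math.log(2.0 / alpha) / (2.0 * rho ** 2))

for rho in (0.1, 0.01, 0.001):
    print(rho, samples_needed(rho))
```
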
Figure 2 shows the plot of the number of samples as a function of the privacy at risk and the error parameter. Naturally, we require a higher number of samples in order to have a lower error rate. The number of samples reduces as the privacy at risk increases. A lower risk demands precision in the estimated sampled sensitivity, which in turn requires a larger number of samples.

If the analytical form of the data-generation distribution is not known a priori, the empirical distribution of the sensitivity can be estimated in two ways. The first way is to fit a known distribution to the available data and later use it to build an empirical distribution of the sensitivities. The second way is to sub-sample from a large dataset in order to build an empirical distribution of the sensitivities. In both of these ways, the empirical distribution of sensitivities captures the inherent randomness in the data-generation distribution. The first way suffers from the goodness of fit of the known distribution to the available data. An ill-fitting distribution does not reflect the true data-generation distribution and hence introduces errors in the sensitivity estimation. Since the second way involves subsampling, it is immune to this problem. The quality of sensitivity estimates obtained by sub-sampling the datasets depends on the availability of a large population.

[Fig. 1. Privacy level ε for varying privacy at risk γ1 for the Laplace mechanism L_{ε0}^{1.0}. In Figure 1a, ε0 = 1.0 and k ∈ {1, 2, 3}; in Figure 1b, k = 1 and ε0 ∈ {1.0, 1.5, 2.0, 2.5}.]

[Fig. 2. Number of samples n for varying privacy at risk γ2 for different error parameters ρ ∈ {0.0001, 0.001, 0.01, 0.1}.]

Let G denote the data-generation distribution, either known a priori or constructed by subsampling the available data. We adopt the procedure of [38] to sample two neighbouring datasets with p data points each. We sample p − 1 data points from G that are common to both of these datasets and later two more data points, independently. From those two points, we allot one data point to each of the two datasets.

Let S_f = ‖f(x) − f(y)‖_1 denote the sensitivity random variable for a given query f, where x and y are two neighbouring datasets sampled from G. Using n pairs of neighbouring datasets sampled from G, we construct the empirical cumulative distribution, F_n, for the sensitivity random variable.

Definition 7. For a given query f and for a specified risk γ2, the sampled sensitivity, Δ_f^S, is defined as the value of the sensitivity random variable that is estimated using its empirical cumulative distribution function, F_n, constructed using n pairs of neighbouring datasets sampled from the data-generation distribution G:

    Δ_f^S ≜ F_n^{−1}(γ2).
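
The estimation behind Definition 7 can be sketched in code using the neighbouring-pair sampling procedure described above; the record sampler, the mean query, and all parameter values below are illustrative.

```python
import numpy as np

def sampled_sensitivity(f, sample_record, gamma2, n, p, seed=0):
    """Estimate Delta_f^S = F_n^{-1}(gamma2) from the empirical CDF of the
    sensitivity random variable S_f = ||f(x) - f(y)||_1."""
    rng = np.random.default_rng(seed)
    sensitivities = np.empty(n)
    for i in range(n):
        common = [sample_record(rng) for _ in range(p - 1)]
        x = np.array(common + [sample_record(rng)])  # neighbouring datasets share
        y = np.array(common + [sample_record(rng)])  # p - 1 records, differ in one
        sensitivities[i] = np.sum(np.abs(np.asarray(f(x)) - np.asarray(f(y))))
    return float(np.quantile(sensitivities, gamma2))  # empirical F_n^{-1}(gamma2)

# Example: mean query over records drawn uniformly from [0, 1].
delta_S = sampled_sensitivity(lambda d: d.mean(), lambda rng: rng.uniform(),
                              gamma2=0.95, n=2000, p=100)
```
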
If we knew the analytical form of the data-generation distribution, we could analytically derive the cumulative distribution function of the sensitivity, F, and find the sensitivity of the query as Δf = F^{−1}(1). Therefore, in order to have the sampled sensitivity close to the sensitivity of the query, we require the empirical cumulative distribution to be close to the cumulative distribution of the sensitivity. We use this insight to derive the analytical bound in Theorem 5.

4.3 The Case of Explicit and Implicit Randomness

In this section, we study the combined effect of the explicit randomness induced by the noise distribution and the implicit randomness induced by the data-generation distribution. We do not assume knowledge of the sensitivity of the query.

We estimate the sensitivity using the empirical cumulative distribution of the sensitivity. We construct the empirical distribution over the sensitivities using the sampling technique presented in the earlier case. Since we use the sampled sensitivity (Definition 7) to calibrate the Laplace mechanism, we estimate the empirical risk γ̂3.

For a Laplace mechanism L_{ε0}^{Δ_f^S} calibrated with sampled sensitivity Δ_f^S and privacy level ε0, we present the analytical bound on the empirical risk γ̂3 in Theorem 6, with the proof in Appendix D.

Theorem 6. The analytical bound on the empirical risk γ̂3 ∈ [0, 1] to achieve a privacy level ε > 0 for a Laplace mechanism L_{ε0}^{Δ_f^S} with sampled sensitivity Δ_f^S of a query f : D → R^k is

    γ̂3 ≥ γ3 (1 − 2e^{−2ρ²n}),    (6)

where n is the number of samples used for estimating the sensitivity and ρ is the accuracy parameter. γ3 denotes the specified absolute risk, defined as:

    γ3 = ( P(T ≤ ε) / P(T ≤ ηε0) ) · γ2.

Here, η is of the order of the ratio of the true sensitivity of the query to its sampled sensitivity.
lytical bound in the Theorem 5.
                                                                                                                                     The error parameter ρ controls the closeness between
                                                                                                                                     the empirical cumulative distribution of the sensitivity
                                                                                                                                     to the true cumulative distribution of the sensitivity.
                                                                                                                                     Figure 3 shows the dependence of the error parameter
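As a concrete reading of Equation 6 and of the definition ∆Sf ≜ Fn⁻¹(γ2), the following minimal sketch (ours, not the authors' code) estimates the sampled sensitivity as the γ2-quantile of an empirical sensitivity distribution and applies the correction factor (1 − 2e^(−2ρ²n)); the synthetic sensitivity samples are purely illustrative placeholders for samples obtained from pairs of neighbouring datasets.

```python
import numpy as np

def sampled_sensitivity(sensitivities, gamma2):
    """Sampled sensitivity Delta_S_f = F_n^{-1}(gamma2): the gamma2-quantile of the
    empirical sensitivity distribution built from n neighbouring-dataset pairs."""
    return np.quantile(np.asarray(sensitivities), gamma2)

def empirical_risk_lower_bound(gamma3, rho, n):
    """Lower bound of Equation 6: gamma3_hat >= gamma3 * (1 - 2 * exp(-2 * rho**2 * n))."""
    return gamma3 * (1.0 - 2.0 * np.exp(-2.0 * rho**2 * n))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sens = rng.exponential(scale=0.01, size=10_000)       # illustrative synthetic sensitivities
    print(sampled_sensitivity(sens, gamma2=0.85))          # empirical gamma2-quantile
    print(empirical_risk_lower_bound(0.9, rho=0.01, n=10_000))  # factor 1 - 2e^{-2}, roughly 0.73
```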
[Figure 3: privacy level ε versus privacy at risk γ3 for the Laplace mechanism L^∆Sf_1.0. Panel (a) varies the error parameter ρ ∈ {0.010, 0.012, 0.015, 0.020}; panel (b) varies the number of samples n ∈ {10000, 15000, 20000, 25000}.]

Fig. 3. Dependence of the error parameter and of the number of samples on the privacy at risk for the Laplace mechanism L^∆Sf_1.0. For Figure 3a, we fix the number of samples to 10000. For Figure 3b, we fix the error parameter to 0.01.

Figure 3 shows the dependence of the privacy at risk on the error parameter and on the number of samples. In Figure 3a, we observe that, for a fixed number of samples and a fixed privacy level, the privacy at risk decreases with the value of the error parameter. For a fixed number of samples, smaller values of the error parameter reduce the probability of similarity between the empirical cumulative distribution of the sensitivity and the true cumulative distribution. Therefore, we observe a reduction in the risk for a fixed privacy level. In Figure 3b, we observe that, for a fixed value of the error parameter and a fixed privacy level, the risk increases with the number of samples. For a fixed value of the error parameter, larger sample sizes increase the probability of similarity between the empirical cumulative distribution of the sensitivity and the true cumulative distribution. Therefore, we observe an increase in the risk for a fixed privacy level.

The effect of considering both implicit and explicit randomness is evident in the analytical expression for γ3 in Equation 7; the proof is available in Appendix D. The privacy at risk is composed of two factors. The second factor, γ2, is the privacy at risk that accounts for the implicit randomness of the data-generation distribution. The first factor accounts for the explicit randomness of the Laplace noise distribution, along with a coupling coefficient η. We define η as the ratio of the true sensitivity of the query to its sampled sensitivity. We provide an approximation to estimate η in the absence of knowledge of the true sensitivity; it can be found in Appendix D.

    γ3 ≜ [P(T ≤ ε) / P(T ≤ ηε0)] · γ2    (7)


5 Minimising Compensation Budget for Privacy at Risk

Many service providers collect users' data to enhance the user experience. In order to avoid misuse of this data, we require a legal framework that not only limits the use of the collected data but also prescribes reparative measures in case of a data leak. The General Data Protection Regulation (GDPR)⁴ is such a legal framework.

Article 82 of the GDPR states that any person who suffers material or non-material damage as a result of a personal data breach has the right to demand compensation from the data processor. Therefore, every GDPR-compliant business entity that either holds or processes personal data needs to secure a certain budget for the scenario of a personal data breach. In order to reduce the risk of such an unfortunate event, the business entity may use privacy-preserving mechanisms that provide provable privacy guarantees while publishing its results. In order to calculate the compensation budget for a business entity, we devise a cost model that maps the privacy guarantees provided by differential privacy and privacy at risk to monetary costs. The discussion demonstrates the usefulness of a probabilistic quantification of differential privacy in a business setting.

4 https://gdpr-info.eu/
5.1 Cost Model for Differential Privacy

Let E be the compensation budget that a business entity has to pay to every stakeholder in case of a personal data breach when the data is processed without any provable privacy guarantees. Let Eεdp be the compensation budget that a business entity has to pay to every stakeholder in case of a personal data breach when the data is processed with privacy guarantees in terms of ε-differential privacy.

The privacy level ε in ε-differential privacy quantifies the indistinguishability of the outputs of a privacy-preserving mechanism when two neighbouring datasets are provided as inputs. When the privacy level is zero, the privacy-preserving mechanism outputs all results with equal probability. The indistinguishability reduces as the privacy level increases. Thus, a privacy level of zero bears the lowest risk of a personal data breach, and the risk increases with the privacy level. Eεdp needs to be commensurate with this risk and, therefore, needs to satisfy the following constraints.
1. For all ε ∈ R≥0, Eεdp ≤ E.
2. Eεdp is a monotonically increasing function of ε.
3. As ε → 0, Eεdp → Emin, where Emin is the unavoidable cost that the business entity might need to pay in case of a personal data breach even after the privacy measures are employed.
4. As ε → ∞, Eεdp → E.

There are various functions that satisfy these constraints. In the absence of any further constraints, we model Eεdp as defined in Equation 8:

    Eεdp ≜ Emin + E·e^(−c/ε)    (8)

Eεdp has two parameters, namely c > 0 and Emin ≥ 0. c controls the rate of change in the cost as the privacy level changes, and Emin is a privacy-level-independent bias. For this study, we use a simplified model with c = 1 and Emin = 0.

5.2 Cost Model for Privacy at Risk

Let Eε0par(ε, γ) be the compensation that a business entity has to pay to every stakeholder in case of a personal data breach when the data is processed with an ε0-differentially private privacy-preserving mechanism along with a probabilistic quantification of the privacy level. Use of such a quantification allows us to provide a stronger privacy guarantee, viz. ε < ε0, for a specified privacy at risk of at most γ. Thus, we calculate Eε0par using Equation 9:

    Eε0par(ε, γ) ≜ γ·Eεdp + (1 − γ)·Eε0dp    (9)

Note that the analysis in this section is specific to the cost model in Equation 8. It naturally extends to any choice of convex cost model.

5.2.1 Existence of Minimum Compensation Budget

We want to find the privacy level, say εmin, that yields the lowest compensation budget. We do that by minimising Equation 9 with respect to ε.

Lemma 4. For the choice of cost model in Equation 8, Eε0par(ε, γ) is a convex function of ε.

By Lemma 4, there exists a unique εmin that minimises the compensation budget for a specified parametrisation, say ε0. Since the risk γ in Equation 9 is itself a function of the privacy level ε, an analytical calculation of εmin is not possible in the most general case. When the output of the query is a real number, i.e. k = 1, we derive the analytic form (Equation 4) to compute the risk under the consideration of explicit randomness. In such a case, εmin is calculated by differentiating Equation 9 with respect to ε and equating it to zero. This gives us Equation 10, which we solve using any root-finding technique, such as the Newton-Raphson method [37], to compute εmin:

    1/ε − ln(1 − (1 − e^ε)/ε²) = 1/ε0    (10)

5.2.2 Fine-tuning Privacy at Risk

For a fixed budget, say B, re-arrangement of Equation 9 gives us an upper bound on the privacy level ε. We use the cost model with c = 1 and Emin = 0 to derive the upper bound: setting B = γE·e^(−1/ε) + (1 − γ)Eε0dp and solving for ε yields it. If we have a maximum permissible expected mean absolute error T, we use Equation 12 to obtain a lower bound on the privacy at risk level. Equation 11 states the upper and lower bounds that dictate the permissible range of ε that a data publisher can promise depending on the budget and the permissible error constraints:

    1/T ≤ ε ≤ (ln[γE / (B − (1 − γ)Eε0dp)])⁻¹    (11)

Thus, the privacy level is constrained by the effectiveness requirement from below and by the monetary budget from above.
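Equation 10 has no closed-form solution, but it is straightforward to solve numerically. The following sketch (ours, not the authors' implementation) evaluates the simplified cost model of Equation 8 and locates εmin by bisection; the example numbers (E = $5500, 100 stakeholders, ε0 = 0.5) anticipate the illustration in Section 5.3 below.

```python
import math

def dp_cost(eps, E, E_min=0.0, c=1.0):
    """Per-stakeholder cost model of Equation 8: E_eps^dp = E_min + E * exp(-c / eps)."""
    return E_min + E * math.exp(-c / eps)

def eq10_lhs(eps):
    """Left-hand side of Equation 10: 1/eps - ln(1 - (1 - e^eps) / eps^2)."""
    return 1.0 / eps - math.log(1.0 - (1.0 - math.exp(eps)) / eps**2)

def eps_min(eps0, lo=1e-3, hi=10.0, tol=1e-9):
    """Solve Equation 10, eq10_lhs(eps) = 1/eps0, by bisection.
    eq10_lhs decreases in eps on this interval, so the bracket contains the root."""
    target = 1.0 / eps0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if eq10_lhs(mid) > target:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

print(round(eps_min(0.5), 3))               # roughly 0.274, as reported in Section 5.3
print(round(100 * dp_cost(0.5, 5500), 2))   # baseline budget for 100 stakeholders, roughly $74,434
```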
The authors of [19] calculate upper and lower bounds on the privacy level for differential privacy. They use a different cost model owing to their scenario of a research study that compensates its participants for their data and releases the results in a differentially private manner. Their cost model is different from our GDPR-inspired modelling.

Fig. 4. Variation in the budget for the Laplace mechanism L^1_ε0 under privacy at risk, considering explicit randomness in the Laplace mechanism, for the illustration in Section 5.3. The budget Bpar (in dollars) is plotted against the privacy level ε for ε0 ∈ {0.5, 0.6, 0.7}.

5.3 Illustration

Suppose that the health centre in a university that complies with GDPR publishes statistics of its staff health checkup, such as obesity statistics, twice a year. In January 2018, the health centre publishes that 34 out of 99 faculty members suffer from obesity. In July 2018, the health centre publishes that 35 out of 100 faculty members suffer from obesity. An intruder, perhaps an analyst working for an insurance company, checks the staff listings in January 2018 and July 2018, which are publicly available on the website of the university. The intruder does not find any change other than the recruitment of John Doe in April 2018. Thus, with high probability, the intruder deduces that John Doe suffers from obesity. In order to avoid such a privacy breach, the health centre decides to publish the results using the Laplace mechanism. In this case, the Laplace mechanism operates on the count query.

In order to control the amount of noise, the health centre needs to appropriately set the privacy level. Suppose that the health centre decides to use the expected mean absolute error, defined in Equation 12, as the measure of effectiveness for the Laplace mechanism:

    E[|L^1_ε(x) − f(x)|] = 1/ε    (12)

Equation 12 makes use of the fact that the sensitivity of the count query is one. Suppose that the health centre requires an expected mean absolute error of at most two in order to maintain the quality of the published statistics. In this case, the privacy level has to be at least 0.5.

In order to compute the budget, the health centre requires an estimate of E. Moriarty et al. [30] show that the incremental cost of premiums for health insurance with morbid obesity ranges between $5467 and $5530. With reference to this research, the health centre takes $5500 as an estimate of E. For the staff size of 100 and the privacy level 0.5, the health centre uses Equation 8 in its simplified setting to compute the total budget of $74434.40.

Is it possible to reduce this budget without degrading the effectiveness of the Laplace mechanism? We show that it is possible by fine-tuning the Laplace mechanism. Under the consideration of the explicit randomness introduced by the Laplace noise distribution, we show that an ε0-differentially private Laplace mechanism also satisfies ε-differential privacy with risk γ, which is computed using the formula in Theorem 4. Fine-tuning allows us to obtain a stronger privacy guarantee, ε < ε0, that requires a smaller budget. In Figure 4, we plot the budget for various privacy levels. We observe that the privacy level 0.274, which is the same as the εmin computed by solving Equation 10, yields the lowest compensation budget of $37805.86. Thus, by using privacy at risk, the health centre is able to save $36628.54 without sacrificing the quality of the published results.

5.4 Cost Model and the Composition of Laplace Mechanisms

Convexity of the proposed cost function enables us to estimate the optimal value of the privacy at risk level. We use the optimal privacy value to provide tighter bounds on the composition of Laplace mechanisms. In Figure 5, we compare the privacy guarantees obtained by using the basic composition theorem [12], the advanced composition theorem [12], and the composition theorem for privacy at risk. We comparatively evaluate them for compositions of Laplace mechanisms with privacy levels 0.1, 0.5 and 1.0. We compute the privacy level after composition by setting δ to 10⁻⁵.
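The two baseline bounds in this comparison are standard. The following sketch (ours, not from the paper) evaluates them for n-fold composition of an ε-differentially private mechanism: the basic composition nε and the advanced composition of [12], √(2n ln(1/δ))·ε + nε(e^ε − 1). The composition theorem for privacy at risk is stated earlier in the paper and is not reproduced here.

```python
import math

def basic_composition(eps, n):
    """Basic composition: n-fold composition of eps-DP mechanisms is (n * eps)-DP."""
    return n * eps

def advanced_composition(eps, n, delta):
    """Advanced composition [12] for pure eps-DP mechanisms:
    eps' = sqrt(2 n ln(1/delta)) * eps + n * eps * (e^eps - 1), valid with failure probability delta."""
    return math.sqrt(2 * n * math.log(1.0 / delta)) * eps + n * eps * (math.exp(eps) - 1.0)

# The regimes visible in Figure 5 (delta = 1e-5, n = 300 compositions):
for eps0 in (0.1, 0.5, 1.0):
    print(eps0, basic_composition(eps0, 300), round(advanced_composition(eps0, 300, 1e-5), 2))
# For eps0 = 0.1 the advanced bound (about 11.5) beats the basic one (30);
# for eps0 = 1.0 it is worse (about 598 versus 300), the regime where the
# advanced composition fails while composition with privacy at risk still helps.
```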
[Figure 5: three panels plotting the privacy level after composition (ε′) against the number of compositions (n), comparing basic composition, advanced composition, and composition with privacy at risk, for ε0 = 0.10, 0.50 and 1.00 with δ = 10⁻⁵.]

(a) L^1_0.1 satisfies (0.08, 0.80)-privacy at risk. (b) L^1_0.5 satisfies (0.27, 0.61)-privacy at risk. (c) L^1_1.0 satisfies (0.42, 0.54)-privacy at risk.

Fig. 5. Comparing the privacy guarantee obtained by basic composition and advanced composition [12] with the composition obtained using the optimal privacy at risk that minimises the cost of the Laplace mechanism L^1_ε0. For the evaluation, we set δ = 10⁻⁵.

We observe that the use of the optimal privacy at risk provides significantly stronger privacy guarantees than the conventional composition theorems. The advanced composition theorem is known to provide stronger privacy guarantees for mechanisms with smaller ε. As we observe in Figure 5b and Figure 5c, the composition with privacy at risk provides strictly stronger privacy guarantees than basic composition even in the cases where the advanced composition fails.

Comparison with the Moment Accountant

Papernot et al. [33, 34] empirically showed that the privacy guarantees provided by the advanced composition theorem are quantitatively worse than the ones achieved by the state-of-the-art moment accountant [1]. The moment accountant evaluates the privacy guarantee by keeping track of various moments of the privacy loss random variables. The computation of the moments is performed using numerical methods on the specified dataset. Therefore, despite the quantitative strength of the privacy guarantee provided by the moment accountant, it is qualitatively weaker, in the sense that it is specific to the dataset used for evaluation, in contrast to the advanced composition.

Papernot et al. [33] introduced the PATE framework, which uses the Laplace mechanism to provide privacy guarantees for a machine learning model trained in an ensemble manner. We comparatively evaluate the privacy guarantees provided by their moment accountant on the MNIST dataset with the privacy guarantees obtained using privacy at risk. We do so by using privacy at risk while computing a data-dependent bound [33, Theorem 3]. Under the identical experimental setup, we use a 0.1-differentially private Laplace mechanism, which optimally satisfies (0.08, 0.8)-privacy at risk. We list the calculated privacy guarantees in Table 1. The reported privacy guarantee is the mean privacy guarantee over 30 experiments.


6 Balancing Utility and Privacy

In this section, we empirically illustrate and discuss the steps that a data steward needs to take and the issues that she needs to consider in order to realise a required privacy at risk level ε for a confidence level γ when seeking to disclose the result of a query.

We consider a query that returns the parameters of a ridge regression [31] for an input dataset. It is a basic and widely used statistical analysis tool. We use the privacy-preserving mechanism presented by Ligett et al. [26] for ridge regression. It is a Laplace mechanism that induces noise in the output parameters of the ridge regression. The authors provide a theoretical upper bound on the sensitivity of the ridge regression, which we refer to as the sensitivity in the experiments.

6.1 Dataset and Experimental Setup

We conduct experiments on a subset of the 2000 US census dataset provided by the Minnesota Population Center in its Integrated Public Use Microdata Series [39]. The census dataset consists of a 1% sample of the original census data. It spans over 1.23 million households with records of 2.8 million people. The values of several attributes are not necessarily available for every household.
                            Privacy level for moment accountant (ε)
 δ        #Queries         with differential privacy [33]     with privacy at risk
 10⁻⁵     100              2.04                               1.81
 10⁻⁵     1000             8.03                               5.95

Table 1. Comparative analysis of privacy levels computed using three composition theorems when applied to the 0.1-differentially private Laplace mechanism, which optimally satisfies (0.08, 0.8)-privacy at risk. The observations for the moment accountant on the MNIST dataset are taken from [33].

We have therefore selected 212,605 records, corresponding to the household heads, and six attributes, namely Age, Gender, Race, Marital Status, Education and Income, whose values are available for all 212,605 records.

In order to satisfy the constraint in the derivation of the sensitivity of ridge regression [26], we, without loss of generality, normalise the dataset in the following way. We normalise the Income attribute such that its values lie in [0, 1]. We normalise the other attributes such that the l2 norm of each data point is unity.

All experiments are run on a Linux machine with a 12-core 3.60GHz Intel® Core i7™ processor and 64GB of memory. Python® 2.7.6 is used as the scripting language.

6.2 Result Analysis

We train a ridge regression model to predict Income using the other attributes as predictors. We split the dataset into a training dataset (80%) and a testing dataset (20%). We compute the root mean squared error (RMSE) of the ridge regression, trained on the training data with the regularisation parameter set to 0.01, on the testing dataset. We use it as the metric of utility loss: the smaller the RMSE, the smaller the loss in utility. For a given value of the privacy at risk level, we run 50 experiments of differentially private ridge regression and report the means over the 50 runs.

Let us now provide illustrative experiments under the three different cases. In every scenario, the data steward is given a privacy at risk level ε and the confidence level γ and wants to disclose the parameters of a ridge regression model that she trains on the census dataset. She needs to calibrate the Laplace mechanism by estimating either its privacy level ε0 (Case 1), its sensitivity (Case 2), or both (Case 3) to achieve the privacy at risk required for the ridge regression query.

The Case of Explicit Randomness (cf. Section 4.1). In this scenario, the data steward knows the sensitivity of the ridge regression. She needs to compute the privacy level, ε0, to calibrate the Laplace mechanism. She uses Equation 3, which links the desired privacy at risk level ε, the confidence level γ1 and the privacy level of noise ε0. Specifically, for given ε and γ1, she computes ε0 by solving the equation:

    γ1 P(T ≤ ε0) − P(T ≤ ε) = 0.

Since the equation does not give an analytical formula for ε0, the data steward uses a root-finding algorithm such as the Newton-Raphson method [37] to solve the above equation. For instance, if she needs to achieve a privacy at risk level ε = 0.4 with confidence level γ1 = 0.6, she can substitute these values in the above equation and solve it to get the privacy level of noise ε0 = 0.8.

Figure 6 shows the variation of the privacy at risk level ε with the confidence level γ1. It also depicts the variation of the utility loss for different privacy at risk levels. In accordance with the data steward's problem, if she needs to achieve a privacy at risk level ε = 0.4 with confidence level γ1 = 0.6, she obtains the privacy level of noise ε0 = 0.8. Additionally, we observe that the choice of privacy level 0.8 instead of 0.4 to calibrate the Laplace mechanism gives a lower utility loss for the data steward. This is the benefit drawn from the risk taken under the control of privacy at risk.

Thus, she uses the privacy level ε0 and the sensitivity of the function to calibrate the Laplace mechanism.

The Case of Implicit Randomness (cf. Section 4.2). In this scenario, the data steward does not know the sensitivity of the ridge regression. She assesses that she can afford to sample at most n times from the population dataset. She understands the effect of the uncertainty introduced by the statistical estimation of the sensitivity. Therefore, she uses the confidence level for the empirical privacy at risk, γ̂2.

Given the value of n, she chooses the value of the accuracy parameter using Figure 2. For instance, if the number of samples that she can draw is 10⁴, she chooses the value of the accuracy parameter ρ = 0.01. Next, she uses Equation 13 to determine the value of the probabilistic tolerance, α, for the sample size n.
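The calibration step in the explicit-randomness case above reduces to one-dimensional root finding. A minimal sketch follows (ours, not the authors' code); it uses a bracketing solver instead of Newton-Raphson for robustness, and the helper cdf_T, which evaluates P(T ≤ ·), is an assumed input whose analytical form is given by Equation 3 / Theorem 4 of the paper and is not reproduced here.

```python
from scipy.optimize import brentq

def calibrate_noise_level(eps, gamma1, cdf_T, lo=1e-6, hi=50.0):
    """Solve gamma1 * P(T <= eps0) - P(T <= eps) = 0 for the noise level eps0.

    cdf_T(x) must return P(T <= x); it is an assumed helper here.
    The bracket [lo, hi] must contain a sign change of the equation."""
    equation = lambda eps0: gamma1 * cdf_T(eps0) - cdf_T(eps)
    return brentq(equation, lo, hi)

# With eps = 0.4 and gamma1 = 0.6, the walkthrough above reports a calibrated
# noise level of eps0 = 0.8 for the ridge regression query.
```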
Fig. 6. Utility, measured by RMSE (right y-axis), and privacy at risk level ε for the Laplace mechanism (left y-axis) for varying confidence levels γ1.

Fig. 7. Empirical cumulative distribution of the sensitivities of ridge regression queries constructed using 15000 samples of neighbouring datasets.

For instance, if the data steward is not allowed to access more than 15,000 samples, then for an accuracy parameter of 0.01 the probabilistic tolerance is 0.9:

    α = 1 − 2e^(−2ρ²n)    (13)

She constructs an empirical cumulative distribution over the sensitivities as described in Section 4.2. Such an empirical cumulative distribution is shown in Figure 7. Using the computed probabilistic tolerance and the desired confidence level γ̂2, she uses the equation in Theorem 5 to determine γ2. She computes the sampled sensitivity ∆Sf using the empirical distribution function and the confidence level for privacy at risk γ2. For instance, using the empirical cumulative distribution in Figure 7, she calculates the value of the sampled sensitivity to be approximately 0.001 for γ2 = 0.4 and approximately 0.01 for γ2 = 0.85.

Thus, she uses the privacy level ε, sets the number of samples to n and computes the sampled sensitivity ∆Sf to calibrate the Laplace mechanism.

The Case of Explicit and Implicit Randomness (cf. Section 4.3). In this scenario, the data steward does not know the sensitivity of the ridge regression. She is not allowed to sample more than n times from a population dataset. For a given confidence level γ2 and the privacy at risk level ε, she calibrates the Laplace mechanism using the illustration for Section 4.3. The privacy level in this calibration yields a utility loss that is more than her requirement. Therefore, she wants to re-calibrate the Laplace mechanism in order to reduce the utility loss.

For the re-calibration, the data steward uses the privacy level of the pre-calibrated Laplace mechanism, i.e. ε, as the privacy at risk level, and she provides a new confidence level for the empirical privacy at risk, γ̂3. Using Equation 25 and Equation 23, she calculates:

    γ̂3 P(T ≤ ηε0) − αγ2 P(T ≤ ε) = 0

She solves this equation for ε0 using a root-finding technique such as the Newton-Raphson method [37]. For instance, if she needs to achieve a privacy at risk level ε = 0.4 with confidence levels γ̂3 = 0.9 and γ2 = 0.9, she can substitute these values, together with the values of the tolerance parameter and the sampled sensitivity used in the previous experiments, in the above equation. Solving the equation then leads to the privacy level of noise ε0 = 0.8.

Thus, she re-calibrates the Laplace mechanism with privacy level ε0, sets the number of samples to n and uses the sampled sensitivity ∆Sf.


7 Related Work

Calibration of Mechanisms. Researchers have proposed different privacy-preserving mechanisms to make different queries differentially private. These mechanisms can be broadly classified into two categories. In one category, the mechanisms explicitly add calibrated noise, such as Laplace noise in the work of [11] or Gaussian noise in the work of [12], to the outputs of the query. In the other category, [2, 6, 17, 41] propose mechanisms that alter the query function so that the modified function satisfies differential privacy. Privacy-preserving mechanisms in both of these categories perturb the original output of the query and make it difficult for a malicious data analyst to recover the original output of the query. These mechanisms induce randomness us-