L8: Introduction to privacy-preserving computations - Privacy-preserving Technologies / LTAT.04.007
Dan Bogdanov (dan.bogdanov@cyber.ee)
Using privacy technologies to solve it
Source data: 10 million tax records, 600 000 education records.
Each record uploaded using secret sharing (think: "encryption").
Records linked and processed using secure multi-party computation (think: "data not decrypted for processing").
Data never existed outside the source in an unencrypted state.
Solution based on Sharemind MPC.
[Figure labels: Ministry of Education and Research (education records); Estonian Tax and Customs Board (employment tax records); Estonian Information System's Authority; Ministry of Finance IT Center; Cybernetica.]
[Figure: data flow of the study.
Tax and Customs Board: extract data on employment tax payments, aggregate by month, secret share and upload. The shared data is then expanded by years and aggregated by person into the employment tax payments record of a person (monthly income, average yearly income).
Ministry of Education and Science: extract data on higher study events, secret share and upload. The shared data is aggregated by person into the university career of a person.
The two records are merged by person's ID into the complete record of a person. Additional attributes are computed and tax payments are aligned to produce the analysis table.
The analysis table is aggregated by year into analysis results, which the statistical analyst recovers from shares.]
Data stored with secret sharing and processed with secure multi-party computation.
Sharemind-powered Analytics
Data scientists used analytics tools based on secure multi-party computation.
The MPC system also prevented queries outside the study plan.
Reports were given to industry, universities and the government.
Result: no clear relation between working during studies and not graduating.
[Figure labels: Data; Estonian Information System's Authority; Ministry of Finance IT Center; Cybernetica; Analyst; Universities; Companies; Policymakers.]
A privacy-preserving statistics tool inspired by R
Non-IT graduation rate is around 40%.
IT graduation rate is around 20%.
[Figure 1. Share of students graduating within the nominal study period, by year of matriculation, ICT and non-ICT curricula, bachelor's studies.]
Non-IT and IT students have similar employment ratios, but IT students lost more in the financial crisis.
[Figure 4. Share of students who worked during the nominal study period among all students, by year, ICT and non-ICT curricula, bachelor's studies.]
DATA-DRIVEN SERVICES ON CONFIDENTIAL DATA
Regulatory status of the project
In an official response, after a study of the system, the Estonian DPA
suggested that
neither the hosts of the servers running the statistics
nor the analysts making the queries
could feasibly re-identify individuals in the source database (this was pre-GDPR).
The Internal Supervision Department of the Tax and Customs Board agreed
to provide unmodified tax records after a code and process review.
A follow-up legal review in the FP7 PRACTICE project by a researcher from the
University of Göttingen suggested that the same precedent could hold under
GDPR as well.
A general model for privacy-preserving computing
Concept of secure computing
When a standard computer encrypts data, it must be decrypted before analysis.
Secure computing systems can analyze data without removing the encryption.
[Figure: an encrypted database analyzed with standard tools versus with secure computing.]
Extended definition of secure multi-party computation
Input parties: IP_1, ..., IP_k, where IP_i holds the input x_i.
Computing parties: CP_1, ..., CP_l, where CP_j holds the pieces x_1j, ..., x_kj of the inputs.
Result parties: RP_1, ..., RP_m, which receive the outputs y_1, ..., y_m.
Step 1: upload and storage of inputs.
Step 2: secure computing.
Step 3: publishing of results.
Technique: property-preserving cryptography
Analogy: symmetric crypto that preserves a
relation on inputs (e.g., order, equality).
Pros:
Low performance overhead.
Fits well into existing systems.
Cons:
Only allows a few operations (e.g., only
equality comparison or ordering).
Multi-user systems are a challenge, but can
be done with proxy re-encryption.
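A minimal sketch of the equality-preserving idea, assuming a keyed HMAC as a stand-in for a property-preserving scheme. HMAC is a one-way keyed function rather than a decryptable cipher, and the key, names and sample values here are hypothetical. The point it illustrates is only the relation: equal inputs give equal tokens, so a server can match records without seeing the values, but it also learns which records are equal.

```python
import hashlib
import hmac


def equality_token(key: bytes, value: str) -> str:
    """Deterministic keyed tag: equal plaintexts always map to equal tokens."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical demo key and data; a real deployment needs proper key management.
key = b"demo-key-not-for-production"
stored = [equality_token(key, name) for name in ["alice", "bob", "alice"]]

# The server can answer an equality query by comparing tokens only.
query = equality_token(key, "alice")
matches = [i for i, tok in enumerate(stored) if hmac.compare_digest(tok, query)]
print(matches)
```

Only equality can be tested on these tokens, which mirrors the limitation listed under the cons.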
Technique: homomorphic encryption
Analogy: asymmetric crypto that allows
addition and multiplication of ciphertexts.
Pros:
Fits well into existing systems.
Cons:
High performance overhead.
Multi-user systems are a challenge, but can
be done with proxy re-encryption.
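A toy sketch of computing on ciphertexts, assuming the Paillier scheme as the example. Paillier is only additively homomorphic: it supports adding two encrypted values and multiplying an encrypted value by a public constant, not multiplying two ciphertexts, so it illustrates just part of what the slide describes. The primes are tiny and chosen for readability; this is not a secure parameter choice.

```python
import math
import secrets


def keygen(p: int, q: int):
    """Toy Paillier key generation with generator g = n + 1."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse, valid because g = n + 1
    return n, (lam, mu)


def encrypt(n: int, m: int) -> int:
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1  # random r in [1, n-1]
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2


def decrypt(n: int, priv, c: int) -> int:
    lam, mu = priv
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n


def add(n: int, c1: int, c2: int) -> int:
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return (c1 * c2) % (n * n)


def scale(n: int, c: int, k: int) -> int:
    """Raising a ciphertext to a public constant k multiplies the plaintext by k."""
    return pow(c, k, n * n)


n, priv = keygen(1009, 1013)  # toy primes, far too small for real use
c1, c2 = encrypt(n, 20), encrypt(n, 22)
total = decrypt(n, priv, add(n, c1, c2))
tripled = decrypt(n, priv, scale(n, c1, 3))
print(total, tripled)
```

Every operation is a modular exponentiation over numbers twice the key length, which gives a feel for the performance overhead listed under the cons.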
Technique: garbled circuits
Analogy: cryptographic versions of electrical
circuits.
Pros:
Flexible programming model.
Cons:
Medium performance overhead.
Fixed number of parties (can be solved by
combining with other techniques).
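A minimal sketch of one garbled gate, to make the circuit analogy concrete. It assumes SHA-256 as the row-encryption function and an all-zero tag to recognize the correct row; both are simplifications chosen for this illustration. The oblivious transfer step, by which the evaluator would obtain the labels for its own input, is left out entirely, and the function names are hypothetical.

```python
import hashlib
import secrets

LABEL_LEN = 16
PAD = bytes(LABEL_LEN)  # all-zero tag that marks a correctly decrypted row


def _mask(ka: bytes, kb: bytes) -> bytes:
    """Derive a 32-byte one-time mask from a pair of input wire labels."""
    return hashlib.sha256(ka + kb).digest()


def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def garble_gate(fn):
    """Garble a two-input gate given its truth table as a function on bits."""
    a = [secrets.token_bytes(LABEL_LEN) for _ in range(2)]    # labels for wire a
    b = [secrets.token_bytes(LABEL_LEN) for _ in range(2)]    # labels for wire b
    out = [secrets.token_bytes(LABEL_LEN) for _ in range(2)]  # labels for output
    table = [
        _xor(_mask(a[x], b[y]), out[fn(x, y)] + PAD)
        for x in (0, 1)
        for y in (0, 1)
    ]
    secrets.SystemRandom().shuffle(table)  # hide which row is which
    return a, b, out, table


def evaluate(table, ka: bytes, kb: bytes) -> bytes:
    """Holding one label per input wire, recover exactly one output label."""
    mask = _mask(ka, kb)
    for row in table:
        plain = _xor(mask, row)
        if plain[LABEL_LEN:] == PAD:
            return plain[:LABEL_LEN]
    raise ValueError("no row decrypted with these labels")


# One-bit "greater than": a > b is the gate a AND (NOT b).
a, b, out, table = garble_gate(lambda x, y: x & (1 - y))
results = {(x, y): evaluate(table, a[x], b[y]) == out[1] for x in (0, 1) for y in (0, 1)}
print(results)
```

The evaluator sees only random-looking labels and learns the output label for the one input combination it holds labels for. A comparison of multi-bit numbers would chain many such gates.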
Example: millionaire's problem
Technique: secret sharing
Analogy: give a number of people a random
piece of each secret value and let them
collaborate to compute results.
Pros:
Low-to-medium performance overhead.
Flexible programming model.
Cons:
Distributed deployments do not fit into all
existing systems.
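A minimal sketch of the analogy, assuming additive secret sharing modulo 2^32 among three parties. The modulus, party count and sample values are illustrative choices, not taken from the slides. Each share on its own is a uniformly random number, yet the parties can add shared values locally without communicating.

```python
import secrets

MODULUS = 2**32  # assumed ring size for this sketch


def share(value: int, n_parties: int = 3) -> list[int]:
    """Split a value into random pieces that sum to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def reconstruct(shares: list[int]) -> int:
    """Only the combination of all pieces reveals the value."""
    return sum(shares) % MODULUS


def add_shared(a: list[int], b: list[int]) -> list[int]:
    """Each party adds its own two pieces; no communication is needed."""
    return [(x + y) % MODULUS for x, y in zip(a, b)]


salary_a = share(1200)
salary_b = share(800)
total = reconstruct(add_shared(salary_a, salary_b))
print(total)
```

Addition is free in this representation. Multiplication and comparison need interactive protocols between the parties, which is where most of the performance overhead comes from.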
Technique: trusted execution environments
Analogy: think of a computer process that hides the data from its owner.
Pros:
Minimal performance overhead.
Relatively easy to convert applications to work on trusted execution environments.
Cons:
Side-channel attack mitigations are complicated to implement.
Lecture exercise: modelling parties for a COVID-19 social distancing tracking application
Lecture task
Think of an application that would support social distancing and limit infection
rates. Write down very clearly what the expected benefit of the system is.
Write down the list of input parties and the data they would provide.
Write down the list of computing parties and describe the kind of processing they
would perform.
Write down the list of result parties and describe the outputs they would receive.
Bonus tasks, time permitting:
Think of minimizing personal data processing using process redesign.
See if any of the secure computing paradigms described above could support
your application.
Prepare in 12 minutes and then we’ll have 1-2 students present their ideas.
Programmable privacy-preserving computations
PDK as an abstraction of a secure computing paradigm
A protection domain kind (PDK) is a set of data representations, algorithms
and protocols for storing and computing on protected data.
Examples:
SMC based on secret sharing,
SMC based on garbled circuits,
(fully) homomorphic encryption,
trusted hardware (e.g., Intel SGX).
Protection domain as an instance of a PDK
A protection domain (PD) is a set of data that is protected with the same
resources and for which there is a well-defined set of algorithms and
protocols for computing on that data while keeping the protection.
Examples:
data held by a fixed group of servers performing secure multi-party computation,
data encrypted under a fixed key of a homomorphic encryption scheme.
Application model for privacy-preserving computing
The model has three layers: secure primitive operations, privacy-preserving algorithms, and application logic.
Secure primitive operations:
• private outputs from private inputs,
• have privacy proofs,
• remain private under sequential or parallel composition,
• optimized to have a low resource footprint.
Privacy-preserving algorithms:
• publish selected results to make system useful,
• do not leak private inputs or show leakage as acceptable,
• compositions of secure primitive operations,
• optimize for running time.
Converting an algorithm to a privacy-preserving one
We pick frequent itemset mining as a problem of choice.
Frequent itemset mining is a data mining problem that helps with shopping
basket analysis and the simplest kinds of recommender systems.
What kind of things do people buy from stores together most often?
If the service provider knows this, they can recommend one to a customer who is
planning to buy the other.
The simpler algorithms include Apriori (breadth-first search) and Eclat (depth-
first search).
We will now look at the basic primitive of frequent itemset mining and then
build a privacy-preserving approach.
Privacy-preserving data representations
Private data representations are the key toward designing privacy-preserving algorithms.
Transactions as item lists:
t1: rendang, nasi lemak, chicken satay
t2: nasi lemak, lontong
t3: chicken satay
t4: rendang, nasi lemak
t5: nasi lemak, chicken satay
t6: nasi lemak, chicken satay
t7: lontong
The same transactions as a bit matrix (1 = the item is in the transaction):
      rendang  nasi lemak  lontong  chicken satay
t1    1        1           0        1
t2    0        1           1        0
t3    0        0           0        1
t4    1        1           0        0
t5    0        1           0        1
t6    0        1           0        1
t7    0        0           1        0
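The conversion from item lists to the bit matrix can be sketched as follows. This is a plain, unprotected illustration; in a deployment each bit would be secret shared or encrypted before upload. The variable and function names are hypothetical.

```python
ITEMS = ["rendang", "nasi lemak", "lontong", "chicken satay"]

TRANSACTIONS = {
    "t1": {"rendang", "nasi lemak", "chicken satay"},
    "t2": {"nasi lemak", "lontong"},
    "t3": {"chicken satay"},
    "t4": {"rendang", "nasi lemak"},
    "t5": {"nasi lemak", "chicken satay"},
    "t6": {"nasi lemak", "chicken satay"},
    "t7": {"lontong"},
}


def to_bit_rows(transactions, items):
    """Encode each transaction as one bit per known item, in a fixed item order."""
    return {
        tid: [1 if item in basket else 0 for item in items]
        for tid, basket in transactions.items()
    }


rows = to_bit_rows(TRANSACTIONS, ITEMS)
for tid, bits in rows.items():
    print(tid, bits)
```

Every row has the same length regardless of how many items the basket holds, so the shape of the stored data does not reveal basket sizes.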
Calculating the support of an item
The data representation allows for very efficient calculation of item supports: the support of an item is the sum of its column.
      rendang  nasi lemak  lontong  chicken satay
t1    1        1           0        1
t2    0        1           1        0
t3    0        0           0        1
t4    1        1           0        0
t5    0        1           0        1
t6    0        1           0        1
t7    0        0           1        0
Support of nasi lemak: Σ = 5
Calculating support for a set of items
Checking the joint support of a pair of items simply requires a multiplication.
      nasi lemak  chicken satay  nasi lemak & chicken satay
t1    1           1              1 x 1 = 1
t2    1           0              1 x 0 = 0
t3    0           1              0 x 1 = 0
t4    1           0              1 x 0 = 0
t5    1           1              1 x 1 = 1
t6    1           1              1 x 1 = 1
t7    0           0              0 x 0 = 0
Joint support: Σ = 3
Evaluating itemsets with a depth-first strategy
Depth-first search would be intuitive for pruning.
[Figure: the itemset lattice.]
Size 1: {rendang}, {nasi lemak}, {lontong}, {chicken satay}
Size 2: {rendang, nasi lemak}, {rendang, lontong}, {rendang, chicken satay}, {nasi lemak, lontong}, {nasi lemak, chicken satay}, {lontong, chicken satay}
Size 3: {rendang, nasi lemak, lontong}, {rendang, nasi lemak, chicken satay}, {rendang, lontong, chicken satay}, {nasi lemak, lontong, chicken satay}
Size 4: {rendang, nasi lemak, lontong, chicken satay}
Evaluating itemsets with a breadth-first strategy
However, breadth-first search can be done in parallel.
[Figure: the same itemset lattice, evaluated one level at a time.]
Size 1: {rendang}, {nasi lemak}, {lontong}, {chicken satay}
Size 2: {rendang, nasi lemak}, {rendang, lontong}, {rendang, chicken satay}, {nasi lemak, lontong}, {nasi lemak, chicken satay}, {lontong, chicken satay}
Size 3: {rendang, nasi lemak, lontong}, {rendang, nasi lemak, chicken satay}, {rendang, lontong, chicken satay}, {nasi lemak, lontong, chicken satay}
Size 4: {rendang, nasi lemak, lontong, chicken satay}
Balancing optimizations with privacy preservation
Challenge: exploring all possible itemsets is slow due to combinatorial
explosion.
Pruning the search tree requires us to declassify itemset supports during
computation (is this a leak?).
Solution: consider that the algorithm will publish all frequent itemsets, as that
is its intended goal.
We will compare support to the threshold privately, only declassifying the
result bit.
We will prune the search tree based on that bit.
Not a leak - if the itemset is frequent, we would have learned it from the
outputs anyway.
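The pruning strategy above can be sketched as a breadth-first (Apriori-style) search in which the only value that ever leaves the protected domain is one comparison bit per candidate itemset. The Secret class is a hypothetical stand-in that merely marks which values would be protected; it provides no actual protection, and in a real system its operations would be secure multi-party computation protocols.

```python
from itertools import combinations

ROWS = [
    [1, 1, 0, 1], [0, 1, 1, 0], [0, 0, 0, 1], [1, 1, 0, 0],
    [0, 1, 0, 1], [0, 1, 0, 1], [0, 0, 1, 0],
]  # columns: rendang, nasi lemak, lontong, chicken satay


class Secret:
    """Marks a value as living in a protection domain (NOT actually protected)."""

    def __init__(self, value):
        self._value = value

    def __add__(self, other):
        return Secret(self._value + other._value)

    def __mul__(self, other):
        return Secret(self._value * other._value)


def declassify_at_least(secret, threshold):
    """The single declassification point: reveals one bit, never the support."""
    return secret._value >= threshold


def support(columns, itemset):
    """Sum over transactions of the product of the itemset's bits."""
    total = Secret(0)
    for row in range(len(columns[0])):
        product = Secret(1)
        for item in itemset:
            product = product * columns[item][row]
        total = total + product
    return total


def frequent_itemsets(rows, n_items, threshold):
    columns = [[Secret(r[i]) for r in rows] for i in range(n_items)]
    frequent, level = [], [(i,) for i in range(n_items)]
    while level:
        # Each candidate on a level is independent, so this could run in parallel.
        survivors = [s for s in level
                     if declassify_at_least(support(columns, s), threshold)]
        frequent.extend(survivors)
        kept, size = set(survivors), len(level[0]) + 1
        # Next level: unions of survivors whose every smaller subset also survived.
        candidates = {
            union
            for a in survivors for b in survivors
            for union in [tuple(sorted(set(a) | set(b)))]
            if len(union) == size
            and all(sub in kept for sub in combinations(union, size - 1))
        }
        level = sorted(candidates)
    return frequent


result = frequent_itemsets(ROWS, n_items=4, threshold=3)
print(result)
```

With a threshold of 3 the search evaluates the four single items and only one pair, {nasi lemak, chicken satay}, instead of all fifteen non-empty itemsets. Note that for a pruned candidate the declassified bit also reveals that the itemset is not frequent, which the published list of frequent itemsets implies anyway.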