Privacy-Preserving Machine Learning: Methods, Challenges and Directions

 Runhua Xu1∗, Nathalie Baracaldo1 and James Joshi2†
 1 IBM Research - Almaden Research Center, San Jose, CA, United States, 95120
 2 School of Computing and Information, University of Pittsburgh, Pittsburgh, PA, United States, 15260
 runhua@ibm.com, baracald@us.ibm.com, jjoshi@pitt.edu

 arXiv:2108.04417v1 [cs.LG] 10 Aug 2021

 Abstract
 Machine learning (ML) is increasingly being adopted in a wide variety of
 application domains. Usually, a well-performing ML model, especially an
 emerging deep neural network model, relies on a large volume of training data and
 high-powered computational resources. The need for a vast volume of available
 data raises serious privacy concerns because of the risk of leakage of highly
 privacy-sensitive information and the evolving regulatory environments that
 increasingly restrict access to and use of privacy-sensitive data. Furthermore,
 a trained ML model may also be vulnerable to adversarial attacks such as
 membership/property inference attacks and model inversion attacks. Hence,
 well-designed privacy-preserving ML (PPML) solutions are crucial and have
 attracted increasing research interest from academia and industry. More and
 more PPML efforts integrate privacy-preserving techniques into ML algorithms,
 fuse privacy-preserving approaches into the ML pipeline, or design various
 privacy-preserving architectures for existing ML systems. In particular,
 existing PPML work cross-cuts ML, systems, security, and privacy; hence,
 there is a critical need to understand state-of-the-art studies, related challenges, and a
 roadmap for future research. This paper systematically reviews and summarizes
 existing privacy-preserving approaches and proposes a PGU model to guide
 the evaluation of various PPML solutions by decomposing their
 privacy-preserving functionalities. The PGU model is designed as the triad of
 Phase, Guarantee, and technical Utility. Furthermore, we also discuss the unique
 characteristics and challenges of PPML and outline possible directions of future
 work that benefit a wide range of research communities across the ML, distributed
 systems, security, and privacy areas.

 Key Phrases: Survey, Machine Learning, Privacy-Preserving Machine Learning

 ∗ Part of this work was done while Runhua Xu was at the School of Computing and Information, University
 of Pittsburgh.
 † This work was performed while James Joshi was serving as a program director at NSF; and the work
 represents the views of the authors and not that of NSF.

 Preprint.
Contents
1 Introduction 3

2 Machine Learning Pipeline in a Nutshell 5
 2.1 Differentiate Computation Tasks in Model Training and Serving . . . . . . . . . . 5
 2.2 An Illustration of Third-party Facility-related ML Pipeline . . . . . . . . . . . . . 6

3 Privacy-Preserving Phases in PPML 7
 3.1 Privacy-Preserving Model Creation . . . . . . . . . . . . . . . . . . . . . . . . . . 7
 3.1.1 Privacy-Preserving Data Preparation . . . . . . . . . . . . . . . . . . . . . 7
 3.1.2 Privacy-Preserving Model Training . . . . . . . . . . . . . . . . . . . . . 8
 3.2 Privacy-Preserving Model Serving . . . . . . . . . . . . . . . . . . . . . . . . . . 9
 3.3 Full Privacy-Preserving Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

4 Privacy Guarantee in PPML 10
 4.1 Object-Oriented Privacy Guarantee . . . . . . . . . . . . . . . . . . . . . . . . . . 10
 4.2 Pipeline-Oriented Privacy Guarantee . . . . . . . . . . . . . . . . . . . . . . . . . . 11

5 Technical Utility in PPML 12
 5.1 Type I: Data Publishing Approaches . . . . . . . . . . . . . . . . . . . . . . . . . 12
 5.1.1 Elimination-based Approaches . . . . . . . . . . . . . . . . . . . . . . . . 13
 5.1.2 Perturbation-based Approaches . . . . . . . . . . . . . . . . . . . . . . . 13
 5.1.3 Confusion-based Approaches . . . . . . . . . . . . . . . . . . . . . . . . 14
 5.2 Type II: Data Processing Approaches . . . . . . . . . . . . . . . . . . . . . . . . . 14
 5.2.1 Additive Mask Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . 15
 5.2.2 Garbled Circuits Approaches . . . . . . . . . . . . . . . . . . . . . . . . . 16
 5.2.3 Modern Cryptographic Approaches . . . . . . . . . . . . . . . . . . . . . 17
 5.2.4 Mixed-Protocol Approach . . . . . . . . . . . . . . . . . . . . . . . . . . 19
 5.2.5 Trusted Execution Environment Approach . . . . . . . . . . . . . . . . . . 20
 5.3 Type III: Architecture Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
 5.3.1 Delegation-based ML Architecture . . . . . . . . . . . . . . . . . . . . . . . 21
 5.3.2 Distributed Selective SGD Architecture . . . . . . . . . . . . . . . . . . . 22
 5.3.3 Federated Learning (FL) Architecture . . . . . . . . . . . . . . . . . . . . 22
 5.3.4 Knowledge Transfer Architecture . . . . . . . . . . . . . . . . . . . . . . 22
 5.4 Type IV: Hybrid Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
 5.5 Technical Path and Utility Cost . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

6 Challenges and Potential Directions 24
 6.1 Open Problems and Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
 6.2 Research Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
 6.2.1 Systematic Definition, Measurement and Evaluation of Privacy . . . . . . . 26
 6.2.2 Strategies of Attack and Defense . . . . . . . . . . . . . . . . . . . . . . . 26
 6.2.3 Communication Efficiency Improvement . . . . . . . . . . . . . . . . . . 27
 6.2.4 Computation Efficiency Improvement . . . . . . . . . . . . . . . . . . . . 27
 6.2.5 Privacy Perturbation Budget and Model Utility . . . . . . . . . . . . . . . 27
 6.2.6 New Deployment Approaches of Differential Privacy in PPML . . . . . . . 28
 6.2.7 Compatibility of Privacy, Fairness, and Robustness . . . . . . . . . . . . . 28
 6.2.8 Novel Architecture of PPML . . . . . . . . . . . . . . . . . . . . . . . . . 28
 6.2.9 New Model Publishing Method for PPML . . . . . . . . . . . . . . . . . . 28
 6.2.10 Interpretability in PPML . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
 6.2.11 Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

7 Conclusion 29

1 Introduction

Machine learning (ML) is increasingly being applied in a wide variety of application domains. For
instance, emerging deep neural networks, also known as deep learning (DL), have shown significant
improvements in model accuracy and performance, especially in application areas such as computer
vision, natural language processing, speech or audio recognition [1, 2, 3]. Emerging federated
learning (FL) is another collaborative ML technique that enables training a high-quality model while
training data remains distributed over multiple decentralized devices [4, 5]. FL has shown its promise
in various application domains, including healthcare, vehicular networks, intelligent manufacturing,
among others [6, 7, 8]. Although these models have shown considerable success in AI-powered or
ML-driven applications, they still face several challenges, such as (i) lack of powerful computational
resources and (ii) availability of vast volumes of data for model training. In general, the performance
of an ML system relies on a large volume of training data and high-powered computational resources
to support both the training and inference phases.
To address the need for computing resources with high-performance CPUs and GPUs, large memory
and storage, etc., existing commercial ML-related infrastructure service providers, such as Amazon,
Microsoft, Google, and IBM, have devoted significant effort toward building infrastructure as a
service (IaaS) or machine learning as a service (MLaaS) platforms offered at a rental cost.
Resource-limited clients can employ ML-related IaaS to manage and train their
models first and then directly provide data analytics and prediction services through their applications.
Availability of a massive volume of training data is another challenge for ML systems. Intuitively,
more training data indicates better performance of an ML model; thus, there is a need for collecting
large volumes of data that are potentially from multiple sources. However, the collection and use
of the data, and even the creation and use of ML models, raise serious privacy concerns because of
the risk of leakage of highly secure or private/confidential information. For instance, recent data
breaches have increased the privacy concerns of large-scale collection and use of personal data
[9, 10]. An adversary can also infer private information by exploiting an ML model via various
inference attacks such as membership inference attacks [11, 12, 13, 14, 15], model inversion attacks
[16, 17, 18], property inference attacks [19, 20], and privacy leakage from the exchanged gradients
in distributed ML settings [21, 22]. In an example of a membership inference attack, an attacker
can infer whether or not data related to a particular patient has been included in the training of an
HIV-related ML model. In addition, existing regulations such as the Health Insurance Portability and
Accountability Act (HIPAA) and more recent ones such as the European General Data Protection
Regulation (GDPR), Cybersecurity Law of China, California Consumer Privacy Act (CCPA), etc.,
are increasingly restricting the availability and use of privacy-sensitive data. Such privacy concerns
of users and the requirements imposed by regulations pose a significant challenge for the training of a
well-performing ML model; consequently, they hinder the adoption of ML models in
real-world applications.
To tackle the increasing privacy concerns related to using ML in applications, in which users’ privacy-
sensitive data such as electronic health/medical records, location information, etc., are stored and
processed, it is crucial to devise innovative privacy-preserving ML solutions. More recently, there have
been increasing efforts focused on PPML research that integrate existing traditional anonymization
mechanisms into ML pipelines or design privacy-preserving methods and architectures for ML
systems. Recent ML-related or FL-oriented surveys such as in [23, 24, 25, 26, 27, 28] have partially
illustrated or discussed specific privacy and security issues in ML or FL systems. Each existing
PPML approach addresses only part of the privacy concerns or is applicable only to limited scenarios;
there is no unified PPML solution that comes without any sacrifice. For instance, the adoption of differential
privacy in ML systems can lead to model utility loss, e.g., reduced model accuracy. Similarly, the
use of secure multi-party computation approaches incurs high communication overhead, because a
large volume of intermediate data, such as garbled tables of circuit gates, must be transmitted
during the execution of the protocols, or high computation overhead due to the adopted ciphertext-
computational cryptosystems.
ML security issues, such as model stealing, Trojan injection, and attacks on the availability of ML
services, as well as the corresponding countermeasures, have been thoroughly discussed in the systematization
of knowledge [29], surveys [30, 31], and comprehensive analyses [32]. However, there is still a lack
of systematized discussion and evaluation from the privacy perspective. Inspired by
the CIA model (i.e., the confidentiality, integrity, and availability triad) designed to guide policies for

Figure 1: An overview of the PGU model to evaluate privacy-preserving machine learning systems
and an illustration of selected PPML examples in the PGU model. The demonstrated PPML examples in
the figure are HybridAlpha [33], DP-SGD [34], NN-EMD [35], and SA-FL [36].

information security within an organization, this paper proposes a PGU model to guide the evaluation
of various privacy-preserving machine learning solutions by decomposing their privacy-
preserving functionalities, as illustrated in Figure 1. The PGU model is designed as the triad of Phase,
Guarantee, and technical Utility: the phase captures where the privacy-preserving functionality
occurs in the ML pipeline; the guarantee denotes the intensity and scope of the privacy protection under
the given threat model and trust assumptions; the technical utility refers to the utility impact of the
techniques or designs adopted to achieve the privacy-preserving goals in the ML system. Based
on the PGU analysis framework, this paper also discusses possible challenges and potential future
research directions of PPML.
More specifically, this paper first introduces the general pipeline in an ML application scenario.
Then we discuss the PPML pipeline in terms of the phases in which privacy-preserving functionalities
occur in the process chain, usually including privacy-preserving data preparation,
privacy-preserving model training and evaluation, privacy-preserving model deployment, and privacy-
preserving model inference.
Next, based on common threat model settings and trust assumptions, we discuss the privacy guarantee
of existing PPML solutions by analyzing the intensity and scope of the privacy protection from two
aspects: object-oriented privacy protection and pipeline-oriented privacy protection. From the object
perspective, PPML solutions either target the data privacy to prevent private information leakage
from the training or inference data samples or the model privacy to mitigate the privacy disclosure in
the trained model. From the pipeline perspective, PPML solutions focus on the privacy-preserving
functionality at one entity or a set of entities in the pipeline of an ML solution, e.g., the distributed
ML system and IaaS/MLaaS-related ML system. Usually, it includes local privacy, global privacy,
and full-chain privacy.
In addition, the paper evaluates PPML systems by investigating their technical utility. In particular,
we first elaborate on the underlying privacy-preserving techniques by decomposing and classifying existing
PPML solutions into four categories: data publishing approaches, data processing approaches,
architecture-based approaches, and hybrid approaches. Then, we analyze their impact on an
ML system from a set of aspects: computation utility, communication utility, model utility, scalability
utility, scenario utility and privacy utility.
Finally, we present our viewpoints on the challenges of designing PPML solutions and on future
research directions in the PPML area. Our discussion and analysis broadly contribute to the machine
learning, distributed systems, security, and privacy areas.
Organization. The remainder of this paper is organized as follows. We briefly present the ML
pipeline in Section 2 by reviewing the critical tasks in ML-related systems and illustrating
third-party facility-related ML solutions. In Section 3, we present a general discussion of existing
privacy-preserving methods with consideration of the phase where they are applied, and discuss the

privacy guarantee in Section 4. We investigate the technical utility of PPML solutions in Section 5
by summarizing and classifying privacy-preserving techniques and their impact on ML systems.
Furthermore, we also discuss the challenges and open problems in PPML solutions and outline the
promising directions of future research in Section 6. Finally, we conclude the paper in Section 7.

2 Machine Learning Pipeline in a Nutshell
An ML system typically includes four phases: data preparation or preprocessing, model training and
evaluation, model deployment, and model serving. More coarsely, it can be summarized as training and
serving. The former includes operations such as collecting and preprocessing data, training a model,
and evaluating the model performance. The latter mainly focuses on using a trained model, such as
how to deploy the model and provide an inference service.
In this section, we first formally differentiate the computation tasks between model training and model
serving, which will support the follow-up discussion of privacy-preserving training and privacy-
preserving serving approaches from the perspective of the computation task. Next, we illustrate
generic process paths in the ML pipeline, covering the adoption of both self-owned and
third-party facilities, so as to cover most ML applications. Specifically, we divide the processing pipeline
into three trust domains: the data owner trust domain, the third-party trust domain, and the model user trust
domain. Based on these trust domains, we can set up various trust assumptions and potential
adversary behaviors to analyze the privacy guarantee of existing PPML solutions.

2.1 Differentiate Computation Tasks in Model Training and Serving

From the underlying computation task perspective, there is no strict boundary between the model
training and model serving (i.e., inference). The computed function in the serving procedure could be
viewed as a simplified version of the training procedure without loss computation, regularization,
(partial) derivatives, and model weights update. For instance, in a stochastic gradient descent (SGD)
based training approach, the computation that occurs at the inference phase could be viewed as one
round of computation in the training phase without operations related to model gradients update. In a
more complex neural network example, the training computation is the process where a set of data
is continuously fed to the designed network for multiple training epochs. In contrast, the inference
computation can be treated as a single pass of computation for one data sample to generate a
predicted label, without the back-propagation procedure and the related regularization or normalization.
Formally, given a set of training samples denoted as $(x_1, y_1), \ldots, (x_n, y_n)$, where $x_i \in \mathbb{R}^m$ and $y_i \in \mathbb{R}$, the
goal of ML model training (for simplicity, assume a linear model) is to learn a fit function denoted
as
$$ f_{w,b}(x) = w^{\top} x + b, \qquad (1) $$
where $w \in \mathbb{R}^m$ is the vector of model parameters and $b$ is the intercept. To find proper model parameters,
usually, we need to minimize the regularized training error given by
$$ E(w, b) = \frac{1}{n} \sum_{i=1}^{n} L\big(y_i, f_{w,b}(x_i)\big) + \alpha R(w), \qquad (2) $$
where $L(\cdot)$ is a loss function that measures model fit and $R(\cdot)$ is a regularization term (a.k.a., penalty)
that penalizes model complexity; $\alpha$ is a non-negative hyperparameter that controls the regularization
strength. Regardless of the various choices of $L(\cdot)$ and $R(\cdot)$, stochastic gradient descent (SGD) is a
common optimization method for unconstrained optimization problems as discussed above. A simple
SGD method iterates over the training samples and, for each sample, updates the model parameters
according to the update rules given by
$$ w \leftarrow w - \eta \nabla_w E = w - \eta \left[ \alpha \nabla_w R + \nabla_w L \right], \qquad (3) $$
$$ b \leftarrow b - \eta \nabla_b E = b - \eta \left[ \alpha \nabla_b R + \nabla_b L \right], \qquad (4) $$
where $\eta$ is the learning rate, which controls the step size in the parameter space.
Given the trained model $(w_{\text{trained}}, b_{\text{trained}})$, the goal of model serving is to predict a value $\hat{y}$ for a
target sample $x$ as follows:
$$ \hat{y} = f_{w_{\text{trained}}, b_{\text{trained}}}(x). \qquad (5) $$
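
To make the notation concrete, the following minimal sketch implements the training updates of Equations (3)-(4) and the serving step of Equation (5) for a linear model with squared loss and L2 regularization; the function names and hyperparameter defaults are illustrative choices, not part of the surveyed formulation.

```python
import numpy as np

def train_sgd(X, y, alpha=0.01, eta=0.01, epochs=20, seed=0):
    """Plain SGD for the linear model f_{w,b}(x) = w^T x + b (Eq. (1))."""
    rng = np.random.default_rng(seed)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            err = (w @ X[i] + b) - y[i]          # gradient of the squared loss
            w -= eta * (alpha * w + err * X[i])  # Eq. (3) with R(w) = ||w||^2 / 2
            b -= eta * err                       # Eq. (4); R does not depend on b
    return w, b

def serve(w_trained, b_trained, x):
    """Model serving (Eq. (5)) reuses the fit function of Eq. (1)."""
    return w_trained @ x + b_trained
```

As the sketch makes explicit, serving is a single forward evaluation, whereas training repeats that evaluation together with gradient updates over many samples and epochs.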

Figure 2: An illustration of the machine learning pipeline (upper part) and a demonstration of the corresponding
processes with trust domain partitions (lower part) in various ML application scenarios.

As demonstrated above, the computed function in the inference phase (i.e., Equation (5)) is part of the
computation procedure of SGD training (i.e., Equation (2)). Taking an emerging deep neural network
as an example, the model training process is a model inference process with extra back-propagation to
compute the partial derivatives of the weights. In particular, the task of privacy-preserving training is more
challenging than the task of privacy-preserving serving. Most existing computation-oriented privacy-
preserving training solutions can also achieve privacy-preserving serving even if the proposal does not
explicitly state so, but not vice versa. A more specific demonstration is presented in the rest of the paper.

2.2 An Illustration of Third-party Facility-related ML Pipeline

As depicted in Figure 2, we briefly overview the ML pipeline and the corresponding process chains
and facilities, including self-owned resources and third-party computational resources (e.g., IaaS,
PaaS, and MLaaS). Note that the third-party facility-related ML pipeline is also widely adopted
in recent ML-related applications. The pipeline is divided into four stages: data preparation, which
collects and preprocesses the dataset; model training and evaluation, which applies or designs a proper ML
algorithm to train and evaluate an ML model; model deployment, which delivers the model to the
target user or deploys the model to a third-party agency; and model serving, which generates prediction
results or provides an inference service.
From the perspective of the types of participants, the ML pipeline includes three entities/roles: the
data producer (DP), the model consumer (MC), and the computational facility (CF), which may be
owned by the data producer and model consumer themselves or provided by a third-party entity.
The data producer owns and provides a vast volume of training data to learn various ML models.
At the same time, the model consumer also owns vast amounts of target data and expects to acquire
various ML inference services, such as classifying the target data with labels, predicting values
based on the target data, or aggregating the data into several groups.
For the stage of model training/evaluation or model deployment/serving, there exist two possible
cases: the data producer (or the model consumer) (i) intends to use locally owned CFs, or (ii) prefers
to employ third-party CFs instead of local CFs. As a result, different computation options exist for
model training/evaluation and model deployment/serving. As illustrated in Figure 2, in case (i), the
data producer can train a complete model locally (T1) or locally train a partial model that can be
used to aggregate a global ML model in a collaborative or distributed manner (T2). In case (ii), the
data producer directly sends out its data to third-party entities that have computational facilities to
employ their computational resources to train an ML model (T3). Such third-party facilities may
include the edge nodes in the edge computing environment and/or IaaS servers in the cloud computing
environment. Accordingly, the model consumer may also acquire the trained model directly (D1) for
model inference service with unlimited access (S1) if it has local computation facilities; otherwise,
the model consumer can also utilize a third-party facility where the trained model is deployed (D2) to
acquire the prediction service (S2).

From the perspective of privacy-preserving phases in the ML pipeline, we classify PPML research
into two directions: the privacy-preserving training phase, including private data preparation and
model creation, and the privacy-preserving serving phase, involving private model deployment and
serving. We discuss the affected phases of privacy-preserving approaches in detail in Section 3.
It is also important to consider the trust domains to characterize the trust assumptions and related
threat models. We divide the ML system into three domains: the data owner’s local trust domain, the
third-party CF trust domain, and the model user’s trust domain. Based on those trust domains, we
analyze various types of privacy guarantees provided by a PPML system. We present more details in
Section 4.
From the perspective of the underlying adopted techniques, we decompose recently proposed PPML
solutions to summarize and classify their critical privacy-preserving components and evaluate their
potential utility impact. Intuitively, underlying privacy-preserving techniques such as differential
privacy, conventional multi-party computation built on garbled circuits and oblivious transfer,
or customized secure protocols are widely employed in PPML solutions. Besides, well-
designed learning architectures have been broadly studied under specific trust domain and threat
model settings. Furthermore, emerging advanced cryptosystems such as homomorphic encryption
and functional encryption also show their promise for PPML with strong privacy guarantees. A more
detailed taxonomy and analysis are presented in Section 5.

3 Privacy-Preserving Phases in PPML
This section evaluates existing PPML solutions from the perspective of privacy-preserving phases in
an ML pipeline. Figure 2 illustrates four phases in a typical ML pipeline: data preparation, model
training and evaluation, model deployment and model serving. Correspondingly, the existing PPML
pipeline involves (i) privacy-preserving data preparation, (ii) privacy-preserving model training
and evaluation, (iii) privacy-preserving model deployment, and (iv) privacy-preserving inference. For
simplicity, this section analyzes PPML mainly in terms of privacy-preserving model creation, covering
phases (i)-(ii), and privacy-preserving model serving, covering phases (iii)-(iv), to avoid overlapping
discussion among those phases.

3.1 Privacy-Preserving Model Creation

Most privacy-preserving model creation solutions emphasize that the adopted privacy-preserving
approaches should prevent the private information in the training data from leaking outside the trusted scope
of the data sources. In particular, the key factors of model creation are data and computation;
consequently, the research community tries to tackle the challenge of privacy-preserving model creation
from the following two aspects:

 (i) how to distill/filter the training data such that the data in the subsequent processing includes
 less or even no private information;
 (ii) how to process the training data in a private manner.

3.1.1 Privacy-Preserving Data Preparation
From the perspective of data, existing privacy-preserving approaches work in the following directions:
(i) adopting traditional anonymization mechanisms such as $k$-anonymity [37], $l$-diversity [38], and
$t$-closeness [39] to remove identifier information from the training data before sending the data out
for training; (ii) representing the raw dataset by a surrogate dataset, e.g., by grouping the anonymized data
[40] or abstracting the dataset via sketch techniques [41, 42]; (iii) employing $\epsilon$-differential privacy
mechanisms [43, 44, 45] to add noise (calibrated to a privacy budget) to the dataset to avoid private information
leakage through statistical queries.
Specifically, the proposal in [46] tries to provide $k$-anonymity in the data mining algorithm, while
the works in [47, 48] focus on the utility metric and provide a suite of anonymization algorithms to
produce an anonymous view tailored to ML workloads. On the other hand, the differential privacy
mechanism has recently shown its promise in emerging DL models that rely on training on large-scale datasets.
For example, an initial work in [34] proposes a differentially private stochastic gradient descent
approach to train privacy-preserving DL models. Regarding the most recently proposed representative

works, the proposal [49] demonstrates that it is possible to train large recurrent language models with
user-level differential privacy guarantees with only a negligible cost in predictive accuracy. Recent
parameter-transfer meta-learning (e.g., applications such as few-shot learning, federated learning,
and reinforcement learning) often requires the task owners to share model parameters, which may result
in privacy disclosure. Proposals for privacy-preserving meta-learning such as those presented in [50, 33]
try to tackle the problem of private information leakage in federated learning (FL) by proposing
algorithms that achieve client-side (local) differential privacy. The work in [51] formalizes the notion
of task-global differential privacy and proposes a differentially private algorithm for gradient-based
parameter transfer that satisfies the privacy requirement while retaining provable transfer learning
guarantees in convex settings.
Thanks to recent successful studies on computation over ciphertext (i.e., practical computation
over encrypted data) in the cryptography community, data encryption is becoming
another promising approach to protect the privacy of the training data. Unlike the traditional
anonymization mechanisms or differential privacy mechanisms, which are still susceptible to inference or
de-anonymization attacks such as those demonstrated in [52, 53, 11, 54], wherein the adversary may
have additional background knowledge, the encryption approaches can provide a strong privacy
guarantee - called confidential-level privacy in the rest of the paper - and hence are receiving more
and more attention in recent studies [55, 56, 33, 57, 58, 59, 60, 61, 62], wherein the training data or
the transferred model parameters are protected by cryptosystems while still allowing subsequent
computation outside the trusted scope of the data sources.

3.1.2 Privacy-Preserving Model Training

From the perspective of computation, existing privacy-preserving approaches are correspond-
ingly divided into two directions: (i) when the training data is processed by conventional
anonymization or differential privacy mechanisms, the subsequent training computation is the same
as vanilla model training; (ii) when the training data is protected via cryptosystems, due to
the confidential-level privacy, the privacy-preserving (i.e., crypto-based) training computation is
more complex than normal model training. The demand for training computation over ciphertext
means that the direct use of traditional cryptosystems such as AES and DES is not applicable here.
Instead, crypto-based training relies on recently proposed modern cryptographic schemes, mainly
homomorphic encryption [63, 64, 65, 66, 67] and functional encryption [68, 69, 70, 71, 72, 73, 74],
which allow computation over encrypted data. In general, homomorphic encryption (HE) is a
form of public-key encryption that allows one to perform computation over encrypted data without
decrypting it; the result of the computation is still in encrypted form and is the same as if the
computation had been performed on the original data. Similarly, functional encryption (FE) is
a generalization of public-key encryption in which a party holding an issued functional decryption
key can learn a function of what the ciphertext encrypts without learning the underlying
protected inputs.
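
As a concrete illustration of the additively homomorphic property described above, the following toy Paillier-style sketch (with deliberately tiny, insecure parameters) shows how two ciphertexts can be combined so that decryption yields the sum of the plaintexts; it is a pedagogical sketch, not the construction used by any of the cited schemes.

```python
import math, random

def keygen(p=61, q=53):
    """Toy Paillier key generation; real deployments use >= 2048-bit moduli."""
    n, n2 = p * q, (p * q) ** 2
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)                               # valid because g = n + 1
    return (n, n2), (lam, mu, n, n2)

def encrypt(pk, m):
    n, n2 = pk
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2    # c = g^m * r^n mod n^2

def decrypt(sk, c):
    lam, mu, n, n2 = sk
    return ((pow(c, lam, n2) - 1) // n) * mu % n       # L(c^lam mod n^2) * mu mod n

def add_ciphertexts(pk, c1, c2):
    """Multiplying ciphertexts corresponds to adding the underlying plaintexts."""
    return (c1 * c2) % pk[1]

pk, sk = keygen()
c = add_ciphertexts(pk, encrypt(pk, 12), encrypt(pk, 30))
assert decrypt(sk, c) == 42
```
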
Unlike the normal training process, there may be an extra step - data conversion
or data encoding - in crypto-based training, since most of those cryptosystems, such as multi-
party functional encryption [70, 75] and the BGV scheme [76] (an implemented homomorphic
encryption scheme), are built on an integer group, while the training data or the exchanged model parameters
are usually in floating-point format, especially when the training samples are preprocessed via
commonly adopted methods such as feature encoding, discretization, and normalization. However,
this is not a requisite restriction among all crypto-based training approaches; for instance, the emerging
CKKS scheme [77] - an instance of homomorphic encryption - supports approximate
arithmetic computation. The data conversion usually includes a pair of encoding and decoding
operations. The encoding step converts the floating-point numbers into integers so that the data can be
encrypted and then used in crypto-based training. Conversely, the decoding step is applied to the
trained model or the crypto-based training result to recover the floating-point numbers. The
effectiveness and efficiency of these rescaling procedures rely on the conversion precision setting.
We discuss the potential impact of the data conversion in Section 5.
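
The following minimal sketch illustrates this kind of fixed-point encoding and decoding step; the 10^precision scaling, the 2^64 modulus, and the function names are illustrative assumptions rather than the encoding used by any specific scheme.

```python
MODULUS = 2 ** 64          # illustrative integer group size
PRECISION = 6              # decimal digits preserved by the encoding

def encode(x):
    """Map a float to a non-negative integer in the group Z_MODULUS."""
    return int(round(x * 10 ** PRECISION)) % MODULUS

def decode(v):
    """Invert the encoding, recovering the signed floating-point value."""
    if v > MODULUS // 2:   # values in the upper half represent negatives
        v -= MODULUS
    return v / 10 ** PRECISION

# Encodings add homomorphically modulo MODULUS, which is what additive
# crypto-based training pipelines rely on; precision bounds the accuracy.
a, b = 0.125, -3.5
assert abs(decode((encode(a) + encode(b)) % MODULUS) - (a + b)) < 1e-6
```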

3.2 Privacy-Preserving Model Serving

There is no clear boundary between the privacy-preserving model deployment and model inference
among most PPML solutions; hence we combine the discussion as the privacy-preserving model
serving here.
From the perspective of computation, unlike the task of privacy-preserving training, tackling privacy-
preserving model serving is relatively simpler, as demonstrated in Section 2.1. Except
for emerging complex deep neural network models, there are few separate studies focused
on privacy-preserving inference. According to our exploration, most PPML solutions such as
[60, 78, 79, 80, 61, 81, 82, 83] that apply modern cryptosystems (mainly homomorphic
encryption and related schemes) only target privacy-preserving inference, as those crypto-based
solutions are not efficient enough to be applied to the complex and massive training computation of
neural networks. Note that those privacy-preserving serving solutions mainly focus on protecting the
data samples to be inferred, the trained model, or both.
Besides, another branch of privacy-preserving model serving research focuses on the privacy-
preserving model query or publication in the case where the trained model is deployed in a privacy-
preserving manner and the model consumer and model owner are usually separated. The core
problem in this branch is how to prevent an adversarial model user from inferring
private information about the training data. According to various model inference attack settings,
the adversary has a limited (or unlimited) number of queries to the trained model. In addition, the
adversary may (or may not) have extra background knowledge of the trained model specification,
usually called white-box (or black-box) attacks. To address those inference attack issues, a naive
privacy-preserving solution is to limit the number of queries or reduce the background information of the
released model for a specific model user. Beyond that, potential prevention methods include the
following directions:

 (i) private aggregation of teacher ensembles (PATE) approaches [84, 85, 86], wherein the
 knowledge of an ensemble of “teacher” models is transferred to a “student” model, with
 intuitive privacy provided by training teachers on disjoint data and strong privacy guaranteed
 by noisy aggregation of the teachers' answers (a minimal sketch of the noisy aggregation
 step is shown after this list);
 (ii) model transformation approach such as MiniONN [87] and variant solution [88], where an
 existing model is transformed into an oblivious neural network supporting privacy-preserving
 predictions with reasonable efficiency;
 (iii) model compression approaches, especially those applied to emerging deep learning models
 with large-scale model parameters, where knowledge distillation methods [89, 90] are
 adopted to compress the trained deep neural network model. Even though the main goal
 of knowledge distillation is to reduce the size of the deep neural network model, such
 a method also brings additional privacy-preserving functionality [91, 92]. Intuitively, the
 distillation procedure not only removes redundant information in the model, but also
 reduces the probability that the adversary can infer potential private information in the model
 by iterative queries.
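
The sketch below illustrates only the noisy aggregation step in the spirit of the PATE approaches from item (i); the function name, the Laplace noise scale of 1/epsilon, and the interface are assumptions for illustration, not the exact mechanism of [84, 85, 86].

```python
import numpy as np

def noisy_teacher_vote(teacher_labels, num_classes, epsilon, rng=None):
    """Return the noisy-argmax of teacher votes for one student query."""
    rng = np.random.default_rng() if rng is None else rng
    counts = np.bincount(teacher_labels, minlength=num_classes).astype(float)
    counts += rng.laplace(0.0, 1.0 / epsilon, size=num_classes)  # perturb counts
    return int(np.argmax(counts))

# Example: 50 teachers trained on disjoint partitions vote on one query.
rng = np.random.default_rng(0)
votes = rng.integers(0, 10, size=50)            # each teacher's predicted class
student_label = noisy_teacher_vote(votes, num_classes=10, epsilon=0.5, rng=rng)
```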

3.3 Full Privacy-Preserving Pipeline

The notion of a full privacy-preserving pipeline is rarely mentioned in PPML proposals. Existing
PPML solutions either claim the support of privacy-preserving model creation or focus on privacy-
preserving model serving. As demonstrated in Section 2.1, the computation tasks at the model
inference phase can be viewed as a non-iterative and simplified version of the model training
operations. Thus, from the perspective of the computation task, privacy-preserving inference can be
viewed as a sub-problem of privacy-preserving training. Existing computation-oriented PPML
proposals focusing on privacy-preserving training implicitly support privacy-preserving inference,
as illustrated in [93, 94, 95, 56, 62, 35], and hence they can be evaluated as full privacy-preserving
pipeline approaches.
We note that PPML solutions relying on privacy-preserving data preparation approaches such as
anonymization, sketching, and differential privacy are usually not compatible with privacy-preserving
inference. The goal of inference is to acquire an accurate prediction for a single data point; however,
those approximation or perturbation techniques are either not applicable to the data to be predicted

or reduce the data utility for inference. Thus, the data-oriented PPML proposals introduced in
Section 3.1 are incompatible with the privacy-preserving inference goal.
Besides, another route to a complete privacy-preserving pipeline is to trivially integrate
privacy-preserving model creation approaches with privacy-preserving model serving approaches
(i.e., privacy-preserving model query or publication-based methods for model deployment, accompa-
nied by computation-oriented privacy-preserving inference). For instance, it is possible to produce
a deep neural network model with most privacy-preserving training approaches; the trained
model can then be transformed into an oblivious neural network to support privacy-preserving predictions.

4 Privacy Guarantee in PPML
Privacy is a concept in disarray. It is hard to accurately articulate what privacy means. In general,
privacy is a sweeping concept, encompassing freedom of thought, control over one’s body, solitude
in one’s home, control over personal information, freedom from surveillance, protection of one’s
reputation, and protection from searches and interrogations [96]. In our view, privacy is a
subjective assessment of the degree to which it is acceptable for personal information to be disclosed to
untrusted domains and of how much personal information is publicly available. It is a challenge to
define what private information is and how to measure privacy, since privacy is a subjective notion
and different people may have different perceptions or viewpoints.
In the digital environment, usually, a widely accepted notion of minimum privacy coverage is the
personal identity [37, 43, 97]. Some common approaches include differential privacy mechanisms
[43, 97] and the k-anonymity mechanism [37] with its follow-up works such as l-diversity [38] and t-closeness
[39]. Differential privacy aims to hide individuals within the patterns of a dataset, e.g., the results of
predefined functions queried from the dataset. The general idea behind differential privacy is that
if the effect of making an arbitrary single substitution in the database is small enough, the query
result cannot be used to infer much about any single individual, and therefore such a query result
provides privacy protection [43, 97]. The goal of the anonymization mechanisms is to remove the
identity-related information from the dataset directly. Given a person-specific field-structured dataset,
k-anonymity and its variants focus on defining and hiding the identifiers and quasi-identifiers with
guarantees that the individuals who are the subjects of the data cannot be re-identified while the data
remain useful [37].
Various privacy-related terms and notions exist in PPML proposals to claim their privacy-preserving
functionality, resulting in no universal definition for privacy guarantee in PPML. This paper explores
those widely discussed privacy-related notions to analyze the privacy guarantee of PPML from two
aspects: object-oriented privacy guarantee and pipeline-oriented privacy guarantee. The former
focuses on measuring the privacy protection on specific objects in PPML, namely, the critical
objects such as the trained model weights, the exchanged gradients, the training or inference data
samples, among others. The latter evaluates the privacy guarantee by inspecting the entire pipeline,
as illustrated in Figure 2. We elaborate on each perspective in the following sections.

4.1 Object-Oriented Privacy Guarantee

The privacy claim of most early PPML solutions is object-oriented, where the privacy protection
objectives are on a specific object in the PPML system. To achieve the privacy goal, a set of PPML
solutions protect the dataset directly, such as removing identifiers and quasi-identifiers from the
dataset using empirical anonymization mechanisms [37, 38, 39]. To overcome the limitations of such
empirical privacy guarantees, the differential privacy (DP) mechanism [43, 97] was proposed and has been widely
adopted across various domains, since the privacy guarantee of DP is mathematically proven. In
addition, encryption is a stricter approach to protecting the data, where learning must proceed
"in the dark" since the data is confused and diffused. We summarize this kind of privacy guarantee as
the data-oriented privacy guarantee, or data privacy for short in the rest of the paper, as described in
Definition 1.

Definition 1 (Data Oriented Privacy Guarantee) A PPML solution claims the data-oriented pri-
vacy guarantee if no PPML adversary can learn private information directly from the
training/inference data samples or link private information to a specific person's identifier.

In short, data-oriented privacy-preserving approaches aim to prevent privacy leakage from the dataset
directly. However, privacy is not free; an obvious limitation of data privacy is the sacrifice of
data utility. For instance, the empirical anonymization mechanisms need to aggregate and remove
certain feature values; some values of quasi-identifier features are entirely removed
or partially deleted to satisfy the $l$-diversity and $t$-closeness definitions. Similarly, the differential
privacy mechanism requires adding noise (calibrated to a privacy budget) to the data samples. Both
approaches impact the accuracy of the trained model. Encrypted data can provide confidentiality for the
dataset but brings extra computation overhead in the subsequent machine learning training.
Another set of PPML solutions focuses on providing privacy-preserving models, which means the
privacy protection objective is the trained model in the PPML system. The goal of the privacy-
preserving model is to prevent privacy leakage in the trained model and the use of the model.
Examples of such private information include membership, properties, etc. As described in Definition 2,
we summarize such a privacy guarantee as the model-oriented privacy guarantee, or model privacy
for short in the rest of the paper.

Definition 2 (Model Oriented Privacy Guarantee) A PPML solution claims the model-oriented
privacy guarantee if no PPML adversary can infer private information from the
model via limited or unlimited periodic queries to a specific model.

To achieve the model privacy guarantee, existing PPML solutions follow two directions: (i)
introducing differentially private training algorithms to perturb the trained model parameters; (ii)
controlling the number of model accesses and the model access patterns to limit the adversary's ability. For
instance, Abadi et al. [34] propose a differentially private stochastic gradient descent (DP-SGD)
algorithm that adds calibrated noise to the clipped gradients to achieve a
differentially private model. The private aggregation of teacher ensembles (PATE) framework [84]
designs a remarkable model deployment architecture, where an ensemble of models is trained as
the teacher models to provide a model inference service for a student model.
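
To make direction (i) concrete, the following sketch shows the core DP-SGD update in the spirit of Abadi et al. [34]: clip each per-example gradient and add calibrated Gaussian noise before averaging. The function signature and the way per-example gradients are supplied are illustrative assumptions, and the sketch omits the privacy accounting that the original algorithm performs.

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One differentially private SGD step on a flat parameter vector."""
    # 1. Clip each per-example gradient so its L2 norm is at most clip_norm.
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads * np.minimum(1.0, clip_norm / (norms + 1e-12))
    # 2. Sum the clipped gradients and add Gaussian noise scaled to the clip bound.
    noisy_sum = clipped.sum(axis=0) + rng.normal(
        0.0, noise_multiplier * clip_norm, size=params.shape)
    # 3. Average over the batch and apply an ordinary gradient step.
    return params - lr * noisy_sum / per_example_grads.shape[0]
```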

4.2 Pipeline-Oriented Privacy Guarantee

Existing privacy measurement approaches such as differential privacy and $k$-anonymity are only
applicable to specific objects such as datasets and models but cannot be directly adopted to assess
the privacy guarantee of a PPML pipeline. There is a lack of formal or informal approaches for
assessing the intensity and scope of privacy protection provided by an ML pipeline. We believe that
assessing the privacy guarantee relies on defining (i) the boundaries of data processing and (ii) the
trust assumption on each processing domain in the pipeline. For instance, suppose that a data owner
employs a third-party computational facility (CF) to process its privacy-sensitive data; this divides the
data processing workflow into two parts: the data owner's local domain and the CF's domain. If
the data owner fully trusts the CF, there may be no privacy concerns; otherwise, there is a demand for
a privacy guarantee on the data processing.
As depicted in Figure 2, we establish the trust boundaries of the processing pipeline in PPML as data
producers, local CF, third-party CF, and model consumers. From the perspective of a data owner
(i.e., data producer), it may place different levels of trust in other domains. For instance, the data producer
may fully trust its local CF, semi-trust the third-party CF, and have no trust at all in the model
consumer. Based on such trust assumptions for each boundary, we present the following taxonomy of privacy
guarantee notions from the data owner's perspective:

 • No Privacy Guarantee: Here, the raw training data is shared with third-party CFs, regardless
 of whether these CFs are trusted or not, without applying any privacy-preserving approaches.
 Each entity is able to acquire the original raw data to process or the ML model to consume
 according to its role in the ML pipeline.
 • Global Privacy Guarantee: Global privacy guarantee focuses on the ML model serving
 phase. Here, the data producer has generated the trained ML model with its own CFs, or trusted
 third-party entities with powerful CFs have helped train the ML model based on the raw data
 shared by the data producer. The global privacy guarantee aims to ensure that there is no
 privacy-sensitive information leakage during the model deployment and model inference
 phases. Essentially, the ML model is a statistical abstraction or pattern computed from the
 raw data, and hence the privacy disclosure that may occur here is viewed as a statistical
 leakage. Several typical privacy leakage examples include the leakage of membership, class
 representatives, properties, etc. For instance, an ML model for helping diagnose HIV can
 be trained using existing healthcare records of HIV patients. A successful membership
 attack on the model will allow an adversary to check whether a selected target sample (i.e.,
 patient’s record) is used in the training or not, resulting in revealing whether a target person
 is an HIV patient or not.
 • Vanilla Local Privacy Guarantee: The basic local privacy guarantee ensures that the privacy-
 sensitive raw data is not directly shared with other honest entities in the ML pipeline. The
 indirect-sharing approaches include: (i) the raw data is pre-processed to remove privacy-
 sensitive information or to obfuscate private information with noise before being sent out for
 model training; (ii) the raw data is used to pre-train a local model in a collaborative manner,
 while the generated model update is revealed to other entities.
 • Primary Local Privacy Guarantee: The primary local privacy guarantee is built upon the vanilla local
 privacy guarantee. The main difference is the trust assumption on the rest of the pipeline.
 Beyond the basic requirement of the vanilla privacy guarantee, it also requires that the shared
 local model update remain private against curious third-party CF entities, as opposed to honest CF
 entities such as the training server in an IaaS platform or the coordinator server in a distributed
 collaborative ML system.
 • Enhanced Local Privacy Guarantee: The enhanced local privacy guarantee is built upon the primary
 local privacy guarantee with the additional assumption that the third-party CF
 entities are totally untrusted.
 • Full Privacy Guarantee: The requirement of a full privacy guarantee includes both local
 privacy and global privacy. As defined above, the global privacy
 guarantee focuses on the ML model serving phase, while local privacy ensures the
 privacy guarantee in the model creation phase; full privacy therefore ensures privacy protection at
 each step of the ML pipeline, as illustrated in Figure 2.

In particular, the privacy leakage is specific to the threat model setting, which captures the adversary's
behaviors and abilities and assumes the worst situation that an ML system may have to deal with. At the
same time, the threat model also reflects users' confidence in the entire data processing pipeline and the
trustworthiness of each entity. Consequently, as mentioned above, the privacy guarantees strongly
correlate with the specific threat model defined in the PPML system.

5 Technical Utility in PPML
In this section, we first classify and discuss PPML solutions in a more specific manner, namely,
from the aspect of their privacy-preserving techniques, by decomposing those solutions and tracking how those
approaches tackle the following questions:

 • How is the privacy-sensitive data released/published?
 • How is the privacy-sensitive data processed/trained on?
 • Does the architecture of the ML system prevent the disclosure of privacy-sensitive informa-
 tion?

Correspondingly, we summarize the privacy-preserving approaches into four categories: data
publishing-based approaches, data processing-based approaches, architecture-based approaches, and
hybrid approaches that may combine or fuse two or three of the approaches mentioned earlier. Next, we
analyze the potential impact (i.e., utility cost) of those privacy-preserving techniques on normal ML
solutions in terms of computation utility, communication utility, model utility, scalability utility, etc.

5.1 Type I: Data Publishing Approaches

In general, the data publishing based privacy-preserving approaches in the PPML system include the
following types: approaches that partially eliminate the identifiers and/or quasi-identifiers in the raw
data; approaches that partially perturb the statistical result of the raw data; approaches that totally
confuse the raw data.

5.1.1 Elimination-based Approaches
The traditional anonymization mechanisms are classified as the private information elimination
approach, where techniques such as $k$-anonymity [37], $l$-diversity [38], and $t$-closeness [39] are
applied to the raw privacy-sensitive data to remove private information and resist potential
inference attacks. Specifically, the $k$-anonymity mechanism ensures that the information for each
person contained in the released dataset cannot be distinguished from that of at least $k-1$ other individuals
whose information is also released in the dataset. To achieve this, $k$-anonymity defines identifiers
and quasi-identifiers for each data attribute, and then removes the identifiers and partially hides the
quasi-identifier information. The $l$-diversity mechanism introduces the concept of an equivalence
class, where an equivalence class is $l$-diverse if there are at least $l$ “well-represented” values for
the sensitive attribute. Essentially, as an extension of the $k$-anonymity mechanism, the $l$-diversity
mechanism reduces the granularity of the data representation and additionally maintains the diversity
of sensitive fields by adopting techniques such as generalization and suppression, such that any
given record maps to at least $k-1$ other records in the dataset. The $t$-closeness mechanism further refines
$l$-diversity by introducing an additional restriction on the value distribution within an equivalence
class: an equivalence class is $t$-close if the distance between the distribution of a sensitive
attribute in this class and the distribution of the attribute in the whole dataset is no more than a
threshold $t$.
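
As a toy illustration of the elimination-based idea, the snippet below generalizes two quasi-identifiers (age and zip code) and checks whether every quasi-identifier combination appears at least k times; the generalization rules and the data are made up for illustration and do not come from [37, 38, 39].

```python
from collections import Counter

records = [
    {"age": 34, "zip": "15213", "diagnosis": "flu"},
    {"age": 36, "zip": "15217", "diagnosis": "hiv"},
    {"age": 52, "zip": "95120", "diagnosis": "flu"},
    {"age": 57, "zip": "95123", "diagnosis": "asthma"},
]

def generalize(record):
    """Replace exact quasi-identifiers by coarser buckets (age range, zip prefix)."""
    lo = (record["age"] // 10) * 10
    return (f"{lo}-{lo + 9}", record["zip"][:3] + "**")

def is_k_anonymous(rows, k):
    """Every generalized quasi-identifier tuple must occur at least k times."""
    return all(c >= k for c in Counter(generalize(r) for r in rows).values())

# The released view keeps the sensitive attribute but only generalized quasi-identifiers.
released = [{"quasi_id": generalize(r), "diagnosis": r["diagnosis"]} for r in records]
assert is_k_anonymous(records, k=2)
```
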
Examples of emerging anonymization-based PPML solutions include [40, 98], which focus on secure
or privacy-preserving federated gradient-boosted tree models. Yang et al. [40] employ a modified
$k$-anonymity-based data aggregation method that computes the gradient and hessian by projecting the
original data in each feature to avoid privacy leakage, instead of directly transmitting all exact data for
each feature. Furthermore, Ong et al. [98] propose adaptive histogram-based federated gradient-
boosted trees via a data surrogate representation approach that is compatible with the $k$-anonymity method
or the differential privacy mechanism.

5.1.2 Perturbation-based Approaches
We present two typical perturbation-based approaches: differential privacy mechanisms and sketching
methods.
Differential privacy based perturbation: The conventional perturbation-based privacy-preserving
data publishing approaches mainly rely on the $\epsilon$-differential privacy technique [43, 44, 45]. Formally,
according to [97, 44], the definition of $(\epsilon, \delta)$-differential privacy is as follows: a randomized
mechanism $\mathcal{M} : \mathcal{D} \rightarrow \mathcal{R}$ with domain $\mathcal{D}$ and range $\mathcal{R}$ satisfies $(\epsilon, \delta)$-differential privacy if for any
two adjacent inputs $d, d' \in \mathcal{D}$ and for any subset of outputs $S \subseteq \mathcal{R}$, it holds that
$$ \Pr[\mathcal{M}(d) \in S] \leq e^{\epsilon} \cdot \Pr[\mathcal{M}(d') \in S] + \delta. \qquad (6) $$
The additive term $\delta$ allows for the possibility that plain $\epsilon$-differential privacy is broken with probability
$\delta$ (which is preferably smaller than $1/|\mathcal{D}|$). Usually, a paradigm for approximating a deterministic
function $f : \mathcal{D} \rightarrow \mathbb{R}$ with a differentially private mechanism is via additive noise calibrated to the
function's sensitivity $S_f$, defined as the maximum of the absolute distance $|f(d) - f(d')|$ over adjacent inputs.
The representative and common additive-noise mechanisms for real-valued functions are the Gaussian
mechanism and the Laplace mechanism, respectively defined as follows:
$$ \mathcal{M}_{\text{Gauss}}(d; f, \epsilon, \delta) = f(d) + \mathcal{N}(0, S_f^2 \sigma^2), \qquad (7) $$
$$ \mathcal{M}_{\text{Lap}}(d; f, \epsilon) = f(d) + \text{Lap}(0, S_f / \epsilon). \qquad (8) $$

The typical usage of differential privacy in PPML solutions lies in two directions: (i) directly
applying the above-mentioned additive-noise mechanisms to the raw dataset when publishing
data, as illustrated in [99, 100]; (ii) transforming the original training method into a differentially
private training method so that the trained model has an $\epsilon$-differential privacy guarantee when the model
is published, as illustrated in [34, 50, 51].
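
For direction (i), the sketch below applies the Laplace mechanism of Eq. (8) to a simple counting query, whose sensitivity is 1 because adding or removing one record changes the count by at most one; the function name and parameters are illustrative.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Release a real-valued query result with epsilon-differential privacy (Eq. (8))."""
    rng = np.random.default_rng() if rng is None else rng
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example: privately release how many records have age > 40.
ages = np.array([34, 29, 51, 47, 62])
noisy_count = laplace_mechanism(float((ages > 40).sum()), sensitivity=1.0, epsilon=0.5)
```
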
Sketching-based perturbation: Sketching is a simple, approximate approach for summarizing data
streams by building a probabilistic data structure that serves as a frequency table of events, similar
to counting Bloom filters. Recent theoretical advances [101, 102] have shown that differential
privacy is achievable on sketches with additional modifications. For instance, the work in [103]
focuses on privacy-preserving collaborative filtering, a popular technique for recommendation
systems, and uses sketching techniques to implicitly provide differential privacy guarantees by
taking advantage of the inherent randomness of the data structure. Recent work reported in
[41] proposes a novel sketch-based framework for distributed learning, where the
transmitted messages are compressed via sketches to achieve communication efficiency and provable privacy benefits
simultaneously.
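
To show the kind of probabilistic frequency table these sketching approaches build on, here is a minimal count-min sketch; the width, depth, and hashing scheme are illustrative choices, and the snippet does not include the differential-privacy modifications discussed in [101, 102, 103, 41].

```python
import numpy as np

class CountMinSketch:
    """Minimal count-min sketch: a depth x width table of hashed counters."""
    def __init__(self, width=256, depth=4, seed=0):
        self.width, self.depth = width, depth
        self.table = np.zeros((depth, width), dtype=np.int64)
        self.seeds = np.random.default_rng(seed).integers(1, 2**31, size=depth)

    def _index(self, item, row):
        return hash((int(self.seeds[row]), item)) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row, self._index(item, row)] += count

    def estimate(self, item):
        # Each row only overestimates, so the minimum bounds collision error.
        return min(self.table[row, self._index(item, row)] for row in range(self.depth))

cms = CountMinSketch()
for token in ["a", "b", "a", "c", "a"]:
    cms.add(token)
assert cms.estimate("a") >= 3
```
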
In short, the traditional anonymization mechanisms and perturbation-based approaches were designed
to tackle general data publishing problems; nevertheless, those techniques remain relevant in
the domain of PPML. In particular, the differential privacy technique has been widely adopted in
recent privacy-preserving deep learning and privacy-preserving federated learning systems such as
those proposed in [34, 49, 50, 51, 55, 33]. Furthermore, differential privacy is not only employed as a
privacy-preserving approach but also shows promise in generating synthetic data [43, 104, 105]
and in emerging generative adversarial networks (GANs) [106, 107].

5.1.3 Confusion-based Approaches

The confusion-based approach mainly refers to cryptographic techniques that confuse the raw
data to achieve a much stronger privacy guarantee (i.e., confidential-level privacy) than traditional
anonymization mechanisms and perturbation-based approaches. Existing cryptographic approaches
for publishing data for machine learning training include two directions: (i) applying
traditional symmetric encryption schemes such as AES, associated with garbled-circuits-
based secure multi-party computation protocols [108, 109, 110], or authenticated encryption,
associated with customized secure computation protocols [36]; (ii) applying recent advanced
modern cryptosystems such as homomorphic encryption schemes [63, 64, 65, 66, 67] and functional
encryption schemes [68, 69, 70, 71, 72, 73, 74] that include the algorithms necessary to compute over
ciphertext, such that a party with the issued key is able to acquire the computation result. Typical
PPML systems such as the works proposed in [94, 111] can be classified into the first direction of
the crypto-based data publishing approach, while more and more recent works such as those proposed in
[55, 56, 33, 57, 58, 59, 60, 61, 62] follow direction (ii).
In particular, the crypto-based approaches cannot work independently and are usually paired with
corresponding secure processing approaches, since the data receiver is expected to learn only the data
processing result rather than the original data. The crypto-based approaches provide a promising
candidate for data publishing, but the discussion of these approaches mainly concerns
how to share the one-time symmetric encryption keys or how to process the
encrypted data; hence, more details are presented in the next section.

5.2 Type II: Data Processing Approaches

Based on the different data/model publishing approaches, the data processing approaches for training and
inference fall into two corresponding categories: normal computation and secure computation. As with the Type
I approaches discussed in Section 5.1, if the data is published using traditional anonymization mech-
anisms or perturbation-based approaches - where the personal identifiers in the data are eliminated,
or the statistical result of the data is perturbed by adding differential privacy noise or building a
probabilistic data structure - the subsequent training computation is the same as the training computa-
tion in vanilla machine learning systems. Thus, the privacy-preserving data processing approaches discussed here
mainly refer to the secure computation that occurs in the training and inference phases.
Secure computation problems and the corresponding solutions were initiated by Andrew Yao in
1982 with a garbled-circuits protocol for two-party computation problems [112]. The primary goal of
secure computation is to enable two or more parties to evaluate an arbitrary function of their
inputs without revealing anything to either party except the function's output. Based on the number
of enrolled participants, secure computation approaches can be classified into basic
secure two-party computation (2PC) and secure multi-party computation (MPC or SMC). From the
perspective of the threat model (a.k.a., security model or security guarantee), secure computation
protocols provide two types of security according to the adversary setting: semi-honest (passive)
security and malicious (active) security. We refer the reader to [113, 114] for a detailed systematization of
knowledge on general secure multi-party computation solutions and the corresponding threat models.
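
Before turning to the specific techniques, the following sketch shows additive secret sharing over a prime field, the simplest building block behind the additive-mask and MPC approaches discussed next; the modulus and the two-party semi-honest setting are illustrative assumptions.

```python
import secrets

P = 2**61 - 1  # a public prime modulus (illustrative choice)

def share(secret, num_parties):
    """Split a secret into additive shares that sum to the secret modulo P."""
    shares = [secrets.randbelow(P) for _ in range(num_parties - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

def reconstruct(shares):
    return sum(shares) % P

# Two parties jointly compute the sum of their private inputs: each splits its
# input, exchanges one share, and only the aggregate is ever revealed.
alice, bob = 42, 58
a_shares, b_shares = share(alice, 2), share(bob, 2)
sum_shares = [(a + b) % P for a, b in zip(a_shares, b_shares)]
assert reconstruct(sum_shares) == (alice + bob) % P
```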
