Scalable Privacy-Preserving Distributed Learning

Proceedings on Privacy Enhancing Technologies ; 2021 (2):323–347

David Froelicher*, Juan R. Troncoso-Pastoriza, Apostolos Pyrgelis, Sinem Sav, Joao Sa Sousa,
Jean-Philippe Bossuat, and Jean-Pierre Hubaux

Abstract: In this paper, we address the problem of privacy-preserving distributed learning and the evaluation of machine-learning models by analyzing it in the widespread MapReduce abstraction that we extend with privacy constraints. We design spindle (Scalable Privacy-preservINg Distributed LEarning), the first distributed and privacy-preserving system that covers the complete ML workflow by enabling the execution of a cooperative gradient-descent and the evaluation of the obtained model and by preserving data and model confidentiality in a passive-adversary model with up to N−1 colluding parties. spindle uses multiparty homomorphic encryption to execute parallel high-depth computations on encrypted data without significant overhead. We instantiate spindle for the training and evaluation of generalized linear models on distributed datasets and show that it is able to accurately (on par with non-secure centrally-trained models) and efficiently (due to a multi-level parallelization of the computations) train models that require a high number of iterations on large input data with thousands of features, distributed among hundreds of data providers. For instance, it trains a logistic-regression model on a dataset of one million samples with 32 features distributed among 160 data providers in less than three minutes.

Keywords: federated learning, multiparty homomorphic encryption, decentralized system, generalized linear models

DOI 10.2478/popets-2021-0030
Received 2020-08-31; revised 2020-12-15; accepted 2020-12-16.

*Corresponding Author: David Froelicher: Laboratory for Data Security (LDS), EPFL, E-mail: david.froelicher@epfl.ch
Juan R. Troncoso-Pastoriza, Apostolos Pyrgelis, Sinem Sav, Joao Sa Sousa, Jean-Philippe Bossuat, Jean-Pierre Hubaux: Laboratory for Data Security (LDS), EPFL, E-mail: name.surname@epfl.ch

1 Introduction

The training of machine-learning (ML) models usually requires large and diverse datasets [133]. In many domains, such as medicine and finance, assembling sufficiently large datasets has proven difficult [128] and often requires the sharing of data among multiple data providers. This is particularly true in medicine, where patients' data are spread among multiple entities: For example, for rare diseases, one hospital might have only a few patients, whereas a medical study requires hundreds of them to produce significant results. Data sharing among many entities, which can be located in multiple countries, is hence required. However, when the data are sensitive and/or personal, they are particularly difficult to share. Data sharing is highly restricted by legal regulations, such as GDPR [43] in Europe. The financial and reputational consequences of a data breach often make the risk of data sharing higher than its potential benefits. Hence, it is often impossible to obtain sufficient data to train ML models that are key enablers in medical research [90], financial analysis [115], and many other domains.

To address this issue, privacy-preserving solutions are gaining interest as they can be key enablers for ML with sensitive data. Many solutions have been proposed for secure predictions that use pre-trained models [12, 17, 45, 61, 78, 104–106]. However, the secure training of ML models, which is much more computationally demanding, has been less studied. Some centralized solutions [7, 15, 23, 29, 50, 60, 63, 66] that rely on homomorphic encryption (HE) were proposed. They have the advantage of being straightforward to implement but require individual records to be transferred out of the control of their owners, which is often not possible, e.g., due to data protection legislation [62, 77]. Also, the data are moved to a central repository, which can become a single point of failure. Secure multiparty computation solutions (SMC) proposed for this scenario [3, 28, 42, 44, 57, 88, 95] often assume that a limited number of computing parties are honest-but-curious and non-colluding. These assumptions might not hold when the data are sensitive
and/or when the parties have competing interests. In contrast, homomorphic encryption-based (HE) or hybrid (HE and SMC) solutions [41, 131] that assume a malicious threat model (e.g., the Anytrust model [123]) focus on limited ML operations (e.g., the training of regularized linear models with a low number of features) and are not quantum secure. Recent advances in quantum computing [47, 56, 114, 127] have made this technology a potential threat, in a not-so-far future, for existing cryptographic solutions [89]. Google recently announced that they have reached "quantum supremacy" [49]. Even though quantum computers are still far from being able to break state-of-the-art cryptoschemes, we note that certain data (e.g., genomics) remain sensitive over a long period and will be at risk in the future.

Finally, federated learning, a non-cryptographic approach for privacy-preserving training of ML models, has recently gained interest. The data remain under the control of their owners and a server coordinates the training by sending the model directly to the data owners, which then update the model with their data. The updated models from multiple participants are averaged to obtain the global model [68, 82]. Recent works have shown that sharing intermediate models with a coordinating server, or among the participants, can lead to various privacy attacks, e.g., extracting participants' inputs [54, 122, 132] or membership inference [84, 92]. To address these problems, multiple works [72, 83, 111] rely on a differentially private mechanism to obfuscate the intermediate values. However, this obfuscation decreases the data and model utility, whereas the training of accurate models requires high privacy budgets and the achieved privacy level remains unclear [58].

Existing cryptographic distributed solutions are practical with only a small number of parties, and most of the aforementioned solutions focus either on training or on prediction. They neither consider the complete ML workflow nor enable the training of a model that remains secret and enables oblivious prediction on confidential data. In many cases, the trained model is as sensitive as the data on which it is trained, and the use of the model after the training has to be tightly controlled. ML is used in very competitive domains, and a balance has to be found between collaboration and competition [90, 113]. For example, entities that collaborate to train a ML model should equally benefit from the resulting model.

In this paper, we address the problem of privacy-preserving learning and prediction among multiple parties, i.e., data providers (DPs), that do not trust each other. To address this issue, we design a solution that uses the MapReduce abstraction [31] that is often used to define distributed ML tasks [27, 118]. MapReduce defines parallel and distributed algorithms in a simple and well-known abstraction: prepare (data preparation), map (distributed computations executed independently by multiple nodes or machines), combine (combination of the map results, e.g., aggregation) and reduce (computation on the combined results). We build on and extend this abstraction to determine and delimit which information, e.g., map outputs, has to be protected, in order to design a decentralized privacy-preserving system for ML training and prediction. The model is locally trained by the DPs (map) and the results are iteratively combined (combine) to update the global model (reduce). We exploit the partitioned (distributed) data to enable DPs to keep control of their respective data, and we distribute the computation to provide an efficient solution for the training of ML models on confidential data. After the training, the model is kept secret from all entities and is obliviously and collectively used to provide predictions on confidential data that are known only to the entity requesting the prediction. We remark that differential-privacy-based federated-learning solutions [2, 22, 34, 55, 59, 64, 72, 99, 111, 117] follow the same model, i.e., they can be defined according to the MapReduce abstraction. However, most solutions introduce a trade-off between accuracy and privacy [58], and do not provide data and model confidentiality simultaneously. On the contrary, our solution uses a different paradigm in which, similarly to non-secure solutions, the accuracy is traded off with the performance (e.g., number of iterations), but not with privacy.

We propose spindle (Scalable Privacy-preservINg Distributed LEarning), a system that enables the privacy-preserving, distributed (cooperative) execution of the widespread stochastic mini-batch gradient-descent (SGD) on data that are stored and controlled by multiple DPs. spindle builds on a state-of-the-art multiparty, lattice-based, quantum-resistant cryptographic scheme to ensure data and model confidentiality in the passive-adversary model in which all-but-one DPs can be dishonest. spindle is meant to be a generic and widely applicable system that supports the SGD-based training of many different ML models. This includes, but is not limited to, support vector machines, graphical models, generalized linear models and neural networks [33, 48, 69, 116, 130]. For concreteness and comparison with existing works, we instantiate spindle for the training of and prediction on generalized linear models (GLMs) [93] (e.g., linear, logistic and multinomial logistic regressions). GLMs are easily interpretable,
capture complex non-linear relations (e.g., logistic regression), and are widely used in many domains such as finance, engineering, environmental studies and healthcare [76]. In a realistic scenario where a dataset of 11,500 samples and 90 features is distributed among 10 DPs, spindle efficiently trains a logistic regression model in less than 54 seconds, achieving an accuracy of 83.9%, equivalent to a non-secure centralized solution. The distribution of the workload enables spindle to efficiently cope with a large number of DPs (parties), as its execution time is practically independent of it. spindle handles a large number of features by optimizing the use of the cryptosystem's packing capabilities and by exploiting single-instruction multiple-data (SIMD) operations. It is able to perform demanding training tasks, with a high number of iterations and thus high-depth computations, by relying on the multiparty cryptoscheme's ability to collectively refresh a ciphertext with no significant overhead. As shown by our evaluation, these properties enable spindle to support training on large and complex data such as imaging or medical datasets. Moreover, spindle's scalability over multiple dimensions (features, DPs, data) represents a notable improvement with respect to state-of-the-art secure solutions [41, 131].

In this work, we make the following contributions:

– We analyze the problem of privacy-preserving distributed training and of the evaluation of ML models by extending the widespread MapReduce abstraction with privacy constraints. Following this abstraction, we instantiate spindle, the first operational and efficient distributed system that enables the privacy-preserving execution of a complete machine-learning workflow through the use of a cooperative gradient descent on a dataset distributed among many data providers.
– We propose multiple optimizations that enable the efficient use of a quantum-resistant multiparty (N-party) cryptographic scheme by relying on parallel computations, SIMD operations, efficient collective operations and optimized polynomial approximations of the models' activation functions, e.g., sigmoid and softmax.
– We propose a method for the parameterization of spindle by capturing the relations among the security and the learning parameters in a graphical model.
– We evaluate spindle against centralized and decentralized secure solutions and demonstrate its scalability and accuracy.

To the best of our knowledge, spindle is the first operational system that provides the aforementioned features and security guarantees.

2 Related Work

Privacy-Preserving Training of Machine Learning Models. Some works have focused on securely outsourcing the training of linear ML models to the cloud, typically by using homomorphic encryption (HE) techniques [7, 15, 29, 50, 63, 66, 100]. For instance, Graepel et al. [50] outsource the training of a linear classifier by employing somewhat HE [38], whereas Aono et al. [7] approximate logistic regression and outsource its computation to the cloud by using additive HE [97]. Jiang et al. [60] present a framework for outsourcing logistic regression training to public clouds by combining HE with hardware-based security techniques (i.e., Software Guard Extensions). In spindle, we consider a substantially different setting where the sensitive data are distributed among multiple (untrusted) data providers.

Along the research direction of privacy-preserving distributed learning, most works operate on the two-server model, where data owners encrypt or secret-share their data among two non-colluding servers that are responsible for the computations. For instance, Nikolaenko et al. [95] combine additive homomorphic encryption (AHE) and Yao's garbled circuits [125] to enable ridge regression on data that are horizontally partitioned among multiple data providers. Gascon et al. [42] extend Nikolaenko et al.'s work [95] to the case of vertically partitioned datasets and improve its computation time by employing a novel conjugate gradient descent (GD) method, whereas Giacomelli et al. [44] further reduce computation and communication overheads by using only AHE. Akavia et al. [3] improve the performance of Giacomelli et al.'s protocols [44] by performing linear regression on packed encrypted data. Mohassel and Zhang [88] develop techniques to handle secure arithmetic operations on decimal numbers, and employ stochastic GD, which, along with multiparty-computation-friendly alternatives for non-linear activation functions, supports the training of logistic regression and neural network models. Schoppmann et al. [108] propose data structures that exploit data sparsity to develop secure computation protocols for nearest neighbors, naive Bayes, and logistic regression classification. spindle differs from these approaches as it does
not restrict itself to the two non-colluding-server model, and focuses instead on N-party systems, with N > 2.

Other distributed and privacy-preserving ML approaches employ a three-server model and rely on secret-sharing techniques to train linear regressions [13], logistic regressions [26], and neural networks [87, 119]. However, such solutions are tailored to the three-party server model and assume an honest majority among the computing parties. An honest majority is also required in the recent work of Rachuri and Suresh [103], who improve on Mohassel and Rindal's [87] performance by extending its techniques to the four-party setting. Other works focus on the training of ML models among N parties (N > 4), with stronger security assumptions, i.e., each party trusting only itself. For instance, Corrigan-Gibbs and Boneh [28] present Prio, which relies on secret-sharing to enable the training of linear models, and Zheng et al. [131] propose Helen, a system that uses HE [97] and verifiable secret sharing [30] to execute ADMM [19] (alternating direction method of multipliers, a convex optimization approach for distributed data), which supports regularized linear models. Similarly, Froelicher et al. [41] employ HE [35], along with encoding techniques, to enable the training of basic regression models and provide auditability with the use of zero-knowledge proofs. spindle enables better scalability in terms of the number of the model's features, the size of the dataset and the number of data providers, and it offers richer functionalities by relying on the generic and widely applicable SGD.

Another line of research considers the use of differential privacy for training ML models. Early works [2, 22] focus on a centralized setting where a trusted party holds the data, trains the ML model, and performs the noise addition. Differential privacy has also been envisioned in distributed settings where, to collectively train a model, multiple parties exchange or send differentially private model parameters to a central server [34, 55, 72, 111]. However, the training of an accurate collective model requires very high privacy budgets and, as such, it is unclear what privacy protection is achieved in practice [54, 58, 122]. To this end, some works consider hybrid approaches where differential privacy is combined with HE [64, 99] or multi-party computation techniques [59, 117]. We consider differential privacy as an orthogonal approach; these techniques can be combined with our solution to protect the resulting models and their predictions from inference attacks [39, 112], see Section 8.1.

Privacy-Preserving Prediction on ML Models. Another line of work is focused on privacy-preserving ML prediction, where a party (e.g., a cloud provider) holds an already trained ML model on which another party (e.g., a client) wants to evaluate its private input. In this setting, Bost et al. [17] use additive HE techniques to evaluate naive Bayes and decision tree classifiers, whereas Gilad-Bachrach et al. [45] employ fully homomorphic encryption (FHE) [16] to perform prediction on a small neural network. The computation overhead of these approaches has been further optimized by using multi-party computation (MPC) techniques [104, 106], or by combining HE and MPC [61, 78, 102]. Riazi et al. [105] evaluate deep neural networks by employing garbled circuits and oblivious transfer, in combination with binary neural networks. Boemer et al. [12] propose nGraph-HE2, a compiler that enables service providers to deploy their trained ML models in a privacy-preserving manner. Their method uses HE, or a hybrid scheme that combines HE with MPC, to compile ML models that are trained with well-known frameworks such as TensorFlow [1] and PyTorch [98]. The scope of our work is broader than these approaches, as spindle accounts not only for the private evaluation of machine-learning models but also for their privacy-preserving training in the distributed setting.

3 Secure Federated Training and Evaluation

We first introduce the problem of privacy-preserving distributed training and evaluation of machine-learning (ML) models. Then, we present a high-level overview and architecture of a solution that satisfies the security requirements of the presented problem. In Section 4, we present spindle, a system that enables the privacy-preserving and distributed execution of a stochastic gradient-descent. We instantiate our solution for the training and evaluation of the widely used Generalized Linear Models [93]. In the rest of this paper, matrices are denoted by upper-case bold characters and vectors by lower-case bold characters; the i-th row of a matrix X is depicted as X[i, ·], and its i-th column as X[·, i]. Similarly, the i-th element of a vector y is denoted by y[i]. We provide a list of recurrent symbols in Table 6 (see Appendix G).
3.1 Problem Statement

We consider a setting where a dataset (X_{n×c}, y_n), with X_{n×c} a matrix of n records and c features and y_n a vector of n labels, is distributed among a set of data providers, i.e., S = {DP_1, ..., DP_|S|}. The dataset is horizontally partitioned, i.e., each data provider DP_i holds a partition of n_i samples (X^(i), y^(i)), with Σ_{i=1}^{|S|} n_i = n. A querier, which can also be a data provider (DP), requests the training of a ML model on the distributed dataset (X_{n×c}, y_n) or the evaluation of an already trained model on its input (X', ·).

We assume that the DPs are willing to contribute their respective data to train and to evaluate ML models on the distributed dataset. To this end, DPs are all interconnected and organized in a topology that enables efficient execution of the computations, e.g., in a tree structure as depicted in Figure 1. Even though the DPs wish to collaborate for the execution of ML workflows, they do not trust each other. As a result, they seek to protect the confidentiality of their data (used for training and evaluation) and of the collectively learned model. More formally, we require that the following privacy properties hold in a passive-adversary model in which all-but-one DPs can collude, i.e., the DPs follow the protocol, but up to |S| − 1 DPs might share among them the information they observe during the execution, to extract information about the other DPs' inputs.

(a) Data Confidentiality: The training data of each data provider DP_i, i.e., (X^(i), y^(i)), and the querier's evaluation data (X', ·) should remain known only to their respective owners. To this end, data confidentiality is satisfied as long as the involved parties (DPs and querier) do not obtain any information about other parties' inputs other than what can be deduced from the output of the process of training or evaluating a model.

(b) Model Confidentiality: During the training process, no data provider DP_i should gain more information about the model that is being trained than what it can learn from its own input data (X^(i), y^(i)). During prediction, the querier should not learn anything more about the model than what it can infer from its input data (X', ·) and the corresponding predictions y'.

We remark here that input correctness and computation correctness are not part of the problem requirements, i.e., we assume that DPs input correct data and do not perform wrong computations. We discuss possible countermeasures against malicious DPs in Section 8.1.

Fig. 1. spindle's model. Thick arrows represent a possible (efficient) query-execution flow.

3.2 Solution Overview

To address the problem of privacy-preserving distributed learning, we leverage the MapReduce abstraction, which is often used to capture the parallel and repetitive nature of distributed learning tasks [27, 118]. We complement this abstraction with a protection mechanism P(·); P(x) denotes that value x has to be protected to satisfy data and model confidentiality (Section 3.1). We present the extended MapReduce abstraction in Protocol 1. In prepare, the data providers (DP_i ∈ S) pre-process their data (X^(i), y^(i)); they agree on the learning parameters and on one data provider that plays the role of DP_R and is then responsible for the execution of reduce. As explained later, DP_R only manipulates protected data and is subject to the same security constraints as any other DP. We discuss the choice of DP_R and its availability in Section 8. Each DP_i then iteratively (g iterations) trains its local model (P(W^(i,j)) at iteration j) on its data in map. The DPs combine their local models in combine (through an application-dependent function C(·)), and update the global model P(W_G^(·,j)) in reduce. To capture the complete ML workflow, we extend the MapReduce architecture with a prediction phase in which predictions P(y') are computed from the querier's protected evaluation data P(X') by using the (protected) global model P(W_G^(·,g)) obtained during the training.

Protocol 1 Extended MapReduce Abstraction.
training: S receives a query from the Querier and outputs P(W_G^(·,g))
1: Each DP_i has (X^(i), y^(i))
2: DPs appoint DP_R and agree on learning params.                        – prepare
3: Each DP_i ∈ S initializes its local model W^(i,0)
4: for j = 1, ..., g do
5:   Each DP_i ∈ S computes:                                             – map
     P(W^(i,j)) ← Map((X^(i), y^(i)), P(W_G^(·,j−1)), P(W^(i,j−1)))
6:   Each DP_i sends P(W^(i,j)) to DP_R                                  – combine
7:   DP_R: P(W^(·,j)) ← C(P(W^(i,j))), ∀ DP_i ∈ S
8:   DP_R: P(W_G^(·,j)) ← Red(P(W_G^(·,j−1)), P(W^(·,j)))                – reduce
prediction: DP_R receives P(X') from the Querier and uses P(W_G^(·,g)) to compute P(y'), which is sent back to the Querier
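To illustrate the data flow of Protocol 1, the following Go sketch mirrors the extended MapReduce loop. The Protected type is a stand-in for P(·) (in spindle, a collectively encrypted ckks ciphertext); all names are hypothetical and do not come from the actual implementation.

```go
// Sketch of the extended MapReduce workflow of Protocol 1 (illustrative only).
package mapreduce

// Protected stands in for P(·): in spindle it would be a collectively
// encrypted ciphertext; here it is an opaque wrapper around cleartext weights.
type Protected struct{ w []float64 }

// DataProvider holds one horizontal partition (X^(i), y^(i)).
type DataProvider struct {
	X [][]float64
	Y []float64
}

// Map abstracts the m local gradient-descent iterations of the map phase.
func (dp *DataProvider) Map(global, local Protected) Protected {
	out := Protected{w: append([]float64(nil), local.w...)}
	// ... m local iterations on (dp.X, dp.Y), guided by the global weights ...
	return out
}

// Combine is the application-dependent C(·); in spindle it is homomorphic addition.
func Combine(parts []Protected) Protected {
	sum := Protected{w: make([]float64, len(parts[0].w))}
	for _, p := range parts {
		for i, v := range p.w {
			sum.w[i] += v
		}
	}
	return sum
}

// Reduce updates the global weights (see reduce / Equation 2 later in Section 4).
func Reduce(global, combined Protected, alpha, rho float64, nDPs int) Protected {
	out := Protected{w: make([]float64, len(global.w))}
	for i := range out.w {
		out.w[i] = (1-alpha*rho*float64(nDPs))*global.w[i] + alpha*rho*combined.w[i]
	}
	return out
}

// Train runs g global iterations of map, combine and reduce over all DPs.
func Train(dps []*DataProvider, c, g int, alpha, rho float64) Protected {
	global := Protected{w: make([]float64, c)}
	locals := make([]Protected, len(dps))
	for i := range locals {
		locals[i] = Protected{w: make([]float64, c)}
	}
	for j := 1; j <= g; j++ {
		parts := make([]Protected, len(dps))
		for i, dp := range dps {
			parts[i] = dp.Map(global, locals[i])
			locals[i] = parts[i]
		}
		global = Reduce(global, Combine(parts), alpha, rho, len(dps))
	}
	return global
}
```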

4 SPINDLE Design

Following the extended MapReduce abstraction described in Section 3.2, we design a system, named spindle, that enables the privacy-preserving execution of the widely applicable cooperative gradient descent [120, 121], which is used to minimize many cost functions in machine learning [69, 116, 130]. We instantiate this system for the training of and prediction on Generalized Linear Models [93]. To implement the protection mechanism P(·), it builds on a multiparty fully homomorphic encryption scheme. We introduce these concepts in Section 4.1. Then, in Section 4.2, we describe how spindle instantiates the phases of the extended MapReduce abstraction and how we address the collective data-processing on the distributed dataset through secure and interactive protocols. We demonstrate how training is performed, notably by executing the gradient-descent operations under homomorphic encryption, and how predictions are executed on encrypted models. We present the detailed cryptographic operations in Section 5 and analyze spindle's security in Appendix C.

4.1 Background

Cooperative Gradient Descent. We rely on a distributed version of the popular mini-batch stochastic gradient descent (SGD) [69, 116, 130]. In the standard version of SGD, the goal is to solve min_w [F(w) := (1/n) Σ_{φ=1}^{n} f(w; X[φ, ·])], where f(·) is the loss function defined by the learning model, w ∈ R^c are the model parameters, and X[φ, ·] is the φ-th data sample (row) of X. The model is then updated by m iterations w^(l) = w^(l−1) − α[ζ(w^(l−1); B^(l))], for l = 1, ..., m, with α the learning rate, B^(l) a randomly sampled sub-matrix of X of size b × c, and ζ(w; B) = B^T(σ(Bw) − I(z)), where z is the vector of labels corresponding to the batch B. The activation function σ and I(·) are both model-dependent, e.g., for a logistic regression σ is the sigmoid and I(·) is the identity.

We rely on the cooperative SGD (CSGD) proposed by Wang and Joshi [120, 121], due to its properties; in particular: (i) modularity, as it can be synchronous or asynchronous, and can be combined with classic gradient-descent convergence optimizations such as Nesterov accelerated SGD [94]; (ii) applicability, as it accommodates any ML model that can be trained with SGD and enables the distribution of any SGD-based solution; (iii) it guarantees a bound on the error-convergence depending on the distributed parameters, e.g., the number of iterations and the update function for the global weights [18, 120, 121, 129]; and (iv) it has been shown to work well even in the case of non-independent-and-identically-distributed (non-i.i.d.) data partitions [81, 120, 121]. The data providers (DPs), each of which owns a part of the dataset, locally perform multiple iterations of the SGD before aggregating their model weights into the global model weights. The global weights are included in subsequent local DP computations to avoid that they learn, or descend, in the wrong direction. For simplicity, we present spindle with the synchronous CSGD version, where the DPs perform local model updates simultaneously. For each DP_i, the local update rule at global iteration j and local iteration l is:

w^(i,j,l) = w^(i,j,l−1) − α·ζ(w^(i,j,l−1); B^(l)) − α·ρ·(w^(i,j,l−1) − w_G^(·,j−1)),    (1)

where w_G^(·,j−1) are the global weights from the last global update iteration j−1, α is the learning rate and ρ, the elastic rate, is the parameter that controls how much the data providers can diverge from the global model. The set of DPs S performs m local iterations between each update of the global model, which is updated at global iteration j with a moving average by:

w_G^(·,j) = (1 − |S|·α·ρ)·w_G^(·,j−1) + α·ρ·Σ_{i=1}^{|S|} w^(i,j,m).    (2)
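To make Equations (1) and (2) concrete, here is a cleartext Go sketch of one local CSGD update and of the global moving-average update, for a logistic-regression loss (σ = sigmoid, I = identity). This is illustrative only: in spindle these weight vectors are ckks ciphertexts, the batch matrices stay in cleartext at each DP, and σ is replaced by a polynomial approximation (Section 4.2); the function names are ours.

```go
// Cleartext sketch of the cooperative SGD updates of Equations (1) and (2).
package csgd

import "math"

func sigmoid(x float64) float64 { return 1.0 / (1.0 + math.Exp(-x)) }

// gradient computes ζ(w; B) = B^T (σ(Bw) − z) for a batch (B, z) of size b×c.
func gradient(w []float64, B [][]float64, z []float64) []float64 {
	c := len(w)
	grad := make([]float64, c)
	for k := range B {
		var dot float64
		for e := 0; e < c; e++ {
			dot += B[k][e] * w[e]
		}
		err := sigmoid(dot) - z[k]
		for e := 0; e < c; e++ {
			grad[e] += B[k][e] * err
		}
	}
	return grad
}

// localUpdate implements Equation (1) for one local iteration l at DP_i.
func localUpdate(w, wGlobal []float64, B [][]float64, z []float64, alpha, rho float64) []float64 {
	grad := gradient(w, B, z)
	out := make([]float64, len(w))
	for e := range w {
		out[e] = w[e] - alpha*grad[e] - alpha*rho*(w[e]-wGlobal[e])
	}
	return out
}

// globalUpdate implements Equation (2): a moving average over the |S| local models.
func globalUpdate(wGlobal []float64, locals [][]float64, alpha, rho float64) []float64 {
	s := float64(len(locals))
	out := make([]float64, len(wGlobal))
	for e := range wGlobal {
		sum := 0.0
		for _, w := range locals {
			sum += w[e]
		}
		out[e] = (1-s*alpha*rho)*wGlobal[e] + alpha*rho*sum
	}
	return out
}
```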

Generalized Linear Models (GLMs). GLMs [93] are a generalization of linear models where the linear predictor, i.e., the combination Xw of the feature matrix X and the weights vector w, is related to a vector of class labels y by an activation function σ such that E(y) = σ^{−1}(Xw), where E(y) is the mean of y. In this work, we consider the widely used linear (i.e., σ(Xw) = Xw), logistic (i.e., σ(Xw) = 1/(1 + e^{−Xw})) and multinomial (i.e., σ(Xw_λ) = e^{Xw_λ} / Σ_{j∈cl} e^{Xw_j}, for λ ∈ cl) regression models. We remark that for multinomial regression, the weights are represented as a matrix W_{c×|cl|}, where c is the number of features, cl is the set of class labels and |cl| its cardinality. In the rest of the paper, unless otherwise stated, we define the operations on a single vector of weights w and we note that in the case of multinomial regression, they are replicated on the |cl| vectors of weights, i.e., on each column of W_{c×|cl|}.

Multiparty Homomorphic Encryption. For the protection mechanism of spindle, we rely on a multiparty (or distributed) fully-homomorphic encryption scheme [91] in which the secret key is distributed among the parties, while the corresponding collective public key pk is known to all of them. Thus, each party can independently compute on ciphertexts encrypted under pk, but all parties have to collaborate to decrypt a ciphertext. In spindle, this enables the data providers (DPs) to train a collectively encrypted model that cannot be decrypted as long as one DP is honest and refuses to participate in the decryption. As we show later, this multiparty scheme also enables DPs to collectively switch the encryption key of a ciphertext from pk to another public key without decrypting. In spindle, a collectively encrypted prediction result can thus be switched to the querier's public key, so that only the querier can decrypt the result.

Mouchet et al. [91] propose a multiparty version of the Brakerski-Fan-Vercauteren (bfv) lattice-based homomorphic cryptosystem [38] and introduce interactive (distributed) protocols for key generation DKeyGen(·), decryption DDec(·), and bootstrapping DBootstrap(·). We use an adaptation of this multiparty scheme to the Cheon-Kim-Kim-Song cryptosystem (ckks) [25] that enables approximate arithmetic, and whose security is based on the ring learning with errors (rlwe) problem [80]. ckks (see Appendix A) enables arithmetic over C^{N/2}; the plaintext and ciphertext spaces share the same domain R_Q = Z_Q[X]/(X^N + 1), with N a power of 2. Both plaintexts and ciphertexts are represented by polynomials of N coefficients (degree N − 1) in this domain. A plaintext/ciphertext encodes a vector of up to N/2 values.

Parameters: The ckks parameters are denoted by the tuple (N, ∆, η, mc), where N is the ring dimension, ∆ is the plaintext scale, or precision, by which any value is multiplied before being quantized and encrypted/encoded, η is the standard deviation of the noise distribution, and mc represents a chain of moduli {q_0, ..., q_L} such that Π_{ι∈{0,...,τ}} q_ι = Q_τ is the ciphertext modulus at level τ, with Q_L = Q, the modulus of fresh ciphertexts. Operations on a level-τ ciphertext ⟨v⟩ are performed modulo Q_τ, with ∆ always lower than the current Q_τ. Ciphertexts at level τ are simply vectors of polynomials in R_{Q_τ}, which we represent as ⟨v⟩ when there is no ambiguity about their level, and as {⟨v⟩, τ, ∆} otherwise. After performing operations that increase the noise and the plaintext scale, {⟨v⟩, τ, ∆} has to be rescaled (see the ReScale(·) procedure defined in Appendix A) and the next operations are performed modulo Q_{τ−1}. When reaching level 0, ⟨v⟩ has to be bootstrapped. The security of the cryptosystem depends on the choice of N, Q and η, which in this work are parameterized to achieve at least 128 bits of security.

(Distributed) Operations: A vector v of cleartext values can be encrypted with the public collective key pk and can be decrypted with the collaboration of all DPs (DDec(·) protocol, in which each DP_i uses its secret key sk_i). The DPs can also change a ciphertext encryption from the public key pk to another public key pk' without decrypting the ciphertext, by relying on the DKeySwitch(·) protocol. Each DP can independently add, multiply, rotate (i.e., inner-rotation of v), rescale (Rescale(·)) or relinearize (Relin(·)) a vector encrypted with pk. When two ciphertexts are multiplied together, the result has to be relinearized (Relin(·)) to preserve the ciphertext size. After multiple Rescale(·) operations, ⟨v⟩ has to be refreshed by a collective protocol, i.e., DBootstrap(·), which returns a ciphertext at level L. The dot product DM(·) of two encrypted vectors of size a can be executed by a multiplication followed by log2(a) inner-left rotations and additions. We list all the operations used in spindle and their properties in Appendix A.

4.2 SPINDLE Protocols

We first describe spindle's operations for training a Generalized Linear Model following Protocol 1. In this case, the model W is a vector of weights that we denote by w, and map corresponds to multiple local iterations of the gradient descent. Recall that in the case of multinomial regression, all operations are repeated for each label class λ ∈ cl.

4.2.1 TRAINING

PREPARE. The data providers (DPs) collectively agree on the training parameters: the maximum number of global (g) and local (m) iterations, and the learning parameters lp = {α, ρ, b}, where α is the learning rate, ρ the elastic rate, and b the batch size. The DPs also collectively initialize the cryptographic keys for the distributed ckks scheme by executing DKeyGen(·) (see Appendix A). Then, the DPs initialize their local weights and pre-compute operations that involve only their input data (αX^(i)I(y^(i)) and αX^(i)T). We discuss in Appendix F how the DPs can collaborate to standardize or normalize the distributed dataset (if needed) and check that their respective inputs are consistent, e.g., that they have data distribution homogeneity.
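The rotate-and-add pattern behind the DM(·) dot product, used throughout map (next) and prediction, can be pictured on cleartext slots as follows. This is a sketch under the assumption that the vector length is a power of two; rotateLeft stands in for the ckks inner rotation, and the names are ours, not Lattigo's.

```go
// Cleartext picture of the SIMD dot product DM(·): element-wise multiply,
// then log2(a) rotate-and-add steps leave the inner sum replicated in every slot.
package dmsketch

// rotateLeft returns v rotated left by k positions (stand-in for a ckks rotation).
func rotateLeft(v []float64, k int) []float64 {
	n := len(v)
	out := make([]float64, n)
	for i := range v {
		out[i] = v[(i+k)%n]
	}
	return out
}

// DotProduct assumes len(x) == len(y) and that the length is a power of two.
func DotProduct(x, y []float64) []float64 {
	prod := make([]float64, len(x))
	for i := range x {
		prod[i] = x[i] * y[i] // one (ciphertext) multiplication
	}
	for k := 1; k < len(prod); k *= 2 { // log2(a) rotations and additions
		rot := rotateLeft(prod, k)
		for i := range prod {
			prod[i] += rot[i]
		}
	}
	return prod // every slot now holds the dot product
}
```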
Protocol 2 map.
Each DP_i outputs ⟨w^(i,j)⟩ ← Map((X^(i), y^(i)), ⟨w_G^(·,j−1)⟩, ⟨w^(i,j−1)⟩)
1: ⟨w^(i,j,0)⟩ = ⟨w^(i,j−1)⟩
2: for l = 1, ..., m:
3:   Select batch (B, z) of b rows in (X^(i), y^(i))
4:   ⟨u[k]⟩ = DM(B[k, ·], ⟨w^(i,j,l−1)⟩), for k = 1, ..., b
5:   ⟨v[e]⟩ = DM(αB[·, e]^T, σ(⟨u⟩)), for e = 1, ..., c
6:   µ[e] = Σ_{k=1}^{b} αB[·, e]^T I(z[k]), for e = 1, ..., c
7:   ⟨w^(i,j,l)⟩ = ⟨w^(i,j,l−1)⟩ + µ − ⟨v⟩ − αρ(⟨w^(i,j,l−1)⟩ − ⟨w_G^(·,j−1)⟩)
8: ⟨w^(i,j)⟩ = RR(⟨w^(i,j,m)⟩)

MAP. As depicted in Protocol 2, the DPs execute m iterations of the cooperative gradient-descent local update (Section 4.1). The local weights of DP_i (i.e., ⟨w^(i,j,l−1)⟩) are updated at a global iteration j and a local iteration l by computing the gradient (Protocol 2, lines 4, 5, and 6), which is then combined with the current global weights ⟨w_G^(·,j−1)⟩ (Protocol 2, line 7) following Equation 1. These computations are performed on batches of b samples and c features. To ensure that the update of DP_i's local weights, i.e., the link between the ciphertexts ⟨w^(i,j−1)⟩ = ⟨w^(i,j,0)⟩ and ⟨w^(i,j,m)⟩, does not leak information about the DP's local data, ⟨w^(i,j,m)⟩ is re-randomized (RR(·)) at the end of map, i.e., DP_i adds to it a fresh encryption of 0. Note that in Protocol 2, line 5, the activation function σ(·) is computed on the encrypted vector ⟨u⟩ (or a matrix ⟨U⟩ in the case of multinomial). The exponential activation functions for logistic (i.e., sigmoid) and multinomial (i.e., softmax) regressions have to be approximated by polynomial functions to be evaluated on encrypted data by using the homomorphic properties of ckks. We rely on a least-squares polynomial approximation (LSPA) for the sigmoid, as it provides an optimal average mean-square error for uniform inputs in a specific interval, which is a reasonable assumption when the input distribution is not known. For softmax, we rely on a Chebyshev approximation (CA) to minimize the maximum approximation error and thus avoid that the function diverges on specific inputs. The approximation intervals can be empirically determined by using synthetic datasets with a distribution similar to the real ones, by computing the minimum and maximum input values over all DPs and features, or by relying on estimations based on the data distribution [53]. Protocol 3 takes as input an encrypted vector/matrix ⟨u⟩ or ⟨U⟩ and the type of the regression t (i.e., linear, logistic or multinomial). If t is linear, the protocol simply returns ⟨u⟩. Otherwise, if t is logistic, it computes the activated vector ⟨σ(u)⟩ by using the sigmoid's LSPA (apSigmoid(·)). If t is multinomial, it computes the activated matrix ⟨σ(U)⟩ using the softmax approximation, which is computed by the multiplication of two CAs, one for the numerator e^x (apSoftN(·)) and one for the denominator 1/(Σ_j e^{x_j}) (apSoftD(·)), each computed on different intervals. The polynomial approximation computation is detailed in Protocol 6 (Appendix B). To avoid an explosion of the exponential values in the softmax, a vector ⟨m⟩ that contains the approximated max (apMax(·)) value of each column of ⟨U⟩ is subtracted from all input values, i.e., from each ⟨U[λ, ·]⟩ with λ = 0, ..., |cl|. Similar to softmax, the approximation of the max function requires two CAs, and is detailed in Appendix B.

Protocol 3 Activation Function σ(·).
Func. σ(⟨u⟩ or ⟨U⟩, t) returns the activated ⟨σ(u)⟩ or ⟨σ(U)⟩
1: if t is Linear then ⟨σ(u)⟩ = ⟨u⟩
2: else if t is Logistic then
3:   ⟨σ(u)⟩ = apSigmoid(u)
4: else if t is Multinomial, input is a matrix ⟨U_{c×|cl|}⟩ then
5:   ⟨m⟩ = apMax(⟨U⟩)
6:   for λ ∈ cl:
7:     ⟨U'[λ, ·]⟩ = ⟨U[λ, ·]⟩ − ⟨m⟩
8:     ⟨σ(U[λ, ·])⟩ = M(apSoftN(⟨U'[λ, ·]⟩), apSoftD(⟨U'[λ, ·]⟩))

COMBINE. The map outputs of each DP_i, i.e., ⟨w^(i,j)⟩, are homomorphically combined ascending a tree structure, such that each DP_i aggregates its encrypted updated local weights with those of its children and sends the result to its parent. In this case, the combination function C(·) is the homomorphic addition operation. At the end of this phase, the DP at the root of the tree, DP_R, obtains the encrypted combined weights ⟨w^(·,j)⟩.

REDUCE. DP_R updates the encrypted global weights ⟨w_G^(·,j)⟩, as shown in Protocol 4. More precisely, it computes Equation 2 by using the encrypted sum of the DPs' updated local weights ⟨w^(·,j)⟩ (obtained from combine), the previous global weights ⟨w_G^(·,j−1)⟩, the pre-defined elastic rate ρ and the learning rate α. After g iterations of map, combine, and reduce, DP_R obtains the encrypted global model ⟨w_G^(·,g)⟩ and broadcasts it to the rest of the DPs.

Protocol 4 Reduce.
DP_R computes ⟨w_G^(·,j)⟩ ← Red(⟨w_G^(·,j−1)⟩, ⟨w^(·,j)⟩, ρ, α)
1: ⟨w_G^(·,j)⟩ = (1 − αρ|S|)·⟨w_G^(·,j−1)⟩ + αρ·⟨w^(·,j)⟩
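As an illustration of what apSigmoid(·) computes, the sketch below evaluates a low-degree polynomial in place of the exact sigmoid, element-wise over a slice (mirroring the SIMD evaluation over ciphertext slots). The degree-3 coefficients are illustrative values for a bounded input interval, not the ones derived by spindle's least-squares fit. Under encryption, spindle evaluates such polynomials with the baby-step giant-step algorithm of Section 5.3 rather than Horner's rule, to keep the multiplicative depth at ⌈log(d)⌉.

```go
// Cleartext sketch of a polynomial activation: a fixed low-degree polynomial
// evaluated in place of the exact sigmoid. Coefficients are illustrative only.
package activation

// coeffs[i] is the coefficient of x^i; roughly 0.5 + 0.197x − 0.004x^3 on a bounded interval.
var sigmoidApprox = []float64{0.5, 0.197, 0.0, -0.004}

// evalPoly evaluates the polynomial with Horner's rule, element-wise on a slice.
func evalPoly(coeffs, u []float64) []float64 {
	out := make([]float64, len(u))
	for i, x := range u {
		acc := 0.0
		for d := len(coeffs) - 1; d >= 0; d-- {
			acc = acc*x + coeffs[d]
		}
		out[i] = acc
	}
	return out
}

// ApSigmoid stands in for the encrypted apSigmoid(·) of Protocol 3.
func ApSigmoid(u []float64) []float64 { return evalPoly(sigmoidApprox, u) }
```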

4.2.2 PREDICTION

The querier's input data (X', ·) are encrypted with the collective public key pk. Then, ⟨X'⟩_pk is multiplied (DM(·, ·)) with the weights of the trained model ⟨w_G^(·,g)⟩ and processed through the activation function σ(·) to obtain the encrypted prediction values ⟨y'⟩ (one prediction per row of X'). The prediction results encrypted under pk are then collectively switched by the DPs to the querier's public key pk' using DKeySwitch(·), so that only the querier can decrypt ⟨y'⟩_{pk'}.

Protocol 5 prediction.
DP_R gets ⟨X'_{n'×c}⟩ from the Querier and computes ⟨y'_{n'}⟩ using ⟨w_G^(·,g)⟩
1: ⟨y'[p]⟩ = σ(DM(⟨X'[p, ·]⟩, ⟨w_G^(·,g)⟩)), for p = 0, ..., n'
2: ⟨y'⟩_{pk'} = DKeySwitch(⟨y'⟩, pk', {sk_i})

5 System Operations

We describe how spindle relies on the properties of the distributed version of ckks to efficiently address the problem of privacy-preserving distributed learning. We first describe how we optimize the protocols of Section 4.2 by choosing when to execute cryptographic operations such as rescaling and (distributed) bootstrapping. Then, we discuss how to efficiently perform the map protocol, which involves a sequence of vector-matrix multiplications and the evaluation of the activation function, in the encrypted domain.

5.1 Cryptographic Operations

As explained in Section 4.1 (and Appendix A), ciphertext multiplications incur the execution of other cryptographic operations and hence increase spindle's computation overhead. This overhead can rapidly increase when the same ciphertext is involved in sequential operations, i.e., when the operations' multiplicative depth is high. As we will describe in Section 7, spindle relies on the Lattigo [85] lattice-based cryptographic library, where a ciphertext addition or multiplication requires a few ms, whereas Rescale(·), Relin(·), and DBootstrap(·) are 1 order, 2 orders, and 1.5 orders of magnitude slower than the addition, respectively. These operations can be computationally heavy, hence their execution in the protocols should be optimized. Note that we avoid the use of the traditional centralized bootstrapping, as it would require a much more conservative parameterization for the same security level, resulting in higher computational overheads (see Section 7).

Lazy Rescaling. To maintain the precision of the encrypted values, and for efficiency, we rescale a ciphertext {⟨v⟩, τ, ∆} only when ∆ is close to q_τ. Hence, we perform a ReScale(·) only when this condition is met after a series of consecutive operations.

Relinearization. Letting the ciphertext size increase after every multiplication would add to the subsequent operations an overhead that is higher than the relinearization. Hence, to maintain the ciphertext size and degree constant, a Relin(·) operation is performed after each ciphertext-ciphertext multiplication. We note that a Relin(·) operation can be deferred if doing so incurs a lower computational complexity (e.g., if additions performed after the ciphertext-ciphertext multiplications reduce the number of ciphertexts to relinearize).
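The following sketch shows the kind of per-ciphertext bookkeeping that lazy rescaling and systematic relinearization imply. The Ciphertext struct, the threshold, and the helper names are hypothetical and do not reflect Lattigo's API.

```go
// Hypothetical bookkeeping behind lazy rescaling and systematic relinearization.
package levels

// Ciphertext tracks the metadata that drives the maintenance operations.
type Ciphertext struct {
	Level  int     // remaining moduli in the chain (L down to 0)
	Scale  float64 // current plaintext scale
	Degree int     // 1 for a fresh ciphertext; a ct-ct product has a larger degree
}

// MulThenMaintain models a ciphertext-ciphertext multiplication followed by the
// maintenance rules of Section 5.1: relinearize systematically, rescale lazily.
func MulThenMaintain(a, b Ciphertext, qTau float64) Ciphertext {
	out := Ciphertext{Level: minLevel(a, b), Scale: a.Scale * b.Scale, Degree: a.Degree + b.Degree}
	// Relin(·) after every ct-ct multiplication keeps the ciphertext degree constant.
	out.Degree = 1
	// Lazy ReScale(·): only when the accumulated scale approaches the current modulus q_τ.
	if out.Scale >= qTau/2 {
		out.Scale /= qTau
		out.Level--
	}
	// When Level reaches 0, the ciphertext must be refreshed with DBootstrap(·)
	// (see the bootstrapping strategies discussed next).
	return out
}

func minLevel(a, b Ciphertext) int {
	if a.Level < b.Level {
		return a.Level
	}
	return b.Level
}
```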

Bootstrapping. In the protocols of Section 4.2, we observe that the data providers' local weights and the model's global weights (⟨w⟩ and ⟨w_G⟩, resp.) are the only persistent ciphertexts over multiple computations and iterations. They are therefore the only ciphertexts that need to be bootstrapped, and we consider three approaches for this. With Local bootstrap (LB), each data provider (DP) bootstraps (calling a DBootstrap(·) protocol) its local weights every time they reach level τ_b during the map local iterations and before the combine. As a result, the global weights are always combined with fresh encryptions of the local weights and only need to be bootstrapped after multiple reduce. Indeed, reduce involves a multiplication by a constant, hence a Rescale(·). With Global bootstrap (GB), we use the interdependency between the local and global weights, and we bootstrap only the global weights and assign them directly to the local weights. The bootstrapping is performed on the global weights during reduce. Thus, we modify training so that map operates on the (bootstrapped) global weights, i.e., ⟨w^(i,j−1)⟩ = ⟨w_G^(·,j−1)⟩, for a DP_i at global iteration j. By following this approach, the number of bootstrap operations is reduced, with respect to the local approach, because it is performed by only one DP and depends only on the number of global iterations. However, it modifies the learning method, and it offers less flexibility, as the number of local iterations in map is constrained by the number of ciphertext multiplications required in each iteration and by the available ciphertext levels. With Hybrid bootstrap (HB), both the GB and LB approaches are combined to reduce the total number of bootstrapping operations. The global weights are bootstrapped at each global iteration (GB) and the DPs can still perform many local iterations by relying on the LB. In our experiments (Section 7.2), we observed that the effect on the trained model's accuracy depends mainly on the data and that, in most cases, enabling DPs to perform more local iterations (LB and HB) between two global updates yields better accuracy. Even though LB incurs at least |S| more executions of the DBootstrap(·), the DPs execute them in parallel and thus amortize the overhead on spindle's execution time. However, if the training of a dataset requires frequent global updates, then GB (or HB) achieves a better trade-off, see Section 7.2. Taking into account these cryptographic transformations and the strategy to optimize their use in spindle, we explain how to optimize the required number of ciphertext operations.

5.2 MAP Vector-Matrix Multiplications

Fig. 2. Packing approaches for executing Protocol 2, lines 4 and 5. We assume that c · b < N/2 and show an example with c = b = 4. Dashed elements are plaintext values, everything else is encrypted. Dup duplicates and adds, RowP packs the rows in one ciphertext, RotL(/R)&Add_i rotates the encrypted vector by i, 2i, 4i, ... to the left(/right) and at each step aggregates the result with the previous ciphertext, RotL(/R)_j rotates a vector left(/right) by j positions. P2(x) returns the next power of 2 larger than x.

As described in Section 4.1, each ckks ciphertext encrypts (or packs) a vector of values, e.g., 8,192 elements if the ring dimension is N = 2^14. This packing enables us to simultaneously perform operations on all the vector values, by using a Single-Instruction Multiple-Data (SIMD) approach for parallelization. To execute computations among values stored in different slots of the same ciphertext, e.g., an inner sum, we rely on ciphertext rotations, which have a computation cost similar to a relinearization (Relin(·)). Recall that for the execution of stochastic gradient descent, each local iteration in map involves two sequential multiplications between encrypted vectors and cleartext matrices (Protocol 2, lines 4 and 5). As a result, packing is useful for reducing the number of vector multiplications and rotations needed to perform these operations. To this end, spindle integrates two packing approaches and automatically selects the most appropriate approach at each DP during the training. We now describe these two approaches and how to choose between them, depending on the settings, i.e., the learning parameters, the number of features, and the DP's computation capabilities. Figure 2 depicts spindle's packing approaches for a toy example of the computation of ⟨u⟩ (Protocol 2, line 4), whose result is activated (i.e., σ(⟨u⟩)) before being used in the computation of ⟨v⟩ (Protocol 2, line 5), for a setting with c = b = 4. For clarity, we assume that a vector of c (number of features) or b (batch size) elements can be encoded in one ciphertext (or plaintext), i.e., max(c, b) ≤ N/2.

Row-Based Approach (RBA). This approach was proposed by Kim et al. [63]. The input matrices (B and αB^T) are packed row-wise, and multiple rows are packed in one plaintext ((a) in the upper part of Figure 2), i.e., the number of plaintexts required to encode the input matrix is ⌈(c·b·2)/N⌉. Each plaintext is then multiplied with a ciphertext containing the replicated weights vector (b), such that the number of replicas is equal to the number of rows in B. To obtain the results of the dot products between each weights vector and row of B, a partial inner sum is performed by adding the resulting ciphertext to rotated versions of itself (c). The values in between the dot-product results are eliminated (i.e., masked) through a multiplication with a binary vector (d), and the dot-product results are duplicated in the ciphertext (e) such that it can be activated (σ(·)) and used directly for the multiplication with αX^T (f). The result is then rotated and added to itself (g) such that it can be masked (h) to obtain ⟨v⟩. As shown in Figure 2, the total number of vector multiplications is ⌈(c·b·2)/N⌉ · 4, whereas the number of ciphertext rotations is ⌈(c·b·2)/N⌉ · 2 · (log(b) + log(c)). This approach has a multiplicative depth of a_m + 4, where a_m denotes the depth of the activation function σ(·).
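The RBA costs just stated can be computed with a small helper. This is a sketch that simply encodes the formulas above and assumes b and c are powers of two, as in Figure 2; the diagonal approach, described next, has a different cost profile.

```go
// Cost model of the Row-Based Approach (RBA), as stated above.
package packing

import "math"

type RBACost struct {
	Plaintexts int // ⌈(c·b·2)/N⌉ packed plaintexts for the input matrix
	Mults      int // ⌈(c·b·2)/N⌉ · 4 vector multiplications
	Rotations  int // ⌈(c·b·2)/N⌉ · 2 · (log2(b) + log2(c)) ciphertext rotations
	Depth      int // a_m + 4, with a_m the depth of the activation σ(·)
}

// RBACosts assumes b and c are powers of two (so that log2 is exact).
func RBACosts(c, b, N, activationDepth int) RBACost {
	p := int(math.Ceil(float64(2*c*b) / float64(N)))
	logB := int(math.Log2(float64(b)))
	logC := int(math.Log2(float64(c)))
	return RBACost{
		Plaintexts: p,
		Mults:      p * 4,
		Rotations:  p * 2 * (logB + logC),
		Depth:      activationDepth + 4,
	}
}
```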

Diagonal Approach (DA). This approach was presented by Halevi and Shoup [51] as an optimized homomorphic vector-matrix-multiplication evaluation. It optimizes the number of ciphertext rotations by transforming the input plaintext matrix B. In particular, B is diagonalized, and each line is rotated ((a') in the lower part of Figure 2) so that they can be independently multiplied with the (rotated) weights vector (b'). The resulting ciphertexts are aggregated and rotated to obtain ⟨u⟩ (c'), and a similar approach is used to compute ⟨v⟩ after the activation. As shown in Figure 2, DA only executes 2·((N1 − 1) + (N2 − 1)) rotations on the encrypted vector, with N1 = P2(max(c, b))/N2 and N2 = ⌊√(P2(max(c, b)))⌋, where P2(x) returns the next power of 2 larger than x. This approach involves N1·N2 plaintext-ciphertext multiplications on independent ciphertexts and does not require any masking, which results in a multiplicative depth of a_m + 2. Therefore, this approach consumes fewer levels than RBA.

In both approaches, the number of rotations and multiplications depends on the batch size b and the number of features c. DA almost always requires more multiplications than RBA and uses more rotations after a certain c (e.g., if b = 8, the break-even happens at c = 64). However, as DA is embarrassingly parallelizable for both multiplications and rotations (with rotations being the most time-consuming operations), the computations can be amortized on multiple threads. Taking this into account, spindle automatically chooses, based on c, b, and the number of available threads, the best approach at each DP. We analyze these trade-offs in Section 7.

5.3 Optimized Activation Function

As described in Section 4.2, to enable their execution under FHE, we approximate the sigmoid (apSigmoid(·)) and softmax (apMax(·), apSoftN(·), apSoftD(·)) activation functions with least-squares and Chebyshev polynomial approximations (PA), respectively. We adapt the baby-step giant-step algorithm introduced by Han and Ki [52] to enable the minimum-complexity computation of degree-d polynomials (multiplicative depth of ⌈log(d)⌉ for d ≤ 7, and depth ⌈log(d) + 1⌉ otherwise). Protocol 6 in Appendix B computes the (element-wise) exponentiation of the encrypted input vector before recursively computing the polynomial approximation.

6 System Configuration

We discuss how to parameterize spindle by taking into account the interdependencies between the input data and the learning and cryptographic parameters. We then discuss two modular functionalities of spindle, namely data outsourcing and model release.

Fig. 3. System parameters graph. Circles and dotted circles represent learning and cryptographic parameters, respectively.

Parameter Selection. spindle relies on the configuration of (a) cryptographic parameters that determine its security level, and (b) learning parameters that affect the accuracy of the training and evaluation of the models. Both are tightly linked, and we capture these relations in a graph-based model, displayed in Figure 3, where vertices and edges represent the parameters and their interdependence, respectively. For simplicity, we present a directed graph that depicts our empirical method for choosing the parameters (see Appendix G, Table 6 for notation symbols). We highlight that the corresponding non-directed graph is more generic and simply captures the main relations among the parameters. We observe two main clusters: the cryptographic parameters on the upper part of the graph (dotted circles) and the learning parameters (circles) on the lower one. The input data and their intrinsic characteristics, i.e., the number of features c or precision (bits of precision required to represent the data), are connected with both clusters, which are also interconnected through the plaintext scale ∆. As such, there are various ways to configure the overall system parameters.

In our case, we decide to first choose N (ciphertext polynomial degree), such that at least c elements can be packed in one ciphertext. Q (ciphertext modulus) and η (fresh encryption noise) are then fixed to ensure a sufficient level of security (e.g., 128 bits) following the accepted parameterization from the homomorphic encryption standard whitepaper [4]. The scale ∆ is configured to provide enough precision for the input data X, and mc (moduli chain) and L (number of levels) are set accordingly. The intervals [a_i, g_i] used for the approximations of the activation functions are defined according to X. The approximation degrees d are then set depending on these intervals and the available number of levels L. The remaining learning parameters (α, ρ, b, g, m) are agreed upon by the data providers based on their observation of their part of the dataset. Note that the minimum values for the learning rate α and the elastic rate ρ are limited by the scale ∆; if they are too small, the system might not have enough precision to handle their multiplication with the input data.
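One possible way to encode this selection order is a single configuration object filled in the sequence described above. This is a sketch: the security-related example values mirror the sp1 parameter set reported in Section 7.2, while the learning values are placeholders to be agreed upon by the DPs; the type and field names are ours.

```go
// Illustrative encoding of the parameter-selection order described above.
package config

type Params struct {
	// Cryptographic parameters, fixed first (for at least 128-bit security).
	LogN   int     // ring dimension N = 2^LogN, chosen so that c ≤ N/2 slots
	LogQ   int     // ciphertext modulus Q (bits), per the HE-standard parameterization
	Eta    float64 // fresh-encryption noise standard deviation η
	LogDel int     // plaintext scale Δ (bits), enough precision for X
	Levels int     // L, implied by the moduli chain mc

	// Approximation and learning parameters, agreed upon by the DPs.
	Degree int     // activation approximation degree d, bounded by Levels
	Alpha  float64 // learning rate α (its minimum is limited by Δ)
	Rho    float64 // elastic rate ρ
	Batch  int     // batch size b
	Global int     // global iterations g
	Local  int     // local iterations m
}

// ExampleSP1 shows one consistent choice: the security values follow the sp1 set
// of Section 7.2; the learning values are placeholders to be set per dataset.
var ExampleSP1 = Params{
	LogN: 14, LogQ: 438, Eta: 3.2, LogDel: 34, Levels: 9,
	Degree: 5, Alpha: 0.01, Rho: 0.1, Batch: 32, Global: 10, Local: 10,
}

// FitsOneCiphertext checks the first constraint of the selection order:
// the c features must fit in the N/2 ciphertext slots.
func (p Params) FitsOneCiphertext(c int) bool { return c <= 1<<(p.LogN-1) }
```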

Data Outsourcing. spindle's protocols (Section 4.2) seamlessly work with data providers (DPs) that either have their input data X in cleartext, or that obtain data ⟨X⟩_pk encrypted under the public collective key from their respective owners. In the latter case, spindle enables both secure data storage and computation outsourcing to always-available untrusted cloud providers. It distributes the workload among multiple data providers and is still able to rely on efficient multiparty homomorphic-encryption operations, e.g., DBootstrap(·). We note that operating on encrypted input data affects the complexity of map, as all the multiplication operations (Protocol 2) would happen between ciphertexts, instead of between the cleartext inputs and ciphertexts.

Model Release. By default, the trained model in spindle is kept secret from any entity, enabling privacy-preserving predictions on (private) evaluation data input by the querier and offering end-to-end model confidentiality. If required by the application setting, spindle can also reveal the trained model to the querier or to a third party. This is collectively enabled by the DPs, who perform a DKeySwitch(·).

7 System Evaluation

We first analyze the theoretical complexity of spindle before moving to the empirical evaluation of its prototype and its comparison with existing solutions.

7.1 Theoretical Analysis

We refer to Table 4a (Appendices E.1 and E.2) for the full complexity analysis of spindle's protocols. We discuss here its main outcomes.

Communication Complexity. spindle's communication complexity depends linearly on the number of data providers |S|, the number of iterations (g, m) and the ciphertext size |ct|. In map, the only communication between the DPs is due to the DBootstrap(·), which requires two rounds of communication of one ciphertext (ct) between the |S| DPs (i.e., 2·(|S| − 1)·|ct|). In combine and reduce, the DPs exchange one ciphertext in respectively one and two rounds. Finally, the prediction requires the exchange of one ciphertext between a DP and the querier and one DKeySwitch(·) operation, i.e., 2 ciphertexts are sent per DP.

Computation Complexity. spindle's most intensive computational part is map; its complexity depends linearly on the number of DPs |S| and the number of iterations, and logarithmically on the number of features c and batch size b; all these parameters depend also on the dataset size. As shown in Section 5.2, the DA packing approach incurs a higher computation complexity but is embarrassingly parallel and can be more time-efficient than RBA depending on the available threads. The activation function is the only operation that requires ciphertext-ciphertext multiplications; its complexity depends logarithmically on the approximation degree. We empirically study the link between the approximation degree and the training accuracy in Section 7.2. spindle's other steps and protocols only involve lightweight operations, i.e., ciphertext additions and multiplications with plaintext values.

7.2 Empirical Evaluation

We implemented spindle in Go [46]. Our implementation builds on top of Lattigo [85], an open-source Go library for lattice-based cryptography, and Onet [96], an open-source Go library for building decentralized systems. The communication between data providers (DPs) is done through TCP with secure channels (using TLS). We evaluate our prototype on an emulated realistic network, with a bandwidth of 1 Gbps between every two nodes, using Mininet [86]. We deploy spindle on 5 Linux machines with Intel Xeon E5-2680 v3 CPUs running at 2.5 GHz with 24 threads on 12 cores and 256 GB of RAM, on which we evenly distribute the DPs. We first provide micro-benchmarks of spindle's cryptographic operations before assessing spindle's accuracy and performance by testing it on multiple publicly available datasets: CalCOFI [20] for linear regression, BCW [10], PIMA [101] and ESR [37] for logistic regression, and MNIST [70] for multinomial regression (see Appendix E.3 for details on the datasets). We then show spindle's scalability by using randomly generated (larger) datasets with up to 8,192 features and 4 million data samples. Our evaluation shows spindle's practicality for large-dimensional datasets, making it suitable for demanding learning tasks such as the training on imaging or genomic datasets [11, 36, 71].

We employ two sets of security parameters (SP), both ensuring 128-bit security: sp1: (N = 2^14, Q = 2^438, η = 3.2, number of levels L = 9, scale ∆ = 2^34, degree of the approximated activation function d = 5) and sp2: (N = 2^13, Q = 2^218, η = 3.2, L = 6, ∆ = 2^30, d = 3). sp2 is sufficient for linear regression and for specific logistic regression models that accept a low-