Popcorn: Paillier Meets Compression For Efficient Oblivious Neural Network Inference


Jun Wang (1), Chao Jin (2), Souhail Meftah (2), Khin Mi Mi Aung (2)
(1) Post, Luxembourg
(2) Institute for Infocomm Research, A*STAR
junwang.lu@gmail.com, jin_chao@i2r.a-star.edu.sg

Abstract

Oblivious inference enables the cloud to provide neural network inference-as-a-service (NN-IaaS), whilst neither disclosing the client data nor revealing the server's model. However, the privacy guarantee under oblivious inference usually comes with a heavy cost in efficiency and accuracy.

We propose Popcorn, a concise oblivious inference framework entirely built on the Paillier homomorphic encryption scheme. We design a suite of novel protocols to compute non-linear activation and max-pooling layers. We leverage neural network compression techniques (i.e., neural weight pruning and quantization) to accelerate the inference computation. To implement the Popcorn framework, we only need to replace the algebraic operations of existing networks with their corresponding Paillier homomorphic operations, which is extremely friendly for engineering development.

We first conduct the performance evaluation and comparison based on the MNIST and CIFAR-10 classification tasks. Compared with existing solutions, Popcorn brings a significant communication overhead reduction, with a moderate runtime increase. Then, we benchmark the performance of oblivious inference on ImageNet. To our best knowledge, this is the first report based on a commercial-level dataset, taking a step towards deployment to production.

1 Introduction

Deep convolutional neural networks have achieved tremendous success in various domains such as facial recognition [33], medical diagnosis [9], and image classification [13], whilst demonstrating beyond-experts performance. The massive (labeled) training data and extensive computational resources are the fuel for breakthroughs in accuracy. However, it also becomes a notorious challenge for individuals and non-specialised institutions to train and deploy state-of-the-art models. Thanks to the advances in cloud computing, companies with sufficient computing power and expertise (e.g., Google and Amazon) can provide machine learning services to the public. NN-IaaS is an important business paradigm, in which a server provides machine learning prediction APIs, built upon its pre-trained machine learning model, to the clients. The latter upload their data to the server and receive predictions by calling the APIs.

Usually, business and privacy protection requirements may prevent the server from providing any information other than prediction results. Similarly, the clients are protective of their private data and cannot disclose it to the server. This dilemma significantly limits the use of cloud-based services, leading to an urgent need for privacy-preserving NN-IaaS.

Homomorphic encryption (HE) [10], Garbled Circuits (GC) [3] and secret sharing (SS) [2] are the workhorses driving many exciting recent advances in oblivious neural network inference [11, 16, 21, 30]. CryptoNets [11] and its variant [6] adopted HE to support privacy-preserving neural network predictions. Since HE operations are constrained to additions and multiplications, the non-linear activation function (e.g., relu, i.e., max(x, 0)) is substituted with a low-degree polynomial, and the max-pooling function is replaced by the mean-pooling function. This approach requires modifying the original model architecture, significantly impacting accuracy. XONN [30] exploited the fact that the XNOR operation is free in the GC protocol [17] to efficiently evaluate binarized neural networks. This approach, however, requires both the weights and activation values to be binarized, which harms accuracy performance. Moreover, completely compiling a neural network into circuits increases communication overhead. MiniONN [21] and Gazelle [16] combined different primitives to keep neural networks unchanged. In their methods, GC is often adopted to compute non-linear layers. It is worth highlighting that existing methods always leak some information beyond the predictions. For example, in CryptoNets, the client can make inferences about the model, as it would have to generate parameters for the encryption according to the model architecture. In XONN and MiniONN, the client can learn the exact network architecture. Gazelle only reveals the number of neurons of each neural layer. Though there is always some form of information leakage, it is believed that less leakage is preferable [16, 30].

Engineering complexity is also an important problem to be addressed. For instance, Gazelle relies on sophisticated packing schemes, and its implementation depends on specific network parameters. Sometimes, it also requires performing trade-offs between efficiency and privacy. XONN needs to compile a whole network into circuits, and the effort is non-trivial; the computational cost is also completely transferred from the server to the client.

In this paper, we introduce Popcorn, a concise oblivious inference framework that is entirely built upon the Paillier homomorphic encryption scheme [27]. Popcorn is a non-invasive solution which does not require modifications to network architectures (e.g., approximating non-linear activation functions with polynomials). The security model of Popcorn is consistent with Gazelle [16], i.e., the protocol hides the network weights and architecture except for the number of neurons of each layer. The main contribution of this paper consists of three aspects:

• We introduce a suite of Popcorn protocols for efficiently computing non-linear neural network layers (e.g., the relu activation layer and max-pooling layer).

• Under the Popcorn framework, we leverage network compression (e.g., weight pruning and quantization) to accelerate the computation of linear layers (e.g., the convolutional layer and fully-connected layer).

• We benchmark the oblivious inference performance on the ImageNet dataset, using state-of-the-art models. To our best knowledge, this is the first report on privacy-preserving ImageNet-scale classification tasks.

Compared with existing solutions, an important contribution of Popcorn is that its engineering is extremely simple. The framework completely relies on the Paillier HE scheme. In the implementation, we only need to substitute the algebraic operations in plaintext inference with the corresponding HE operations, making (machine learning) engineers agnostic to obscure cryptography knowledge. Although the framework is built on the Paillier scheme in this paper, we can directly adopt other HE schemes. For example, in the case where a client has a number of images (e.g., >1000) to classify, we can also adopt a lattice-based additive HE scheme (e.g., [16]) to pack multiple images into one ciphertext, amortizing the computational and communication overhead.

2 Preliminaries

2.1 Convolutional Neural Network

A typical convolutional neural network (CNN) consists of a sequence of linear and non-linear layers and computes classifications in the last layer. In this section, we describe the functionality of neural network layers.

2.1.1 Linear Layers

Linear operations in networks are often carried out in fully-connected layers (fc) and convolutional layers (conv).

The conv layer is composed of a 3D tensor input (in the form of R^(w_i, h_i, c_i)), a set of 3D tensor filters (in the form of R^(f_w, f_h, f_c), s.t. f_w < w_i, f_h < h_i, f_c = c_i) to extract local features from the input, and a 3D tensor output (in the form of R^(w_o, h_o, c_o), where c_o is the number of channels of the output, which is also the number of filters). Each channel of the output is obtained by a filter that convolves the input along the directions of w_i and h_i (i.e., the width and height of the input). Specifically, each element of a channel is calculated by point-wise multiplying a filter with its receptive field of the input and accumulating. The fc layer is a matrix-vector multiplication as follows,

    y = W·x + b    (1)

where x ∈ R^(m×1) is the input vector, y ∈ R^(n×1) is the output, W ∈ R^(n×m) is the weight matrix and b ∈ R^(n×1) is the bias vector. In fact, a conv layer can also be written in the form of a matrix-vector multiplication, as defined in Eq. (1).

In addition to conv and fc layers, the batch-normalization (bn) [15] layer can be seen as a linear layer. It is often adopted to normalize the output of a linear layer by re-centering and re-scaling as follows,

    y = γ · (x − µ_B)/√(δ²_B + ε) + β
      = (γ/√(δ²_B + ε)) · x − (γ·µ_B/√(δ²_B + ε) − β)    (2)

where µ_B and δ²_B are the per-dimension mean and variance of a mini-batch, respectively. ε is an arbitrarily small constant added to the denominator for numerical stability. γ and β are the re-scaling and re-centering parameters subsequently learned in the optimization process. In the inference phase, all the parameters above are constant. Therefore, the bn computation is a linear computation during inference and can be easily absorbed into its previous layer and next layer [14].

2.1.2 Non-linear Layers

The activation layer and the pooling layer are two common non-linear layers. The activation layer performs a non-linear transformation on its inputs element-wise. It does not change the input dimension. The relu (i.e., max(x, 0)) activation function is widely adopted as the non-linear transformation function. Different from the activation layer, the pooling layer is for feature dimension reduction. Max-pooling (mp) is a popular pooling method; it outputs the maximal value of each local perception area of its inputs. Suppose the local perception area has m inputs; the max-pooling outputs the result of max({x_1, x_2, ..., x_m}). There is also another pooling operation, i.e., mean-pooling, which outputs the average value of each local perception area. The mean-pooling layer can be treated as a special simplified convolution layer with a filter in which the weights share the same value. In practice, mean-pooling can be further simplified and replaced with sum-pooling, which outputs the sum of the perception area.

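To make the layer functions above concrete, here is a minimal NumPy sketch of fc, relu and pooling as just described. The function names, the single-channel restriction and the absence of padding are our simplifications for illustration, not part of Popcorn.

```python
# Plain NumPy sketches of the layer functions of Section 2.1.
import numpy as np

def fc(W, x, b):
    # fully-connected layer, Eq. (1): y = W.x + b
    return W @ x + b

def relu(x):
    # element-wise non-linear transformation max(x, 0)
    return np.maximum(x, 0)

def pool(x, t=2, s=2, op=np.max):
    # slide a t x t window with stride s; op = np.max gives max-pooling,
    # op = np.mean gives mean-pooling (a uniform-weight convolution)
    h, w = x.shape
    out = np.empty(((h - t) // s + 1, (w - t) // s + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = op(x[i*s:i*s+t, j*s:j*s+t])
    return out

x = np.array([[1., -2., 3., 0.],
              [4., 5., -6., 7.],
              [0., 1., 2., 3.],
              [8., -1., 0., 2.]])
assert pool(relu(x)).shape == (2, 2)   # 4x4 -> 2x2, as in Section 2.1.2
```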
2.2 Neural Network Compression

To accelerate neural network predictions and minimize network size, compression techniques have been developed to discover and remove unimportant weights (i.e., network pruning), and to represent weights with fewer bits (i.e., weight quantization), without noticeably decreasing accuracy performance [12, 36].

Network Pruning. Weight pruning and filter pruning (a.k.a. channel pruning) are the two main network pruning methods [36]. The former investigates the removal of each individual weight (fine-grained pruning). Normally, the weights that have a small magnitude or contribute less to the network loss function are removed first [12, 20]. The latter investigates removing entire filters (coarse-grained pruning). Usually, filters that frequently generate zero outputs after the relu layer are removed first [36]. Generally, the weight pruning approach can remove more weights than the filter pruning approach does, but the filter pruning methods can remove more neurons (we call each element of any output layer a neuron). Though both pruning approaches can benefit the Popcorn framework, we focus on the weight pruning methods in this paper.

Weight Quantization. The quantization methods aim to reduce the number of bits used to represent neural weights; they can be roughly categorized into two approaches: floating-point preserving [12] and integer-based [18]. In the former, the weights are quantized into a small number of bins (e.g., 128), and all the weights in the same bin share the same value. Thus, for each weight, we need to store only a small index into a table of shared weights. In the latter, the weights are first quantized into integer representations and then recovered into a more expressive form (e.g., floating-point or more levels) through a de-quantization process during the inference phase [18]. Binarized quantization is an extreme case, where the weights or activation values are represented in 1 bit [29], at a cost of accuracy loss. In this work, we leverage the floating-point preserving approach and binarized neural networks to accelerate oblivious neural network inferences.

2.3 Paillier Homomorphic Encryption

Homomorphic encryption (HE) is a form of encryption that allows computations to be carried out over ciphertexts without decryption. The result, after decryption, is the same as if the operations had been performed on the plaintexts. The Paillier cryptosystem [27] is a well-developed additive HE scheme. We describe the Paillier HE scheme in the form of {KeyGen, HEnc, HAdd, HMul, HDec}.

• KeyGen (Generate keys: (pk, sk)).

1. Choose two large prime numbers p and q randomly and independently, such that gcd(pq, (p − 1)(q − 1)) = 1.

2. Compute n = pq and λ = lcm(p − 1, q − 1), where lcm means least common multiple.

3. Obtain the public key pk = (n, g) and the private key sk = (p, q, λ). g ∈ Z*_{n²} is an element whose order is a multiple of n (g = n + 1 is a common choice).

• HEnc (Encryption: JmK := HEnc(m, pk)).

1. Let m be a message to be encrypted, where 0 ≤ m < n.

2. Select a random r ∈ Z*_n, s.t. gcd(r, n) = 1.

3. Compute the ciphertext of m: JmK ← g^m · r^n mod n².

• HDec (Decryption: m := HDec(JmK, sk)).

1. Let JmK be a ciphertext to be decrypted, where JmK < n².

2. Compute the plaintext of JmK: m ← L(JmK^λ mod n²) / L(g^λ mod n²) mod n, where L(x) = (x − 1)/n.

• HAdd (HE addition)

1. m_1 + m_2 = HDec(Jm_1K ⊕ Jm_2K, sk). ⊕ is the HE addition operator.

• HMul (HE multiplication by a plaintext)

1. m_1 × m_2 = HDec(Jm_1K ⊗ m_2, sk). ⊗ is the HE (plaintext-ciphertext) multiplication operator.

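To ground the protocols that follow, here is a toy Python (3.8+) sketch of the Paillier primitives from Section 2.3. It is deliberately insecure and for illustration only: the primes are tiny and hard-coded, g is fixed to the standard choice n + 1, and no hardening is applied; a real deployment would use a vetted library (the paper uses OPHELib with 2048-bit keys).

```python
# Toy Paillier, following the {KeyGen, HEnc, HDec, HAdd, HMul} description.
import math
import random

def keygen(p=1000003, q=1000033):
    # two small primes, hard-coded for reproducibility of the sketch
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    g = n + 1                                          # standard choice of g
    return (n, g), (p, q, lam)

def henc(m, pk):
    # JmK := g^m * r^n mod n^2, with a fresh random r in Z*_n
    n, g = pk
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return pow(g, m % n, n * n) * pow(r, n, n * n) % (n * n)

def hdec(c, pk, sk):
    # m := L(c^lam mod n^2) / L(g^lam mod n^2) mod n, with L(x) = (x-1)/n
    n, g = pk
    lam = sk[2]
    L = lambda u: (u - 1) // n
    mu = pow(L(pow(g, lam, n * n)), -1, n)
    return L(pow(c, lam, n * n)) * mu % n

def hadd(c1, c2, pk):
    # ciphertext-ciphertext addition: Enc(m1 + m2) = c1 * c2 mod n^2
    n, _ = pk
    return c1 * c2 % (n * n)

def hmul(c, k, pk):
    # plaintext-ciphertext multiplication: Enc(k * m) = c^k mod n^2
    n, _ = pk
    return pow(c, k % n, n * n)

pk, sk = keygen()
c1, c2 = henc(40, pk), henc(2, pk)
assert hdec(hadd(c1, c2, pk), pk, sk) == 42
assert hdec(hmul(c1, 3, pk), pk, sk) == 120
```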
3 Popcorn: Fast Non-linear Computation

In this section, we present our fast methods for non-linear layer computations. The computation protocols are built upon the Paillier HE scheme [27].

3.1 Fast Secure relu Computation Protocol

Problem 1. Given a vector JxK^{m×1} which is element-wisely encrypted under the Paillier HE scheme (where x_i ∈ Z*_n), the server computes relu(JxK^{m×1}), without leaking the information of any element x_i. In this setting, the client holds the secret key sk; the server holds the public key pk but cannot access the sk.

We describe the secure computation of the relu activation layer in Problem 1, presenting the inputs in the form of a vector. The relu layer performs a non-linear transformation on each element of its inputs, i.e., {max(x_i, 0)} for i = 0, ..., m − 1. To solve Problem 1, we start from a simple multiplicative-obfuscation protocol for each x_i (named Version one), which is outlined in Fig. 1. For denotation succinctness, we omit the subscript of x_i, and we assume the implementation of Paillier HE supports the encoding of negative integers.

    Server (JxK, pk):
      1  τ ←$ Z+_n  s.t. gcd(τ, n) = 1
      2  JyK_s := JxK ⊗ τ
      3  send JyK_s to the client
    Client (JyK_s, pk, sk):
      4  y_s := HDec(JyK_s, sk)
      5  y_c := y_s if y_s > 0 else 0
      6  JyK_c := HEnc(y_c, pk)
      7  send JyK_c to the server
    Server (JyK_c, pk):
      8  JyK_s := JyK_c ⊗ τ^{-1}

Figure 1: Secure relu computation (Version one). τ^{-1} is the multiplicative inverse of τ; n is a large modulus.

In Version one, the server first samples a random positive integer τ ∈ Z+_n, then it blinds JxK with τ (i.e., JyK_s := JxK ⊗ τ) and sends the result to the client. After receiving JyK_s, the client decrypts it and returns Jmax(y_s, 0)K to the server. The server removes the τ by multiplying with τ^{-1}, where τ^{-1} is the multiplicative inverse of τ. We emphasize that τ is independently and randomly sampled for each element x.

If y_c = 0 (line 5), the correctness proof is exactly the same as the decryption of the Paillier HE scheme. Therefore, we focus the correctness proof on JyK_s := JyK_c ⊗ τ^{-1} (line 8), where y_c ≠ 0, i.e., the decryption of JyK_s should be x. Clearly, if we can eliminate the random factor τ when performing decryption, it is a normal Paillier decryption process (see Section 2.3, HDec) and the correctness is guaranteed.

Correctness Proof of Version one. To decrypt JyK_s, we need to compute (JyK_s^λ mod n²) as follows (recall y_c = x·τ):

    JyK_s^λ = (g^{xτ} · r^n)^{τ^{-1}·λ} mod n²
            = g^{λ·x·τ·τ^{-1}} · r^{λ·n·τ^{-1}} mod n²
            ≡ g^{λ·x·τ·τ^{-1}} mod n²
            = (1 + n)^{λ·x·τ·τ^{-1}} mod n²      (taking g = 1 + n)
            = 1 + n·λ·x·τ·τ^{-1} mod n²
            = 1 + n·λ·x mod n²                   (since τ·τ^{-1} ≡ 1 mod n)

As above, the random factor τ has been eliminated by its multiplicative inverse τ^{-1}; the rest is exactly the same as the decryption of the Paillier HE scheme. The decryption of JyK_s is x; the correctness is established.

As the proof above shows, with protocol Version one, the server correctly gets max(JxK, 0) and learns nothing about x. However, since τ ∈ Z+_n, the client learns two pieces of information about x: (1) the sign of x, and (2) whether x = 0 or not (denoted as x ?= 0). In the rest of this subsection, we first introduce the method to hide the sign of x, then the method to disguise x ?= 0.

3.1.1 Hide the Sign of x

The root cause for revealing the sign information of x is that the blind factor τ is restricted to be positive (i.e., τ ∈ Z+_n). To hide the sign information of x, we revise Version one of the protocol (see Fig. 1) to allow the blind factor τ to be a nonzero integer (e.g., τ ∈ Z*_n, s.t. gcd(τ, n) = 1). We outline the new version, named Version two, in Fig. 2.

    Server (JxK, pk):
      1  τ ←$ Z*_n  s.t. gcd(τ, n) = 1
      2  JyK_s := JxK ⊗ τ
      3  send JyK_s to the client
    Client (JyK_s, pk, sk):
      4  y_s := HDec(JyK_s, sk)
      5  y_c := y_s if y_s > 0 else 0
      6  JyK_c := HEnc(y_c, pk)
      7  send JyK_c to the server
    Server (JyK_c, pk):
      8  JyK_s := JyK_c ⊗ τ^{-1}
      9  if τ < 0:
     10    JyK_s := JxK − JyK_s

Figure 2: Version two: hide the sign information of x.

Compared to protocol Version one (Fig. 1), the operations at the client side remain the same. The key difference is that when the server receives the response (i.e., JyK_c) from the client, the server computes the relu activation depending on the blind factor τ. If τ > 0, the relu calculation is the same as in Version one of the protocol; if τ < 0, it means the client always provides the server the opposite information (e.g., when x > 0, the client returns J0K). Therefore, the server computes the relu in an opposite way, JyK_s := JxK − JyK_c ⊗ τ^{-1}. If τ > 0, the proof is the same as for Version one. Here, we focus the proof of Version two on τ < 0.

Correctness Proof of Version two.
If x > 0, the client returns JyK_c = J0K; the server computes JyK_s := JxK − J0K ⊗ τ^{-1}, obtaining JyK_s = JxK as expected (i.e., max(x, 0)).
If x ≤ 0, the client returns JyK_c = Jx·τK (line 5, y_s = x·τ); the server computes JyK_s := JxK − Jx·τK ⊗ τ^{-1}, getting JyK_s = J0K as expected. The correctness is established.

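The following sketch simulates one Version-two exchange end-to-end, reusing the toy keygen/henc/hdec/hadd/hmul helpers sketched after Section 2.3. The signed encoding (residues above n/2 read as negatives), the small blinding range, and the final decryption (used here only to check the result; the real server keeps JyK_s encrypted) are our simplifications.

```python
# Simulation of protocol Version two (Fig. 2) with the toy Paillier helpers.
import math
import random

def to_signed(m, n):
    # interpret residues above n/2 as negative integers
    return m - n if m > n // 2 else m

def secure_relu_v2(x, pk, sk):
    n, _ = pk
    cx = henc(x % n, pk)                       # server holds JxK
    # server: blind JxK with a nonzero (possibly negative) factor tau
    tau = 0
    while tau == 0 or math.gcd(tau % n, n) != 1:
        tau = random.randrange(-1000, 1001)
    cy_s = hmul(cx, tau, pk)                   # JyK_s := JxK (x) tau
    # client: decrypt, keep only the positive part, re-encrypt
    y_s = to_signed(hdec(cy_s, pk, sk), n)
    cy_c = henc(max(y_s, 0) % n, pk)
    # server: unblind; if tau < 0 the client's answer is "opposite"
    cy = hmul(cy_c, pow(tau % n, -1, n), pk)   # JyK_c (x) tau^-1
    if tau < 0:
        cy = hadd(cx, hmul(cy, -1, pk), pk)    # JxK - JyK_c (x) tau^-1
    return to_signed(hdec(cy, pk, sk), n)      # decrypt only to verify

pk, sk = keygen()
assert all(secure_relu_v2(x, pk, sk) == max(x, 0) for x in (-7, 0, 42))
```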
3.1.2 Hide x_i ?= 0

Straightforwardly applying protocol Version two (see Fig. 2) to Problem 1 leaks x_i ?= 0. We address this issue by randomly shuffling the input elements fed to the relu layer. After the shuffling, the client cannot trace where a value was originally placed (i.e., it is un-traceable), thus hiding whether the true value of a specific slot is zero or not. By this, the client only learns the number of zero values, which can itself be hidden by adding dummy elements. Formally, we describe the un-traceability of x_i ∈ x^{m×1} in Definition 1.

Definition 1. The location of x_i ∈ x^{m×1} is un-traceable if an observer is unable to distinguish y_i ∈ y^{m×1} from a random value, where y_i = x'_i · τ_i and x'_i is the value at slot i of x^{m×1} after random shuffling, not x_i itself.

To achieve the un-traceability, the server needs a pair of uniform-random shuffling functions (Π, Π^{-1}) as follows,

    Jx'K^{m×1} ← Π(JxK^{m×1}, γ)
    JxK^{m×1} ← Π^{-1}(Jx'K^{m×1}, γ)    (3)

where γ is a private random seed.

Specifically, given an input vector JxK^{m×1}, the server first adds t dummy elements (of which a random portion are set to 0) to the input vector, getting JxK^{(m+t)×1}. This step hides the number of elements with value 0. For denotation succinctness, we let m = m + t, i.e., we continue to denote the new input vector as JxK^{m×1}. Second, the server samples a private random seed γ and applies the random shuffling function Π to the input vector JxK^{m×1}, obtaining Jx'K^{m×1}. Third, the server blinds each Jx'_iK ∈ Jx'K with an independently and randomly sampled integer τ_i ∈ Z*_n s.t. τ_i ≠ 0 (Jy_iK_s = Jx'_iK ⊗ τ_i, see Fig. 2). With the random shuffling and the one-time random mask (i.e., τ_i), it is clear that y_i is indistinguishable from a random value (i.e., x_i ∈ x^{m×1} is un-traceable), thus hiding the sign information of x_i. We emphasize that, after applying the shuffling function Π to the input array x^{m×1}, x'_i is the value at slot i (y_i = x'_i · τ_i), not x_i itself.

Now we have introduced the complete version of our secure relu computation protocol, outlined in Fig. 3.

    Server (JxK^{m×1}, pk):
      1  γ ←$ Z*                          ▷ seed for shuffling
      2  Jx'K^{m×1} ← Π(JxK^{m×1}, γ)
      3  foreach Jx'_iK ∈ Jx'K^{m×1}:
      4    τ_i ←$ Z*_n  s.t. gcd(τ_i, n) = 1
      5    Jy_iK_s := Jx'_iK ⊗ τ_i
      6  end
      7  send JyK_s^{m×1} to the client
    Client (JyK_s^{m×1}, pk, sk):
      8  foreach Jy_iK_s ∈ JyK_s^{m×1}:
      9    y_i := HDec(Jy_iK_s, sk)
     10    y_i := y_i if y_i > 0 else 0
     11    Jy_iK_c := HEnc(y_i, pk)
     12  end
     13  send JyK_c^{m×1} to the server
    Server (JyK_c^{m×1}, pk):
     14  foreach Jy_iK_c ∈ JyK_c^{m×1}:
     15    if τ_i > 0:
     16      Jx'_iK := Jy_iK_c ⊗ τ_i^{-1}
     17    else:
     18      Jx'_iK := Jx'_iK − Jy_iK_c ⊗ τ_i^{-1}
     19  end
     20  JxK^{m×1} ← Π^{-1}(Jx'K^{m×1}, γ)
     21  ▷ trim the dummies in JxK^{m×1}

Figure 3: Secure relu computation (Complete version). We assume the JxK^{m×1} already contains dummies.

3.2 Efficient Max-Pooling (mp)

Problem 2. Given a matrix JXK^{m×m} which is element-wisely encrypted under the Paillier HE scheme (x_ij ∈ Z*_n), the server computes mp(JXK^{m×m}, t, s), without leaking the information of any element x_ij. t is the pooling-window dimension and s is the stride, where s ≤ t < m (usually, t = s = 2). The client has sk; the server has pk but cannot access sk.

The max-pooling (mp) operation outputs the maximum value of each pooling window (of size t × t). For succinctness, we denote the pooling window in the form of a vector, i.e., max({Jx_1K, Jx_2K, ..., Jx_mK}), m = t × t. In existing solutions, e.g., [16, 21, 30], the mp operation is notoriously inefficient.

In this section, we show how to efficiently compute the mp layer. We first introduce a new protocol to compute max(Jx_iK, Jx_jK), which is the fundamental building block for computing the mp layer. Then, we leverage the Jx_iK ⊖ Jx_jK inherent in the max(Jx_iK, Jx_jK) protocol to reduce homomorphic computations. Lastly, we present a method to absorb the relu computation into the mp layer.

3.2.1 Compute max(Jx_iK, Jx_jK)

The max(Jx_iK, Jx_jK) is the fundamental computation for the max-pooling operation. We construct a lightweight protocol to compute it. The basic idea is first to convert it to a comparison between a ciphertext and 0 (i.e., max(JxK, 0)), then recover the result from the client's response, similarly to the secure relu computation protocol. We summarize the max(Jx_iK, Jx_jK) computation protocol in Fig. 4.

For each pooling window, the max-pooling operation needs m − 1 comparisons and calls the max(Jx_iK, Jx_jK) protocol in ⌈log₂(m)⌉ rounds. As in the relu protocol (see Fig. 3), we adopt the random shuffling method to avoid leaking x_i − x_j ?= 0. First, the server randomly maps the elements of each pooling window into pairs; then, the server shuffles all the pairs from all the pooling windows (each pair as a unit); third, the server executes the secure max(Jx_iK, Jx_jK) computation protocol to get the maximum of each pair.

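Below is a sketch of the resulting tournament for a single pooling window, again reusing the toy Paillier helpers and to_signed from the earlier sketches. Our secure_max_pair condenses the Fig. 4 round trip, based on max(x_i, x_j) = max(x_i − x_j, 0) + x_j, into one function; the random pairing is simulated, but the cross-window shuffling and dummy handling of the full protocol are not.

```python
# Tournament max over one encrypted pooling window (cf. Fig. 4).
import random

def secure_max_pair(cxi, cxj, pk, sk):
    n, _ = pk
    # server: JyK_s := (Jx_iK (-) Jx_jK) (x) tau, with a positive tau
    cd = hadd(cxi, hmul(cxj, -1, pk), pk)
    tau = random.randrange(1, 1000)
    cy_s = hmul(cd, tau, pk)
    # client: y_c := max(y_s, 0), re-encrypted
    y_s = to_signed(hdec(cy_s, pk, sk), n)
    cy_c = henc(max(y_s, 0) % n, pk)
    # server: Jmax(x_i, x_j)K := Jx_jK (+) Jy_cK (x) tau^-1
    return hadd(cxj, hmul(cy_c, pow(tau, -1, n), pk), pk)

def secure_window_max(cts, pk, sk):
    # m - 1 pairwise maxima over ceil(log2(m)) rounds
    cts = list(cts)
    while len(cts) > 1:
        random.shuffle(cts)
        nxt = [secure_max_pair(cts[i], cts[i + 1], pk, sk)
               for i in range(0, len(cts) - 1, 2)]
        if len(cts) % 2:
            nxt.append(cts[-1])
        cts = nxt
    return cts[0]

pk, sk = keygen()
window = [-3, 9, 4, 7]                       # one 2x2 pooling window
cts = [henc(v % pk[0], pk) for v in window]
cmax = secure_window_max(cts, pk, sk)
assert to_signed(hdec(cmax, pk, sk), pk[0]) == 9
```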
    Server ((Jx_iK, Jx_jK), pk):
      1  τ ←$ Z+_n  s.t. gcd(τ, n) = 1
      2  JyK_s := (Jx_iK ⊖ Jx_jK) ⊗ τ
      3  send JyK_s to the client
    Client (JyK_s, pk, sk):
      4  y_s := HDec(JyK_s, sk)
      5  y_c := y_s if y_s > 0 else 0
      6  JyK_c := HEnc(y_c, pk)
      7  send JyK_c to the server
    Server (JyK_c, pk):
      8  JyK_s := Jx_jK ⊕ JyK_c ⊗ τ^{-1}

Figure 4: Secure max(Jx_iK, Jx_jK) computation protocol. Jx_iK and Jx_jK are randomly selected from a pooling window, such that we can sample the blind factor τ from Z+_n.

3.2.2 Jx_iK ⊖ Jx_jK Benefits Computation

The secure max(Jx_iK, Jx_jK) computation protocol contains a subtraction between two ciphertexts (i.e., Jx_iK ⊖ Jx_jK); we can exploit this fact to reduce computations on encrypted data.

Figure 5: An illustration of two adjacent conv operations, with stride s = 1 and conv window size w = 3. x_i = conv_i, x_j = conv_j. x^in_uv indicates an element of the conv inputs.

As shown in Fig. 5, Jx_iK = conv_i = Σ_{u=1..3} Σ_{v=1..3} a_uv · Jx^in_uvK and Jx_jK = conv_j = Σ_{u=1..3} Σ_{v=2..4} b_uv · Jx^in_uvK are the results of two adjacent conv operations. Each conv costs w² homomorphic multiplication-and-additions (here, w = 3). Instead of independently computing conv_i and conv_j, we leverage the computation of Jx_iK ⊖ Jx_jK to reduce HE operations as follows,

    Jx_iK ⊖ Jx_jK = Σ_{u=1..3} Σ_{v=1..3} a_uv · Jx^in_uvK ⊖ Σ_{u=1..3} Σ_{v=2..4} b_uv · Jx^in_uvK
                  = Σ_{u=1..3} Σ_{v=2..3} (a_uv − b_uv) · Jx^in_uvK ⊕ Σ_{u=1..3} a_u1 · Jx^in_u1K ⊖ Σ_{u=1..3} b_u4 · Jx^in_u4K    (4)

where a_uv and b_uv denote the weights of the two conv filters, respectively. As shown in Equation (4), for two adjacent conv operations, we only need to apply one homomorphic multiplication-and-addition for each element of the overlapped region (i.e., by (a_uv − b_uv) · Jx^in_uvK).

In general, suppose the conv window size is w and the stride is s. Thanks to the Jx_iK ⊖ Jx_jK, we can reduce the number of homomorphic multiplication-and-additions (of two conv operations) from 2w² to w(w − s) + 2ws = w² + ws. The ratio of computation reduction is 1 − (w² + ws)/(2w²) = 0.5 − s/(2w). In most CNNs, s ∈ {1, 2} and w ∈ {3, 5, 7, 11}. Therefore, nearly 50% of homomorphic multiplication-and-additions can be saved when meeting a mp layer.

3.2.3 Compute relu → mp

Usually, a mp layer directly follows a relu layer, i.e., relu → mp [13, 19]. For each pooling window, we can write the computation of relu → mp as follows,

    max(max(x_1, 0), max(x_2, 0), ..., max(x_m, 0))    (5)

Computing Equation (5) costs 2m − 1 comparisons. Clearly, we can transform this equation into the following form,

    max(max(x_1, 0), max(x_2, 0), ..., max(x_m, 0)) = max(max(x_1, x_2, ..., x_m), 0)    (6)

Compared with Equation (5), Equation (6) reduces the number of comparisons from 2m − 1 to m. From the perspective of communication overhead, by this transformation, we can compute the max-pooling layer for free.

3.2.4 Summary

Computing the mp layer often leads to expensive communication overhead in existing methods [16, 21]. A commonly-seen trick is to reduce the use of max-pooling layers, such as using mean-pooling layers instead. However, this approach may result in a risk of accuracy degradation, as it breaches the original design.

In this section, we first presented a fast method for computing the max-pooling layer, of which the communication overhead is equivalent to computing a relu layer. Then, we exploited the stable design pattern in CNNs, i.e., conv → relu → mp, to further reduce the computational cost, and absorbed the computation of relu into mp (so that, in effect, we compute the mp layer for free). It means that, in the Popcorn framework, the mp layer can be a factor that improves efficiency, instead of being a heavy burden as usual.

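Both claims in this summary are easy to sanity-check on plaintext values. The NumPy sketch below (variable names are ours) verifies the relu → mp identity of Eq. (6) and that the fused form of Eq. (4) equals conv_i − conv_j while touching each overlapped input only once, giving w² + ws multiply-adds per window pair instead of 2w².

```python
# Plaintext checks of Eq. (6) and of the overlap reuse in Eq. (4).
import numpy as np

rng = np.random.default_rng(0)

# Eq. (6): relu then max equals max then relu on one pooling window
vals = rng.integers(-9, 9, 4).tolist()
assert max(max(v, 0) for v in vals) == max(max(vals), 0)

# Eq. (4): fused difference of two horizontally adjacent conv windows
w, s = 3, 1
a = rng.integers(-5, 5, (w, w))          # filter of conv_i
b = rng.integers(-5, 5, (w, w))          # filter of conv_j
x = rng.integers(-5, 5, (w, w + s))      # inputs covered by both windows
conv_i = (a * x[:, :w]).sum()
conv_j = (b * x[:, s:s + w]).sum()
fused = ((a[:, s:] - b[:, :w - s]) * x[:, s:w]).sum() \
        + (a[:, :s] * x[:, :s]).sum() - (b[:, w - s:] * x[:, w:]).sum()
assert fused == conv_i - conv_j
print("multiply-adds per pair:", w * w + w * s, "vs direct:", 2 * w * w)
```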
4 Popcorn: Fast Linear Computation

In the Popcorn framework, the input of each layer is element-wisely encrypted (the client data is the first layer input). Based on this fact, we can exploit neural network compression technologies to reduce HE computations in linear layers (i.e., the conv and fc layers), speeding up the oblivious inference. In this section, we introduce fast linear computation methods which rely on pruned-and-quantized networks and binarized neural networks, respectively. Usually, the former preserves accuracy well; the latter leads to higher efficiency.

4.1 Pruning-Then-Quantization

Suppose there are two filters (represented in the form of vectors) a = [a_0, a_1, ..., a_{n−1}, a_n] and b = [b_0, b_1, ..., b_{n−1}, b_n], and an encrypted input JxK = [Jx_0K, Jx_1K, ..., Jx_{n−1}K, Jx_nK]. Before performing the conv operations, we can adopt network pruning techniques to discover and remove unimportant weights of a and b (e.g., let a_0 = 0, a_{n−1} = 0, b_2 = 0), and employ network quantization methods to force multiple weights of a and b to share the same value (e.g., a_1 = b_1). We illustrate how to use the pruned-and-quantized conv filters to reduce homomorphic computations as follows,

    [ 0    a_1   a_2   ...   0        a_n ]   [ Jx_0K    ]
    [ b_0  b_1   0     ...   b_{n−1}  b_n ] · [ Jx_1K    ]
                                              [ Jx_2K    ]
                                              [ ...      ]
                                              [ Jx_{n−1}K]
                                              [ Jx_nK    ]

      [ 0 + a_1·Jx_1K + a_2·Jx_2K + ... + 0 + a_n·Jx_nK           ]
    = [ b_0·Jx_0K + b_1·Jx_1K + 0 + ... + b_{n−1}·Jx_{n−1}K + b_n·Jx_nK ]    (7)

First, we can simply skip any computations related to the removed weights (which are permanently set to 0). Then, we can find out which weights connect to the same ciphertext and share the same value, thus reusing the intermediate results. For example, if a_i = b_i (where 0 ≤ i ≤ n), we can reuse the result of a_i ⊗ Jx_iK when computing b_i ⊗ Jx_iK. Obviously, the inference acceleration relies on the number of removed weights (by pruning) and reused intermediate results (by quantization). This observation straightforwardly applies to the fc layer.

4.1.1 Network Compression For Popcorn

A number of network compression methods have been proposed for different purposes, such as minimizing model size [12] and reducing energy consumption [36]. However, all the existing methods are designed for plaintext inputs; there is a lack of investigation of network compression for ciphertext inputs. In this section, we analyze the weight pruning and quantization methods that fit the Popcorn framework.

Weight Pruning. The aim of weight pruning is to further reduce the number of homomorphic computations, by removing more weights. For a conv layer, each filter weight connects to w²/s² ciphertexts, where w is the input dimension of the layer and s is the stride-size of the conv operations. So a single weight in different conv layers may connect to a different number of ciphertexts. This means that removing the same number of weights from different conv layers can result in a different reduction of homomorphic operations, depending on the input dimension w of each conv layer. The fc layer is a vector-matrix multiplication; each weight connects to one ciphertext. Therefore, for conv layers, it is suggested to first remove weights in a layer that has a larger input dimension.

Weight Quantization. The purpose of weight quantization is to reuse the intermediate results computed between weights and encrypted inputs, as much as possible. Hence, the quantization priority of each layer depends on the number of weights that a ciphertext connects to, i.e., the larger the number, the higher the priority of the layer. The weights of a high-priority layer should first be quantized into a lower-bit representation. For a conv layer, each ciphertext connects to (c_o · f_w²)/s² weights, where c_o is the number of filters and f_w is the filter dimension. Assume the weights are represented in m bits; the reuse ratio is then ≥ (c_o · f_w²)/(s² · 2^m). For a fc layer, each ciphertext connects to c_o weights, where we re-define c_o as the fc layer output dimension; the reuse ratio is ≥ c_o/2^m. With a quantized network, we can limit the number of homomorphic multiplications to O(n · 2^m), where we abuse n to denote the number of ciphertexts. When c_o and f_w are large while m and s are small, the reduction in homomorphic computations is significant.

4.1.2 Summary

Based on the analysis above, we suggest a layer-by-layer pruning-then-quantization approach to speed up the linear layer computations in the Popcorn framework. This approach starts by pruning the layers with the largest input dimensions or with the largest number of multiplication-and-accumulation operations. It can be implemented through the algorithm introduced by [36]. After the pruning process, we can quantize the remaining non-zero weights into low bits, beginning with the layers in which each ciphertext connects to the most non-zero weights. For a specific layer, we can adopt the codebook-based method introduced in [12]. Compared with other methods, this method can obtain a lower-bit representation while ensuring the same accuracy.

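As a sketch of the savings described above, the following toy routine (reusing the Paillier helpers and to_signed from the earlier sketches; names and counts are illustrative, not Popcorn's implementation) skips pruned zero weights and memoizes each (ciphertext, weight-value) product, so that quantized weights sharing a value pay for one homomorphic multiplication only.

```python
# Toy measurement of the pruning + quantization savings of Section 4.1.
def dot_pruned_quantized(filters, cxs, pk):
    cache, ops, outs = {}, 0, []
    for f in filters:                        # one flattened filter each
        acc = henc(0, pk)
        for i, wgt in enumerate(f):
            if wgt == 0:
                continue                     # pruned weight: no HE work
            if (i, wgt) not in cache:        # pay each product only once
                cache[(i, wgt)] = hmul(cxs[i], wgt, pk)
                ops += 1
            acc = hadd(acc, cache[(i, wgt)], pk)
        outs.append(acc)
    return outs, ops

pk, sk = keygen()
x = [3, -1, 4, 1, 5]
cxs = [henc(v % pk[0], pk) for v in x]
filters = [[0, 2, -3, 0, 2],                 # pruned and quantized weights
           [1, 2, -3, 0, 0]]
outs, ops = dot_pruned_quantized(filters, cxs, pk)
for f, c in zip(filters, outs):
    assert to_signed(hdec(c, pk, sk), pk[0]) == sum(w * v for w, v in zip(f, x))
print("HE multiplications:", ops)            # 4 instead of 6 here
```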
scribe the conv (as well as the f c) computation as follows,            to 2048 bits. We execute the comparison and benchmarks
                                                                        on the MNIST, CIFAR-10, and the ILSVRC2012 ImageNet
                                                                        dataset. To the best of our knowledge, this is the first report for
                                            Jx0 K
                                                                        benchmarking oblivious inference on the ImageNet dataset.
                                            Jx1 K
               +1       −1   ···   −1   +1
                                             ···
               −1       +1   ···   −1   −1                              5.2    General Comparison of Prior Arts
                                           Jxn−1 K           (8)
                                            Jxn K                       Existing frameworks are usually efficiency-oriented, with dif-
             +Jx0 K + Jx1 K + · · · + Jxn−1 K − Jxn K                   ferent compromises in terms of privacy and computational
           =
             −Jx0 K − Jx1 K + · · · + Jxn−1 K − Jxn K                   guarantees. For clarification, we describe prior arts and our
                                                                        Popcorn framework according to the following guarantees,
   It is clear that the computations only rely on efficient homo-
morphic additions, avoiding expensive multiplications. There-              • P1 (data privacy). The framework hides the client data
fore, the execution of conv and f c layers can be very efficient.            from the server, except for data dimensions.
In the sub-section 4.2.1, we introduce the "+1 Trick" to fur-
ther improve efficiency.                                                   • P2 (network privacy). The framework hides the server’s
4.2.1   +1 Trick

We exploit the fact that the weights are binarized, a ∈ {+1, −1}^m, to roughly halve the computation cost, through a +1 trick as follows,

$$
ax = (1 + a)x - \mathbf{1}x = \frac{1 + a}{2}\,x + \frac{1 + a}{2}\,x - \mathbf{1}x
\tag{9}
$$

where 1 indicates a vector in which each element is 1. To avoid multiplications, we split (1 + a) ∈ {0, +2}^m into two identical pieces (1 + a)/2 ∈ {0, +1}^m. For the same layer, we only need to compute 1x once, and the cost can be amortized over all the conv filters. For the case where there are more +1 than −1 entries, we can adapt Equation (9) to ax = −(1 − a)/2·x − (1 − a)/2·x + 1x. Therefore, we can always halve the homomorphic computations by the +1 trick. Different from [32], we do not need to binarize the input x, which helps preserve accuracy.
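Building on the hypothetical helpers from the previous sketch, the snippet below illustrates Equation (9): only the positions with a_i = +1 cost an addition, the doubling costs one further addition, and Enc(1x) is computed once per layer and shared by all filters (for layers where +1 dominates, the mirrored (1 − a)/2 form would be used instead):

    def ones_dot(n, cts):
        # Enc(1x): one pass over the layer input, amortized over all conv filters.
        acc = enc(n, 0)
        for c in cts:
            acc = he_add(n, acc, c)
        return acc

    def bin_dot_plus1(n, weights, cts, ct_ones):
        # ax = (1+a)/2 x + (1+a)/2 x - 1x, with (1+a)/2 in {0, 1}^m.
        acc = enc(n, 0)
        for a, c in zip(weights, cts):
            if a == +1:                   # b_i = (1 + a_i) / 2
                acc = he_add(n, acc, c)
        acc = he_add(n, acc, acc)         # b.x + b.x = (1 + a).x
        return he_add(n, acc, he_neg(n, ct_ones))   # subtract 1x

    ct_ones = ones_dot(n, cts)            # computed once for the whole layer
    assert dec(sk, bin_dot_plus1(n, w, cts, ct_ones)) == 2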
5     Comparison and New Benchmark

In this section, we first conduct a general comparison between Popcorn and prior arts regarding privacy guarantees and utility (Section 5.2); then we test Popcorn and compare it with previous arts in terms of efficiency; lastly, we report benchmarks of oblivious inference on the ImageNet dataset, based on state-of-the-art networks.
5.1     Evaluation Settings

We implement the Popcorn framework based on OPHELib [25], which provides an implementation of the Paillier encryption scheme, written in C++. The code is compiled using GCC with the '-O3' optimization, and OpenMP is activated for parallel acceleration. The tests are performed on (Ubuntu 18.04 LTS) machines with an Intel(R) Xeon(R) CPU E5 and 32 GB of RAM. The Paillier key size is always set [...]. To the best of our knowledge, this is the first report for benchmarking oblivious inference on the ImageNet dataset.

5.2    General Comparison of Prior Arts

Existing frameworks are usually efficiency-oriented, with different compromises in terms of privacy and computational guarantees. For clarity, we describe prior arts and our Popcorn framework according to the following guarantees:

   • P1 (data privacy). The framework hides the client data from the server, except for data dimensions.

   • P2 (network privacy). The framework hides the server's network weights (as well as the output values of each hidden layer) from the client.

   • P3 (network privacy). The framework hides the server's network (including network weights, architecture and the output values of each hidden layer) from the client, except for (1) the number of layers; (2) the number of activations of each layer; (3) classification results.

   • P4 (network privacy). The framework hides the server's network from the client, except for classification results.

   • V1. The framework supports any type of CNNs.

   • V2. The framework does not rely on any external party.

    Framework     P1    P2    P3    P4    V1    V2
    CryptoNets    ✓     ✓     ✓     ✓     −     ✓
    SecureML      ✓     ✓     −     −     −     −
    MiniONN       ✓     ✓     −     −     ✓     ✓
    Gazelle       ✓     ✓     ✓     −     ✓     ✓
    XONN          ✓     ✓     −     −     −     ✓
    Popcorn       ✓     ✓     ✓     −     ✓     ✓

    Table 1: The comparison of different frameworks (✓ = guaranteed, − = not guaranteed).

   As shown in Table 1, all the frameworks meet the P1 criterion, i.e., client data privacy is well preserved. However, network information is leaked to different extents by different frameworks. CryptoNets [11] preserves the most network privacy (P2, P3, P4, assuming the encryption parameters are large enough), while SecureML [23], MiniONN [21] and XONN [30] leak the most network information (i.e., P3 and P4 are leaked). Gazelle [16] and Popcorn achieve a compromise (P2 and P3 are protected). Note that, to improve efficiency, Gazelle introduced a padded patch, but pays a price in privacy: the conv filter size is disclosed to the client. So far, it is very challenging to completely protect network privacy. For example, in CryptoNets, the activation method and network size can be inferred from the encryption parameters;
in Gazelle and Popcorn, the size of each network layer can be deduced from the number of neurons. Though multiparty-computation-based frameworks do not completely protect network privacy, we believe that it is important to hide more information, rather than to ignore the privacy breach or directly disclose the network information to the client.

   The versatility criteria, i.e., the support for various CNNs (V1) and the requirements on server settings (V2), directly impact real-world deployments. CryptoNets and SecureML only support linearized CNNs, which require substituting non-linear activation functions with polynomials; this approach may significantly reduce accuracy, especially for large neural networks. SecureML needs two non-colluding servers, which may narrow the applicable scenarios. XONN is applicable only when the weights and activations of a network are binarized. Gazelle, MiniONN, and Popcorn satisfy both the V1 and V2 criteria, as they do not need to adjust the original design of a network. It is worth mentioning that, with XONN, the client undertakes most of the computational overhead, as it is responsible for executing the compiled network circuits to obtain the classification results; this may become a heavy burden when the network grows large.

5.3    Efficiency Comparison with Prior Arts

To compare with prior arts, we report the runtime (RT), communication bandwidth (COM) and accuracy (acc%) on MNIST and CIFAR-10 classification tasks. Since Popcorn can leverage compressed and binarized networks to accelerate oblivious inference, we implement two versions of the framework: Popcorn-b, which supports binarized neural networks (Section 4.1), and Popcorn-c, which supports compressed neural networks (Section 4.2).

   For network binarization, we follow the XNOR-Net method [29]: we only binarize the network weights and leave the activation values as they are. By this, we preserve the information carried by the activations, benefiting accuracy [28]. For network compression, we adopt the layer-by-layer pruning-then-quantization approach suggested in Section 4.1.2. First, we adopt the layer-by-layer weight pruning method introduced by [36] to remove more weights in the lower layers. Then, we employ the deep-compression method [12] to quantize the remaining non-zero weights, using fewer bits to represent the weights of fc layers and of conv layers that contain more filters.

   As mentioned above, we build the network binarization and pruning-then-quantization steps on existing arts (i.e., [12, 29, 36]), thus avoiding tedious accuracy evaluation and focusing on efficiency testing and comparison.

5.3.1   Evaluation on MNIST

MNIST is an entry-level image classification dataset. It consists of grayscale images of handwritten digits (i.e., [0,9]); the dimension of each image is 28 × 28 × 1. We perform the experiments with three classical neural networks frequently adopted by previous arts, as shown in Table 2.

    Network    Source                                    Description
    NM1        CryptoNets, XONN, Gazelle, MiniONN        3 fc
    NM2        CryptoNets, XONN, DeepSecure, Gazelle     1 conv, 2 fc
    NM3        XONN, Gazelle, MiniONN                    2 conv, 2 mp, 2 fc

    Table 2: Network architectures for the MNIST dataset. NM stands for Network on MNIST. NM1 is an MLP network. NM2 and NM3 are two small CNNs. In CryptoNets, the relu activation layer is replaced by the square (f(x) = x²) activation layer. For detailed architecture information, please refer to the papers listed in the "Source" column.

   The "P/Q" column in Table 3 gives the pruning ratio (i.e., the number of removed weights divided by the total number of weights) and the average number of bits used to represent the remaining weights. As shown in Table 3, we can remove more than 90% of the weights and quantize the remaining weights to ≤ 6 bits without losing accuracy. Sometimes, Popcorn-c even results in higher accuracy.

            P/Q         Framework     RT       COM      acc%
    NM1     0.92/6      MiniONN       1.04     15.8     97.6
                        Gazelle       0.09     0.5      97.6
                        XONN          0.13     4.29     97.6
                        Popcorn-b     0.30     0.78     97.6
                        Popcorn-c     0.51     0.78     98.4
    NM2     0.91/5.1    CryptoNets    297.5    272.2    98.95
                        DeepSecure    9.67     791.0    98.95
                        MiniONN       1.28     47.6     98.95
                        Gazelle       0.29     8.0      99.0
                        XONN          0.16     38.28    98.64
                        Popcorn-b     0.60     1.90     98.61
                        Popcorn-c     0.67     0.84     98.91
    NM3     0.90/5.6    MiniONN       9.32     657.5    99.0
                        Gazelle       1.16     70.0     99.0
                        XONN          0.15     62.77    99.0
                        Popcorn-b     7.14     19.50    99.0
                        Popcorn-c     5.71     10.50    98.8

    Table 3: Comparison on MNIST. RT is the runtime in seconds; COM is the communication bandwidth in megabytes.

   For the binarized model evaluation, we did not adopt the scaling-factor method introduced by XONN [30]. Instead, for all the networks, we first double the output size of the first layer, leaving the other layers unchanged; then we follow the XNOR-Net training method [29] to obtain the binarized models.
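For readers who want a concrete picture of the model-preparation side, the sketch below is a minimal, simplified rendering of the two pipelines described in Section 5.3. It is built on our own assumptions: plain magnitude pruning and a small k-means codebook stand in for the full recipes of [36] and [12], and the sign() binarization omits the training details of [29].

    import numpy as np

    def prune_layer(w, ratio):
        # Magnitude pruning: zero out the `ratio` fraction of smallest-|w| weights.
        return np.where(np.abs(w) < np.quantile(np.abs(w), ratio), 0.0, w)

    def quantize_layer(w, bits, iters=20):
        # Deep-compression-style weight sharing: cluster the surviving weights
        # into 2**bits shared values with a few rounds of k-means.
        nz = w[w != 0]
        centers = np.linspace(nz.min(), nz.max(), 2 ** bits)
        for _ in range(iters):
            idx = np.argmin(np.abs(nz[:, None] - centers[None, :]), axis=1)
            for k in range(centers.size):
                if np.any(idx == k):
                    centers[k] = nz[idx == k].mean()
        wq = w.copy()
        wq[w != 0] = centers[idx]
        return wq

    def binarize_layer(w):
        # Weights-only binarization: activations stay real-valued.
        return np.where(w >= 0, 1.0, -1.0)

    rng = np.random.default_rng(0)
    w = rng.normal(size=(64, 128))                       # one toy layer
    w_c = quantize_layer(prune_layer(w, 0.90), bits=6)   # Popcorn-c path
    w_b = binarize_layer(w)                              # Popcorn-b path
    print(f"P/Q = {np.mean(w_c == 0):.2f}/6, {np.unique(w_c).size} shared values")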
   There are two main reasons for this implementation choice. First, we do not quantize the activation values, so more information is carried through, preserving accuracy. Second, the networks for the MNIST classification tasks are very small (e.g., small input dimensions, 28 × 28 × 1); instead of using a complicated method to discover a thin network architecture, we can empirically and efficiently try different settings of the neural networks.

   Regarding runtime, XONN and Gazelle are in the leading position, with Popcorn following. For communication overhead, the Popcorn framework shows a significant advantage. According to the results, all the frameworks achieve equivalent accuracy (from 97.6% to 99.0%, see Table 3). XONN and CryptoNets require modifying the original network design to fit the constraints of the adopted crypto primitives. CryptoNets shows that, by substituting the non-linear activation function (e.g., relu) with a square function, one can still obtain decent accuracy; XONN demonstrates that it is possible to improve accuracy by scaling up the network architecture, remedying the accuracy loss caused by binarizing the network weights and activations. However, a natural question arises: can such tricks be applied to larger datasets (e.g., larger input dimensions and network sizes)?

5.3.2   Evaluation on CIFAR-10

CIFAR-10 is another popular image classification dataset, consisting of colorful images categorized into 10 classes such as bird, truck and cat. The dimension of each image is 32 × 32 × 3. Unlike MNIST, CIFAR-10 classification tasks require a sophisticated neural network design, and the tricks applied to MNIST for preserving accuracy may not work on CIFAR-10. For example, Li et al. [6] investigated different approximation methods to replace non-linear activation functions, but none of them could obtain decent accuracy. Straightforwardly increasing the architecture size also struggles to improve accuracy (see Table 5). We conduct the experiments with two networks adopted by prior arts and a VGG variant for CIFAR-10 [34, 37], summarized in Table 4.

    Network    Source                    Description
    NC1        XONN, Gazelle, MiniONN    9 conv, 3 mp, 1 fc
    NC2        XONN                      10 conv, 3 mp, 1 fc
    VGG-c      [37]                      6 conv, 5 mp, 1 fc

    Table 4: Networks on CIFAR-10. NC stands for Network on CIFAR-10. VGG-c is a customized network for CIFAR-10 [37]. For detailed architecture information, please refer to the papers listed in the "Source" column.

   For network binarization, we also use the XNOR-Net training method. For NC1 and NC2, we double the number of conv filters of the first 3 layers to obtain accuracy equivalent to their full-precision versions. For VGG-c, we retain the original network architecture. Note that VGG-c has many more filters than NC1 and NC2. In Popcorn-b, we do not binarize activations. As shown in Table 5, the network NC1 leads to a 10% decrease in accuracy. Therefore, simply scaling up the size of a fully binarized network may not deliver decent accuracy as expected; instead, it is advisable to use state-of-the-art networks for classification tasks.

             P/Q         Framework    RT      COM      acc%
    NC1      0.76/6.7    MiniONN      544.0   9272.0   81.61
                         Gazelle      15.48   1236.0   81.61
                         XONN         5.79    2599.0   81.61
                         Popcorn-b    59.7    125.5    81.66
                         Popcorn-c    84.5    78.4     81.74
    NC2      0.77/7.1    XONN         123.9   42362    88.0
                         Popcorn-b    268.2   704.7    88.0
                         Popcorn-c    528.8   207.8    87.6
    VGG-c    0.81/6.4    Popcorn-b    513.3   449.8    91.0
                         Popcorn-c    918.1   449.8    91.7

    Table 5: Comparison on CIFAR-10. RT is the runtime in seconds; COM is the communication bandwidth in megabytes.

   As shown in Table 5, at the same accuracy level Popcorn requires much less communication. For example, to reach an accuracy of 88.0%, the communication overhead of XONN is around 41 GB, while Popcorn-b only needs 704.7 MB, which is 60× smaller. In addition, with the state-of-the-art binarized VGG-c we reach an accuracy of 91% at a communication bandwidth of only 450 MB. This observation again shows the importance of using state-of-the-art networks.

5.4    Benchmarks on ImageNet

With the promising results on MNIST and CIFAR-10 (especially for the communication overhead), we turn to a commercial-level dataset, the ImageNet ILSVRC2012 dataset. Neural networks for classification on this dataset often have large input dimensions (i.e., 224 × 224 × 3), much larger than the input dimensions adopted for MNIST (28 × 28 × 1) and CIFAR-10 (32 × 32 × 3). To the best of our knowledge, this is the first report benchmarking oblivious inference on ImageNet classification tasks.

   To evaluate the Popcorn framework, we adopt AlexNet [19] and VGG [34], two milestone networks for the ImageNet classification tasks. Compared with the networks used for MNIST and CIFAR-10, the most significant difference is that the dimensions of the input and of each hidden layer are much larger. For ease of future comparison, we describe the network architectures in Table 6 and Table 7, respectively.

   For network compression and binarization, we use the same methods applied to the MNIST and CIFAR-10 classification tasks.
Thanks to [36], a pruned AlexNet model already exists, pruned starting from the lower layers, so we directly use it as the basis and quantize the remaining non-zero weights with the deep-compression method [12]. The per-layer compression results for AlexNet and VGG are summarized in Table 6 and Table 7.

    Layer    Input           Kernel    pw    s    P/Q
    conv1    224, 224, 3     64        11    4    0.83/8
    conv2    55, 55, 64      192       5     1    0.92/7
    mp       55, 55, 192     −         2     2    −
    conv3    27, 27, 192     384       3     1    0.91/7
    mp       27, 27, 384     −         2     2    −
    conv4    13, 13, 384     256       3     1    0.81/7
    conv5    13, 13, 256     256       3     1    0.74/7
    mp       13, 13, 256     −         2     2    −
    fc1      9216            4096      −     −    0.92/6
    fc2      4096            4096      −     −    0.91/6
    fc3      4096            1000      −     −    0.78/6

    Table 6: AlexNet [19]: pw stands for the dimension of the (convolution or max-pooling) window; s is the stride. P/Q records the pruning ratio and the number of bits used to represent the weights of each layer.

    Layer    Input            Kernel    pw    s    P/Q
    conv1    224, 224, 3      64        3     1    0.53/8
    mp       224, 224, 64     −         2     2    −
    conv2    112, 112, 64     128       3     1    0.71/8
    mp       112, 112, 128    −         2     2    −
    conv3    56, 56, 128      256       3     1    0.67/7
    conv4    56, 56, 256      256       3     1    0.65/7
    mp       56, 56, 256      −         2     2    −
    conv5    28, 28, 256      512       3     1    0.69/7
    conv6    28, 28, 512      512       3     1    0.73/7
    mp       28, 28, 512      −         2     2    −
    conv7    14, 14, 512      512       3     1    0.78/6
    conv8    14, 14, 512      512       3     1    0.81/6
    mp       14, 14, 512      −         2     2    −
    fc1      25088            4096      −     −    0.96/6
    fc2      4096             4096      −     −    0.96/6
    fc3      4096             1000      −     −    0.79/6

    Table 7: VGG [34]. For fc layers, the output dimension is recorded in the "Kernel" column.

   We report the runtime and communication bandwidth in Table 8. Compared with MNIST and CIFAR-10, the runtime increases significantly, as the network size is orders of magnitude larger. The communication bandwidth is still in a reasonable range, even lower than the bandwidth required by prior arts to run the CIFAR-10 classification tasks. It is worth stressing that the bandwidth complexity of Popcorn is O(n), where n is the number of activations, while the bandwidth complexity of XONN and Gazelle depends on the network architecture. For example, using XONN, the whole network must be compiled into circuits; using Gazelle, large input dimensions require high-degree polynomials (which also clearly affects efficiency). When we tried to evaluate AlexNet using Gazelle on the same machine on which Popcorn runs, an out-of-memory error always occurred (using the implementation provided by [16]). By estimation, executing AlexNet with Gazelle would require at least 50 GB of communication, and XONN would be even worse. Therefore, Popcorn has a significant advantage for evaluating state-of-the-art networks (e.g., AlexNet and VGG).

               Framework    RT (m)    COM      acc%
    AlexNet    Popcorn-b    9.94      560.6    78.1
               Popcorn-c    29.9      560.6    79.3
    VGG        Popcorn-b    115.4     7391.5   87.2
               Popcorn-c    568.1     7391.5   90.1

    Table 8: Benchmarks on ImageNet: RT (m) is the runtime in minutes; COM is the communication bandwidth in megabytes. (Note: this version corrects a significant typo in our previous version, which mistakenly labeled the communication cost COM(g), i.e., gigabytes. The unit is megabytes, not gigabytes.)
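The O(n) behavior can be sanity-checked with back-of-envelope arithmetic (our own estimate, not a measurement from the evaluation): every activation that crosses the wire is one fixed-size Paillier ciphertext, so traffic is simply the activation count times the ciphertext size. The 2048-bit modulus and the example feature-map shapes below are assumptions for illustration only.

    MOD_BITS = 2048
    CIPHERTEXT_BYTES = 2 * MOD_BITS // 8      # a Paillier ciphertext lives mod n^2

    def traffic_mb(activation_counts):
        # Linear in the total number of activations; ignores protocol overheads.
        return sum(activation_counts) * CIPHERTEXT_BYTES / 2**20

    # e.g., two early AlexNet feature maps (cf. Table 6): 55*55*64 and 27*27*192
    print(f"{traffic_mb([55 * 55 * 64, 27 * 27 * 192]):.0f} MB")   # ~163 MB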
6    Related Work

Barni et al. [1] initiated one of the earliest attempts at oblivious inference, using homomorphic encryption (HE). Since HE is not compatible with non-linear algebraic operations (e.g., comparison, division), they introduced a multiplicative obfuscation to hide intermediate computation results; however, this method leaks information about the neural network weights [26]. Gilad-Bachrach et al. [11] replaced the non-linear activation function (e.g., relu, i.e., max(0, x) [24]) with a low-degree polynomial (f(x) = x²), making the neural network fully compatible with HE. Several works [4–6] developed different methods to improve the CryptoNets paradigm in terms of efficiency and accuracy; however, compared with other approaches, the results are still not promising.

   Rouhani et al. [31] proposed a garbled-circuits-based framework for oblivious inference, where the server compiles a pre-trained network into circuits and sends them to the client, and the client obtains the prediction results by evaluating the circuits. However, performing multiplication in GC has quadratic computation and communication complexity in the bit-length of the input operands, which raises a serious efficiency problem for precise inference (which often needs a high bit-length for the input operands). Riazi et al. [30] leveraged fully binarized neural networks, whose weights and activations are binary (i.e., {+1, −1}), to accelerate GC-based oblivious inference. However, fully binarized neural networks are not stable in accuracy, especially when used for classification tasks on large-scale datasets (e.g., ImageNet [19]).
   Liu et al. [21] combined HE, GC and SPDZ [8] to speed up oblivious inference, showing that a hybrid approach can be promising in both efficiency and accuracy. To balance communication overhead and accuracy, Mishra et al. [22] proposed replacing part of the non-linear activations with low-degree polynomials; the inference computation is similar to [21]. However, the SPDZ-based methods directly reveal the network architecture to the client. Juvekar et al. [16] leveraged SIMD to accelerate the linear computations of the inference by packing multiple messages into one ciphertext, and used GC to compute the relu activation and max-pooling layers. Zhang et al. [38] improved this solution by reducing the permutation operations required when performing dot products over packed ciphertexts.

   Trusted execution environment technology (TEE, e.g., Intel SGX [7]) is also an interesting approach to building oblivious inference frameworks (e.g., [35]). Generally, most TEE-based frameworks are more efficient than cryptography-based solutions [35]. However, this approach requires trusting the hardware vendor and the implementation of the enclave. We leave the discussion of TEE-based solutions out of the scope of this paper.

7   Conclusion

This work presented Popcorn, a concise oblivious neural network inference framework built entirely on the Paillier HE scheme. The framework is easy to implement: one only needs to replace the algebraic operations of existing networks with their corresponding homomorphic operations. We conducted experiments on different datasets (MNIST, CIFAR-10 and ImageNet) and showed Popcorn's significant advantage in communication bandwidth. To the best of our knowledge, this work is the first to report oblivious inference benchmarks on ImageNet-scale classification tasks.

References

 [1] Mauro Barni, Claudio Orlandi, and Alessandro Piva. A privacy-preserving protocol for neural-network-based computation. In Proceedings of the 8th Workshop on Multimedia and Security, pages 146–151, 2006.

 [2] Amos Beimel. Secret-sharing schemes: a survey. In International Conference on Coding and Cryptology, pages 11–46. Springer, 2011.

 [3] Mihir Bellare, Viet Tung Hoang, and Phillip Rogaway. Foundations of garbled circuits. In Proceedings of the 2012 ACM Conference on Computer and Communications Security, pages 784–796, 2012.

 [4] Florian Bourse, Michele Minelli, Matthias Minihold, and Pascal Paillier. Fast homomorphic evaluation of deep discretized neural networks. In Annual International Cryptology Conference, pages 483–512. Springer, 2018.

 [5] Alon Brutzkus, Ran Gilad-Bachrach, and Oren Elisha. Low latency privacy preserving inference. In International Conference on Machine Learning, pages 812–821. PMLR, 2019.

 [6] Edward Chou, Josh Beal, Daniel Levy, Serena Yeung, Albert Haque, and Li Fei-Fei. Faster CryptoNets: Leveraging sparsity for real-world encrypted inference. arXiv preprint arXiv:1811.09953, 2018.

 [7] Victor Costan and Srinivas Devadas. Intel SGX explained. IACR Cryptol. ePrint Arch., 2016(86):1–118, 2016.

 [8] Ivan Damgård, Valerio Pastro, Nigel Smart, and Sarah Zakarias. Multiparty computation from somewhat homomorphic encryption. In Annual Cryptology Conference, pages 643–662. Springer, 2012.

 [9] Andre Esteva, Brett Kuprel, Roberto A. Novoa, Justin Ko, Susan M. Swetter, Helen M. Blau, and Sebastian Thrun. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542(7639):115–118, 2017.

[10] Craig Gentry. A fully homomorphic encryption scheme. PhD thesis, Stanford University, 2009.

[11] Ran Gilad-Bachrach, Nathan Dowlin, Kim Laine, Kristin Lauter, Michael Naehrig, and John Wernsing. CryptoNets: Applying neural networks to encrypted data with high throughput and accuracy. In International Conference on Machine Learning, pages 201–210. PMLR, 2016.

[12] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.

[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[14] Alberto Ibarrondo and Melek Önen. FHE-compatible batch normalization for privacy preserving deep learning. In Data Privacy Management, Cryptocurrencies and Blockchain Technology, pages 389–404. Springer, 2018.

[15] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456. PMLR, 2015.