Secure Bilevel Asynchronous Vertical Federated Learning with Backward Updating

The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21)

Qingsong Zhang¹,², Bin Gu³,⁴, Cheng Deng¹,*, and Heng Huang³,⁵*

¹ School of Electronic Engineering, Xidian University, Xi'an 710071, China; ² JD Tech, Beijing 100176, China
³ JD Finance America Corporation, Mountain View, CA, USA; ⁴ MBZUAI, United Arab Emirates
⁵ Electrical and Computer Engineering, University of Pittsburgh, PA, USA
{qszhang1995, jsgubin, chdeng.xd, henghuanghh}@gmail.com

Abstract

Vertical federated learning (VFL) attracts increasing attention due to the emerging demands of multi-party collaborative modeling and concerns of privacy leakage. In real VFL applications, usually only one or a few parties hold labels, which makes it challenging for all parties to collaboratively learn the model without privacy leakage. Meanwhile, most existing VFL algorithms are trapped in synchronous computations, which leads to inefficiency in their real-world applications. To address these challenging problems, we propose a novel VFL framework integrated with a new backward updating mechanism and a bilevel asynchronous parallel architecture (VFB²), under which three new algorithms, VFB²-SGD, -SVRG, and -SAGA, are proposed. We derive the convergence rates of these three algorithms under both strongly convex and nonconvex conditions. We also prove the security of VFB² under semi-honest threat models. Extensive experiments on benchmark datasets demonstrate that our algorithms are efficient, scalable, and lossless.

Introduction

Federated learning (McMahan et al. 2016; Smith et al. 2017; Kairouz et al. 2019) has emerged as a paradigm for collaborative modeling with privacy preservation. A line of recent works (McMahan et al. 2016; Smith et al. 2017) focuses on horizontal federated learning, where each party holds a subset of samples with complete features. Other works (Gascón et al. 2016; Yang et al. 2019b; Dang et al. 2020) study vertical federated learning (VFL), where each party holds a disjoint subset of features for all samples. In this paper, we focus on VFL, which has attracted much attention due to its wide applications in emerging multi-party collaborative modeling with privacy preservation.

Currently, there are two mainstream approaches to VFL: homomorphic encryption (HE) based methods and methods based on exchanging the raw computational results (ERCR). HE based methods (Hardy et al. 2017; Cheng et al. 2019) leverage HE techniques to encrypt the raw data and then use the encrypted data (ciphertext) to train the model with privacy preservation. However, HE based methods have two major drawbacks. First, the complexity of homomorphic operations on ciphertexts is very high, so HE is extremely time consuming for modeling (Liu, Ng, and Zhang 2015; Liu et al. 2019). Second, HE requires approximations to support nonlinear operations, such as the sigmoid and logarithmic functions, which inevitably costs accuracy for the many machine learning models that use nonlinear functions (Kim et al. 2018; Yang et al. 2019a). The inefficiency and inaccuracy of HE based methods thus dramatically limit their applicability to realistic VFL tasks.

ERCR based methods (Zhang et al. 2018; Hu et al. 2019; Gu et al. 2020b) leverage the labels and the raw intermediate computational results transmitted from the other parties to compute stochastic gradients, and thus use distributed stochastic gradient descent (SGD) methods to train VFL models efficiently. Although ERCR based methods circumvent the aforementioned drawbacks of HE based methods, existing ERCR based methods assume that all parties hold labels, which is usually not the case in real-world VFL tasks. In realistic VFL applications, usually only one or a few parties (denoted as active parties) have the labels, and the other parties (denoted as passive parties) can only provide extra feature data but hold no labels. When these ERCR based methods are applied to the real situation with both active and passive parties, they cannot even guarantee convergence, because only the active parties can compute the gradient of the loss function from the labels while the passive parties cannot; i.e., part of the model parameters is never optimized during training. The crux, therefore, is to design a proper algorithm for real-world VFL tasks in which only one or a few parties hold labels.

Moreover, algorithms using synchronous computation (Gong, Fang, and Guo 2016; Zhang et al. 2018) are inefficient when applied to real-world VFL tasks, especially when the computational resources in the VFL system are unbalanced. It is therefore desirable to design efficient asynchronous algorithms for real-world VFL tasks. Although several works have studied asynchronous VFL algorithms (Hu et al. 2019; Gu et al. 2020b), it remains an open problem to design asynchronous algorithms for real-world VFL tasks with only one or a few parties holding labels.

In this paper, we address these challenging problems by proposing a novel framework (VFB²) integrated with a novel backward updating mechanism (BUM) and a bilevel asynchronous parallel architecture (BAPA).

∗ Corresponding Authors.
Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Specifically, the BUM enables all parties, rather than only the active parties, to collaboratively update the model securely, and it also makes the final model lossless; the BAPA is designed for efficient asynchronous backward updating. Considering the advantages of SGD-type algorithms in optimizing machine learning models, we propose three new SGD-type algorithms, VFB²-SGD, -SVRG, and -SAGA, under this framework. We summarize the contributions of this paper as follows.

• We are the first to propose the novel backward updating mechanism for ERCR based VFL algorithms, which enables all parties, rather than only the parties holding labels, to collaboratively learn the model with privacy preservation and without hampering the accuracy of the final model.

• We design a bilevel asynchronous parallel architecture that enables all parties to asynchronously update the model through backward updating, which is efficient and scalable.

• We propose three new algorithms for VFL, VFB²-SGD, -SVRG, and -SAGA, under VFB². Moreover, we theoretically prove their convergence rates for both strongly convex and nonconvex problems.

Notations. ŵ denotes the inconsistent read of w. w̄ denotes the copy of w used to compute the local stochastic gradient of the loss function at collaborators, which may be stale due to communication delay. ψ(t) denotes the party performing the t-th global iteration. Given a finite set S, |S| denotes its cardinality.

Problem Formulation

Given a training set {x_i, y_i}_{i=1}^n, where y_i ∈ {−1, +1} for a binary classification task or y_i ∈ ℝ for a regression problem and x_i ∈ ℝ^d, we consider a model in the linear form w^⊤x, where w ∈ ℝ^d denotes the model parameters. In VFL, x_i is vertically distributed among q ≥ 2 parties, i.e., x_i = [(x_i)_{G_1}; · · · ; (x_i)_{G_q}], where (x_i)_{G_ℓ} ∈ ℝ^{d_ℓ} is stored on the ℓ-th party and Σ_{ℓ=1}^q d_ℓ = d. Similarly, w = [w_{G_1}; · · · ; w_{G_q}]. In particular, we focus on the following regularized empirical risk minimization problem:

$$\min_{w \in \mathbb{R}^d} f(w) := \frac{1}{n} \sum_{i=1}^{n} \underbrace{\Big( L\big(w^\top x_i, y_i\big) + \lambda \sum_{\ell=1}^{q} g(w_{G_\ell}) \Big)}_{f_i(w)}, \tag{P}$$

where w^⊤x_i = Σ_{ℓ=1}^q w_{G_ℓ}^⊤(x_i)_{G_ℓ}, L denotes the loss function, Σ_{ℓ=1}^q g(w_{G_ℓ}) is the regularization term, and f_i : ℝ^d → ℝ is smooth and possibly nonconvex. Examples of problem (P) include models for binary classification tasks (Conroy and Sajda 2012; Wang et al. 2017) and models for regression tasks (Shen et al. 2013; Wang et al. 2019).
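The partitioned structure of (P) is what the rest of the paper exploits: party ℓ's block of the stochastic gradient depends on the other parties only through the scalar w^⊤x_i. As a worked instance (consistent with step 3 of Algorithm 3 below; the logistic loss here is the one used in the experiments section, not the only choice):

% Block gradient of f_i with respect to party \ell's coordinates G_\ell.
% The scalar \vartheta depends on the other parties only through w^\top x_i,
% which is exactly what backward updating transmits.
\nabla_{G_\ell} f_i(w)
  = \underbrace{\frac{\partial L(w^\top x_i, y_i)}{\partial (w^\top x_i)}}_{=:\ \vartheta}
    \,(x_i)_{G_\ell} \;+\; \lambda\,\nabla g(w_{G_\ell}),
\qquad
\vartheta = \frac{-y_i}{1 + e^{\,y_i w^\top x_i}}
\quad\text{for } L(z, y) = \log\!\left(1 + e^{-y z}\right).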
In this paper, we introduce two types of parties: active parties and passive parties, where the former are data providers holding labels while the latter hold no labels. In our problem setting, there are m (1 ≤ m ≤ q) active parties. Each active party can play the role of dominator in model updating by actively launching updates. All parties, including both active and passive parties, play the role of collaborator when passively launching updates. To guarantee model security, only the active parties know the form of the loss function. Moreover, we assume that the labels may be shared with all parties after training; this does not conflict with our premise that only the active parties hold the labels before training. The problem studied in this paper is stated as follows:

Given: vertically partitioned data {x_{G_ℓ}}_{ℓ=1}^q stored on q parties, and labels held only by the active parties.
Learn: a machine learning model M collaboratively learned by both active and passive parties without leaking privacy.
Lossless constraint: the accuracy of M must be comparable to that of a model M′ learned under non-federated learning.

VFB² Framework

In this section, we propose the novel VFB² framework. VFB² is composed of three components, and its system structure is illustrated in Fig. 1a. The details of these components are presented in the following.

The key to designing a proper algorithm for real-world VFL tasks with both active and passive parties is to let the passive parties utilize the label information for model training. This is challenging because directly sharing the labels held by the active parties would leak their privacy before training even starts. To address this problem, we carefully design the BUM.

Backward Updating Mechanism: The key idea of the BUM is to let passive parties use the labels indirectly to compute stochastic gradients, without directly accessing the raw label data. Specifically, the BUM embeds the label y_i into an intermediate value ϑ := ∂L(w^⊤x_i, y_i)/∂(w^⊤x_i). Then ϑ and i are distributed backward to the other parties. Consequently, the passive parties can also compute stochastic gradients and update the model using the received ϑ and i (please refer to Algorithms 2 and 3 for details; a toy sketch of this data flow is given at the end of this subsection). Fig. 1b depicts the case where ϑ is distributed from party 1 to the remaining parties. In this way, all parties, rather than only the active parties, collaboratively learn the model without privacy leakage.

For VFL algorithms with the BUM, dominated updates launched by different active parties are performed in distributed-memory parallel, while collaborative updates within a party are performed in shared-memory parallel. This difference in parallelism fashions makes it necessary to develop a new parallel architecture rather than directly adopting an existing asynchronous parallel architecture for VFL. To tackle this challenge, we carefully design the novel BAPA.
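A minimal single-process sketch of the BUM data flow, assuming the logistic loss of the experiments section and an ℓ₂ regularizer g(w_{G_ℓ}) = ½‖w_{G_ℓ}‖² (all names below are illustrative, not from the authors' code): the active party condenses y_i into the scalar ϑ, and a passive party turns the received (ϑ, i) into its local gradient block without ever seeing y_i.

import numpy as np

def active_party_theta(wx_i, y_i):
    """Dominator side: embed the label y_i into the scalar intermediate value
    theta = dL/d(w^T x_i) for the logistic loss L(z, y) = log(1 + exp(-y*z))."""
    return -y_i / (1.0 + np.exp(y_i * wx_i))

def collaborator_gradient(theta, x_block, w_block, lam):
    """Collaborator side: gradient block theta*(x_i)_l + lam*grad g(w_l),
    computed from the received theta and local data only (no raw labels)."""
    return theta * x_block + lam * w_block  # grad of (lam/2)||w_l||^2 is lam*w_l

# Toy run with q = 2 parties: party 1 (active) sends theta backward to party 2.
rng = np.random.default_rng(0)
x_i, y_i = rng.normal(size=6), 1.0
w = rng.normal(size=6)
blocks = [slice(0, 3), slice(3, 6)]        # vertical feature partition
theta = active_party_theta(w @ x_i, y_i)   # w^T x_i would come from Algorithm 1
grad_2 = collaborator_gradient(theta, x_i[blocks[1]], w[blocks[1]], lam=1e-4)
w[blocks[1]] -= 0.1 * grad_2               # collaborative update (Algorithm 3, step 4)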

[Figure 1: (a) System structure of the VFB² framework: active parties 1, ..., m and passive parties m+1, ..., q each update asynchronously on their local data D_ℓ, while a coordinator securely aggregates Σ_{ℓ=1}^q w_{G_ℓ}^⊤(x_i)_{G_ℓ}. (b) Illustration of the BUM and BAPA, where k is the number of threads.]

Bilevel Asynchronous Parallel Architecture: The BAPA includes two levels of parallel architecture, where the upper level is the inter-party parallelism and the lower one is the intra-party parallelism. More specifically, the inter-party parallelism is the distributed-memory parallelism across active parties, which enables all active parties to asynchronously launch dominated updates; the intra-party parallelism is the shared-memory parallelism of collaborative updates within each party, which enables multiple threads within a specific party to asynchronously perform collaborative updates. Fig. 1b illustrates the BAPA with m active parties.

To utilize the feature data provided by other parties, a party needs to obtain w^⊤x_i = Σ_{ℓ=1}^q w_{G_ℓ}^⊤(x_i)_{G_ℓ}. Many recent works achieve this by securely aggregating the local intermediate computational results (Hu et al. 2019; Gu et al. 2020a). In this paper, we use the efficient tree-structured communication scheme (Zhang et al. 2018) for secure aggregation, whose security was proved in (Gu et al. 2020a).

Algorithm 1 Safe algorithm of obtaining w^⊤x_i.
Input: {w_{G_ℓ′}}_{ℓ′=1}^q and {(x_i)_{G_ℓ′}}_{ℓ′=1}^q stored at the respective parties, index i.
    Do this in parallel
1: for ℓ′ = 1, · · · , q do
2:    Generate a random number δ_ℓ′ and calculate w_{G_ℓ′}^⊤(x_i)_{G_ℓ′} + δ_ℓ′.
3: end for
4: Obtain ξ₁ = Σ_{ℓ′=1}^q (w_{G_ℓ′}^⊤(x_i)_{G_ℓ′} + δ_ℓ′) through tree structure T₁.
5: Obtain ξ₂ = Σ_{ℓ′=1}^q δ_ℓ′ through a totally different tree structure T₂ ≠ T₁.
Output: w^⊤x_i = ξ₁ − ξ₂

Secure Aggregation Strategy: The details are summarized in Algorithm 1. Specifically, at step 2, w_{G_ℓ}^⊤(x_i)_{G_ℓ} is computed locally on the ℓ-th party to prevent the direct leakage of w_{G_ℓ} and (x_i)_{G_ℓ}. In particular, a random number δ_ℓ is added to mask the value of w_{G_ℓ}^⊤(x_i)_{G_ℓ}, which enhances security during the aggregation process. At steps 4 and 5, ξ₁ and ξ₂ are aggregated through tree structures T₁ and T₂, respectively. Note that T₂ is totally different from T₁, which prevents the random values from being removed under threat model 1 (defined in the Security Analysis section). Finally, the value of w^⊤x_i = Σ_{ℓ=1}^q w_{G_ℓ}^⊤(x_i)_{G_ℓ} is recovered by subtracting Σ_{ℓ=1}^q δ_ℓ from Σ_{ℓ=1}^q (w_{G_ℓ}^⊤(x_i)_{G_ℓ} + δ_ℓ) at the output step. Using this aggregation strategy, (x_i)_{G_ℓ} and w_{G_ℓ} are prevented from leaking during the aggregation.
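To make the masking arithmetic concrete, the following toy sketch mimics Algorithm 1 in a single process standing in for q parties; plain Python sums stand in for the tree-structured reductions T₁ and T₂, whose topology is omitted here:

import numpy as np

def secure_inner_product(w_blocks, x_blocks, rng):
    """Toy version of Algorithm 1: each party masks its partial inner product
    w_l^T (x_i)_l with a private random delta_l before any value is shared."""
    deltas = [rng.normal() for _ in w_blocks]                        # step 2
    masked = [w @ x + d for w, x, d in zip(w_blocks, x_blocks, deltas)]
    xi1 = sum(masked)   # aggregated via tree structure T1 in the paper (step 4)
    xi2 = sum(deltas)   # aggregated via a different tree structure T2 (step 5)
    return xi1 - xi2    # w^T x_i; no party ever revealed its unmasked value

rng = np.random.default_rng(1)
w, x = rng.normal(size=8), rng.normal(size=8)
w_blocks, x_blocks = np.split(w, 4), np.split(x, 4)  # q = 4 parties
assert np.isclose(secure_inner_product(w_blocks, x_blocks, rng), w @ x)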
Secure Bilevel Asynchronous VFL Algorithms with Backward Updating

SGD (Bottou 2010) is a popular method for learning machine learning (ML) models. However, it has a poor convergence rate due to the intrinsic variance of the stochastic gradient. Many variance reduction techniques have therefore been proposed, including SVRG, SAGA, and SPIDER (Johnson and Zhang 2013; Defazio, Bach, and Lacoste-Julien 2014; Wang et al. 2019), along with their applications to other problems (Huang, Chen, and Huang 2019; Huang et al. 2020; Zhang et al. 2020; Dang et al. 2020; Yang et al. 2020a,b; Li et al. 2020; Wei et al. 2019). In this section, we propose three SGD-type algorithms built on SGD, SVRG, and SAGA, which are among the most popular SGD-type methods owing to their appealing performance in practice. We summarize the detailed steps of VFB²-SGD in Algorithms 2 and 3. For VFB²-SVRG and -SAGA, one only needs to replace the update rule with the corresponding one (see the sketch following Algorithm 3).

Algorithm 2 VFB²-SGD for active party ℓ to actively launch dominated updates.
Input: Local data {(x_i)_{G_ℓ}, y_i}_{i=1}^n stored on the ℓ-th party, learning rate γ.
1: Initialize the necessary parameters.
    Keep doing in parallel (distributed-memory parallel for multiple active parties)
2:    Pick an index i randomly from {1, ..., n}.
3:    Compute ŵ^⊤x_i = Σ_{ℓ′=1}^q ŵ_{G_ℓ′}^⊤(x_i)_{G_ℓ′} based on Algorithm 1.
4:    Compute ϑ = ∂L(ŵ^⊤x_i, y_i)/∂(ŵ^⊤x_i).
5:    Send ϑ and index i to the collaborators.
6:    Compute ṽ_ℓ = ∇_{G_ℓ}f_i(ŵ).
7:    Update w_{G_ℓ} ← w_{G_ℓ} − γṽ_ℓ.
    End parallel

Algorithm 3 VFB²-SGD for the ℓ-th party to passively launch collaborative updates.
Input: Local data {(x_i)_{G_ℓ}, y_i}_{i=1}^n stored on the ℓ-th party, learning rate γ.
1: Initialize the necessary parameters (for passive parties).
    Keep doing in parallel (shared-memory parallel for multiple threads)
2:    Receive ϑ and the index i from the dominator.
3:    Compute ṽ_ℓ = ∇_{G_ℓ}L(w̄) + λ∇_{G_ℓ}g(ŵ) = ϑ·(x_i)_{G_ℓ} + λ∇g(ŵ_{G_ℓ}).
4:    Update w_{G_ℓ} ← w_{G_ℓ} − γṽ_ℓ.
5: End parallel

As shown in Algorithm 2, at each dominated update, the dominator (an active party) calculates ϑ and then distributes ϑ together with i to the collaborators (the remaining q − 1 parties). As shown in Algorithm 3, once party ℓ has received ϑ and i, it launches a new collaborative update asynchronously. The dominator computes the local stochastic gradient as ∇_{G_ℓ}f_i(ŵ) = ∇_{G_ℓ}L(ŵ) + λ∇g(ŵ_{G_ℓ}), while a collaborator uses the received ϑ to compute ∇_{G_ℓ}L and the local ŵ to compute ∇_{G_ℓ}g, as shown at step 3 of Algorithm 3. Note that the active parties also perform Algorithm 3 to collaborate with the other dominators, which ensures that the model parameters of all parties are updated.
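As noted above, VFB²-SVRG and -SAGA change only the gradient estimate in step 6 of Algorithm 2 (and the analogous step 3 of Algorithm 3). As one hedged illustration, a standard SVRG-style estimator (Johnson and Zhang 2013) for party ℓ's block would look like the following sketch, where all names are illustrative and the snapshot bookkeeping of the actual VFB²-SVRG is simplified away:

def svrg_block_gradient(grad_i_current, grad_i_snapshot, full_grad_snapshot):
    """Variance-reduced gradient block for party l: the stochastic gradient at
    the current (inconsistently read) point, corrected by the same sample's
    gradient at a periodic snapshot plus the full gradient block stored there."""
    return grad_i_current - grad_i_snapshot + full_grad_snapshot

# For party l, grad_i_current = theta * (x_i)_l + lam * grad_g(w_l), and the
# snapshot quantities are the analogous values at the snapshot point, so the
# swap leaves the BUM communication pattern of Algorithms 2 and 3 unchanged.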
Theoretical Analysis

In this section, we provide the convergence analyses. Please see the arXiv version for more details. We first present the preliminaries for the strongly convex and nonconvex problems.

Assumption 1. For f_i(w) in problem (P), we assume the following conditions hold:
1. Lipschitz Gradient: For each function f_i, i = 1, . . . , n, there exists an L > 0 such that for all w, w′ ∈ ℝ^d,
$$\|\nabla f_i(w) - \nabla f_i(w')\| \le L\|w - w'\|. \tag{1}$$
2. Block-Coordinate Lipschitz Gradient: For i = 1, . . . , n, there exists an L_ℓ > 0 for the ℓ-th block G_ℓ, ℓ = 1, · · · , q, such that
$$\|\nabla_{G_\ell} f_i(w + U_\ell \Delta_\ell) - \nabla_{G_\ell} f_i(w)\| \le L_\ell \|\Delta_\ell\|, \tag{2}$$
where Δ_ℓ ∈ ℝ^{d_ℓ}, U_ℓ ∈ ℝ^{d×d_ℓ}, and [U₁, · · · , U_q] = I_d.
3. Bounded Block-Coordinate Gradient: There exists a constant G such that for f_i, i = 1, · · · , n, and block G_ℓ, ℓ = 1, · · · , q, it holds that ‖∇_{G_ℓ}f_i(w)‖² ≤ G.

Assumption 2. The regularization term g is L_g-smooth, which means that there exists an L_g > 0 such that for ℓ = 1, . . . , q and all w_{G_ℓ}, w′_{G_ℓ} ∈ ℝ^{d_ℓ},
$$\|\nabla g(w_{G_\ell}) - \nabla g(w'_{G_\ell})\| \le L_g \|w_{G_\ell} - w'_{G_\ell}\|. \tag{3}$$

Assumption 2 imposes smoothness on g, which is necessary for the convergence analyses: a specific collaborator uses the received ŵ (denoted as w̄) to compute ∇_{G_ℓ}L but its local ŵ to compute ∇_{G_ℓ}g = ∇g(w_{G_ℓ}), which makes it necessary to track the behavior of g individually. Similar to previous research works (Lian et al. 2015; Huo and Huang 2017; Leblond, Pedregosa, and Lacoste-Julien 2017), we introduce the bounded delay as follows.

Assumption 3. Bounded Delay: The time delays of inconsistent reading and of communication between a dominator and its collaborators are upper bounded by τ₁ and τ₂, respectively.

Given ŵ as the inconsistent read of w, which is used to compute the stochastic gradient in dominated updates, and following the analysis in (Gu et al. 2020b), we have
$$\hat w_t - w_t = \gamma \sum_{u \in D(t)} U_{\psi(u)}\, \tilde v_u^{\psi(u)}, \tag{4}$$
where D(t) ⊆ {t − 1, · · · , t − τ₀} is a subset of non-overlapped previous iterations with τ₀ ≤ τ₁. Given w̄ as the parameter used to compute ∇_{G_ℓ}L in collaborative updates, which is a stale state of ŵ due to the communication delay between the specific dominator and its corresponding collaborators, and following the analyses in (Huo and Huang 2017), there is
$$\bar w_t = \hat w_{t-\tau_0} = \hat w_t + \gamma \sum_{t' \in D'(t)} U_{\psi(t')}\, \tilde v_{t'}^{\psi(t')}, \tag{5}$$
where D′(t) ⊆ {t − 1, · · · , t − τ₀} is a subset of previous iterations performed during the communication and τ₀ ≤ τ₂.

Convergence Analysis for Strongly Convex Problem

Assumption 4. Each function f_i, i = 1, . . . , n, is µ-strongly convex, i.e., for all w, w′ ∈ ℝ^d there exists a µ > 0 such that
$$f_i(w) \ge f_i(w') + \langle \nabla f_i(w'), w - w' \rangle + \frac{\mu}{2}\|w - w'\|^2. \tag{6}$$

For the strongly convex problem, we introduce the notation K(t), which denotes a minimum set of successive iterations fully visiting all coordinates starting from global iteration t. This is necessary for the asynchronous convergence analyses of the global model. Moreover, we assume that the size of K(t) is upper bounded by η₁, i.e., |K(t)| ≤ η₁. Based on K(t), we introduce the epoch number v(t) as follows.

Definition 1. Let P(t) be a partition of {0, 1, · · · , t − σ′}, where σ′ ≥ 0, such that for any κ ∈ P(t) there exists a t′ ≤ t with K(t′) = κ, and there exists κ₁ ∈ P(t) with K(0) = κ₁. The epoch number v(t) for the t-th global iteration is defined as the maximum cardinality of such a partition P(t).

Given the definition of the epoch number v(t), we have the following theoretical results for the µ-strongly convex problem.

Theorem 1. Under Assumptions 1-3 and 4, to achieve the accuracy ε of problem (P) for VFB²-SGD, i.e., E(f(w_t) − f(w∗)) ≤ ε, let γ ≤ εµ^{1/3}/(96GL∗²)^{1/3}; if τ ≤ min{ε^{−4/3}, (GL∗²)^{2/3}/(ε²µ^{2/3})}, the epoch number v(t) should satisfy
$$v(t) \ge \frac{44(GL_*^2)^{1/3}}{\epsilon^{4/3}\mu}\log\Big(\frac{2(f(w_0) - f(w_*))}{\epsilon}\Big),$$
where L∗ = max{L, {L_ℓ}_{ℓ=1}^q, L_g}, τ = max{τ₁², τ₂², η₁²}, and w₀ and w∗ denote the initial point and the optimal point, respectively.

Theorem 2. Under Assumptions 1-3 and 4, to achieve the accuracy ε of problem (P) for VFB²-SVRG, let C = (L∗²γ + L∗)γ²/2 and ρ = γµ/2 − 16L∗²η₁C/µ; we can carefully choose γ such that
$$1)\ 1 - 2L_*^2\gamma^2\tau > 0;\quad 2)\ \rho > 0;\quad 3)\ \frac{8L_*^2\tau^{1/2}C}{\rho\mu} \le 0.05;\quad 4)\ L_*^2\gamma^2\tau^{3/2}(28C + 5\gamma)\,\frac{36G}{\rho(1 - 2L_*^2\gamma^2\tau)} \le \frac{\epsilon}{8}, \tag{7}$$
where v(t) should satisfy v(t) ≥ log 0.25 / log(1 − ρ), and the outer loop number S should satisfy
$$S \ge \frac{1}{\log\frac{4}{3}}\log\Big(\frac{2(f(w_0) - f(w_*))}{\epsilon}\Big).$$

[Figure 2: q-parties speedup scalability with m = 2 on D4. Panels (a) SGD-based, (b) SVRG-based, and (c) SAGA-based plot the multiple-parties speedup against the number of parties (#Party = 1, 4, 8) for the ideal speedup, our asynchronous algorithms (Async), and their synchronous counterparts (Sync).]

Theorem 3. Under Assumptions 1-3 and 4, to achieve the accuracy ε of problem (P) for VFB²-SAGA, let
$$c_0 = 2\gamma^3\tau^{3/2} + (L_*^2\gamma^3\tau + L_*\gamma^2)\,180\gamma^2\tau^{3/2} + 8\gamma^2\tau\,\frac{18GL_*^2}{1 - 72L_*^2\gamma^2\tau},$$
$$c_1 = 2L_*^2\tau(L_*^2\gamma^3\tau + L_*\gamma^2),\qquad c_2 = 4(L_*^2\gamma^3\tau + L_*\gamma^2)\,\frac{L_*^2\tau}{n},$$
and ρ ∈ (1 − 1/n, 1); we can choose γ such that
$$1)\ 1 - 72L_*^2\gamma^2\tau > 0;\quad 2)\ 0 < 1 - \frac{\gamma\mu}{4} < 1;\quad 3)\ \frac{4c_0}{\gamma\mu(1-\rho)\big(\frac{\gamma\mu}{4} - 2c_1 - c_2\big)} \le \frac{\epsilon}{2};$$
$$4)\ -\frac{\gamma\mu^2}{4} + 2c_1 + c_2\Big(1 + \big(1 - \tfrac{1 - 1/n}{\rho}\big)^{-1}\Big) \le 0;\quad 5)\ -\frac{\gamma\mu^2}{4} + c_2 + c_1\Big(2 + \big(1 - \tfrac{1 - 1/n}{\rho}\big)^{-1}\Big) \le 0, \tag{8}$$
and the epoch number v(t) should satisfy
$$v(t) \ge \frac{1}{\log\frac{1}{\rho}}\log\Big(\frac{2\big(2\rho - 1 + \frac{\gamma\mu}{4}\big)\big(f(w_0) - f(w_*)\big)}{\big(\rho - 1 + \frac{\gamma\mu}{4}\big)\big(\frac{\gamma\mu}{4} - 2c_1 - c_2\big)\,\epsilon}\Big).$$

Remark 1. For strongly convex problems, given the assumptions and parameters in the corresponding theorems, the convergence rate of VFB²-SGD is O((1/ε) log(1/ε)), and those of VFB²-SVRG and VFB²-SAGA are O(log(1/ε)).

Convergence Analysis for Nonconvex Problem

Assumption 5. The nonconvex function f(w) is bounded below, i.e.,
$$f^* := \inf_{w \in \mathbb{R}^d} f(w) > -\infty. \tag{9}$$

Assumption 5 guarantees the feasibility of the nonconvex problem (P). For the nonconvex problem, we introduce the notation K′(t), which denotes a set of q iterations fully visiting all coordinates, i.e., K′(t) = {{t, t + t̄₁, · · · , t + t̄_{q−1}} : ψ({t, t + t̄₁, · · · , t + t̄_{q−1}}) = {1, · · · , q}}, where the t-th global iteration denotes a dominated update. These iterations are performed, respectively, on a dominator and on q − 1 different collaborators receiving the ϑ calculated at the t-th global iteration. Moreover, we assume that K′(t) can be completed within η₂ global iterations, i.e., for all t′ ∈ A(t), there is η₂ ≥ max{u | u ∈ K′(t′)} − t′. Note that, different from K(t), there is |K′(t)| = q, and the definition of K′(t) does not emphasize "successive iterations", owing to the difference in analysis techniques between the strongly convex and nonconvex problems. Based on K′(t), we introduce the epoch number v′(t) as follows.

Definition 2. A(t) denotes a set of global iterations such that for all t′ ∈ A(t), the t′-th global iteration denotes a dominated update and ∪_{t′∈A(t)} K′(t′) = {0, 1, · · · , t}. The epoch number v′(t) is defined as |A(t)|.

Given the definition of the epoch number v′(t), we have the following theoretical results for the nonconvex problem.

Theorem 4. Under Assumptions 1-3 and 5, to achieve an ε-first-order stationary point of problem (P), i.e., E‖∇f(w)‖ ≤ ε for the stochastic variable w, for VFB²-SGD, let γ = ε/(L∗qG); if τ ≤ 512qG/ε², the total epoch number T should satisfy
$$T \ge \frac{\mathbb{E}\big[f(w_0) - f^*\big]\,L_* qG}{\epsilon^2}, \tag{10}$$
where L∗ = max{L, {L_ℓ}_{ℓ=1}^q, L_g}, τ = max{τ₁², τ₂², η₂²}, f(w₀) is the initial function value, and f∗ is defined in Eq. (9).

Theorem 5. Under Assumptions 1-3 and 5, to solve problem (P) with VFB²-SVRG, let γ = m₀/(L∗n^α), where 0 < m₀ < 1/8 and 0 < α ≤ 1. If the epoch number N in an outer loop satisfies N ≤ ⌊n^α/(2m₀)⌋ and τ < min{n^{2α}/(20m₀²), (1 − 8m₀)/(40m₀²)}, there is
$$\frac{1}{T}\sum_{s=1}^{S}\sum_{t=0}^{N-1} \mathbb{E}\|\nabla f(w_{t_0}^s)\|^2 \le \frac{L_* n^\alpha\, \mathbb{E}\big[f(w_0) - f(w_*)\big]}{T\sigma}, \tag{11}$$
where T is the total number of epochs, t₀ is the start iteration of epoch t, and σ is a small value independent of n.

Theorem 6. Under Assumptions 1-3 and 5, to solve problem (P) with VFB²-SAGA, let γ = m₀/(L∗n^α), where 0 < m₀ < 1/20 and 0 < α ≤ 1. If the total epoch number T satisfies T ≤ ⌊n^α/(4m₀)⌋ and τ < min{n^{2α}/(180m₀²), (1 − 20m₀)/(40m₀²)}, there is
$$\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\|\nabla f(w_{t_0})\|^2 \le \frac{L_* n^\alpha\, \mathbb{E}\big[f(w_0) - f(w_*)\big]}{T\sigma}. \tag{12}$$

Remark 2. For nonconvex problems, given the conditions in the theorems, the convergence rate of VFB²-SGD is O(1/√T), and those of VFB²-SVRG and VFB²-SAGA are O(1/T).

Security Analysis

We discuss the data security and model security of VFB² under two semi-honest threat models commonly used in security analysis (Cheng et al. 2019; Xu et al. 2019; Gu et al. 2020a). These two threat models have different threat abilities: threat model 2 allows collusion between parties, while threat model 1 does not.

• Honest-but-curious (threat model 1): All workers follow the algorithm to perform the correct computations. However, they may use their own retained records of the intermediate computation results to infer another worker's data and model.

• Honest-but-colluding (threat model 2): All workers follow the algorithm to perform the correct computations. However, some workers may collude to infer another worker's data and model by sharing their retained records of the intermediate computation results.

Similar to (Gu et al. 2020a), we prove the security of VFB² by analyzing its ability to prevent the inference attack defined as follows.

Definition 3 (Inference attack). An inference attack on the ℓ-th party is to infer (x_i)_{G_ℓ} (or w_{G_ℓ}) belonging to other parties, or the y_i held by active parties, without directly accessing them.

Lemma 1. Given an equation o_i = w_{G_ℓ}^⊤(x_i)_{G_ℓ} or o_i = ∂L(ŵ^⊤x_i, y_i)/∂(ŵ^⊤x_i) with only o_i known, there are infinitely many different solutions to this equation.

The proof of Lemma 1 is given in the arXiv version. Based on Lemma 1, we obtain the following theorem.

Theorem 7. Under the two semi-honest threat models, VFB² can prevent the inference attack.

Feature and model security: During the aggregation, the value of o_i = w_{G_ℓ}^⊤(x_i)_{G_ℓ} is masked by δ_ℓ, and only the value of w_{G_ℓ}^⊤(x_i)_{G_ℓ} + δ_ℓ is transmitted. Under threat model 1, one cannot even access the true value of o_i, let alone use the relation o_i = w_{G_ℓ}^⊤(x_i)_{G_ℓ} to infer w_{G_ℓ} and (x_i)_{G_ℓ}. Under threat model 2, the random value δ_ℓ risks being removed from the term w_{G_ℓ}^⊤(x_i)_{G_ℓ} + δ_ℓ through collusion with other parties. Applying Lemma 1 to this circumstance, even if the random value were removed, it would still be impossible to exactly infer w_{G_ℓ} and (x_i)_{G_ℓ}. Thus, the aggregation process can prevent the inference attack under both semi-honest threat models.

Label security: When analyzing the security of the labels, we do not consider collusion between active and passive parties, which would make preventing label leakage meaningless. In the backward updating process, if a passive party ℓ wants to infer y_i from the received ϑ, it must solve the equation ϑ = ∂L(ŵ^⊤x_i, y_i)/∂(ŵ^⊤x_i). However, only ϑ is known to party ℓ. Thus, it follows from Lemma 1 that it is impossible to exactly infer the labels. Moreover, collusion between passive parties poses no threat to the security of the labels. Therefore, the backward updating can prevent the inference attack under both semi-honest threat models.

From the above analyses, the feature security, label security, and model security are all guaranteed in VFB².
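As a concrete instance of Lemma 1 for the first equation (an illustrative argument; the full proof is in the arXiv version): every nonzero candidate weight block can be paired with a feature block reproducing the observed value, so knowing o_i alone pins down nothing.

% For any nonzero candidate \tilde w \in \mathbb{R}^{d_\ell}, the choice
% \tilde x = \frac{o_i}{\|\tilde w\|^2}\,\tilde w satisfies the observation:
\tilde w^{\top}\tilde x
  = \tilde w^{\top}\!\left(\frac{o_i}{\|\tilde w\|^{2}}\,\tilde w\right)
  = \frac{o_i}{\|\tilde w\|^{2}}\,\|\tilde w\|^{2}
  = o_i,
% so the solution set \{(\tilde w, \tilde x) : \tilde w^\top \tilde x = o_i\}
% is infinite for every block dimension d_\ell \ge 1.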
                                                                         Asynchronous Efficiency: In these experiments, we set
                                                                         q = 8, m = 3 and fix the γ for algorithms with a same
Experiments

In this section, extensive experiments are conducted to demonstrate the efficiency, scalability, and losslessness of our algorithms. More experiments are presented in the arXiv version.

Experiment Settings: All experiments are run on a machine with four sockets, each with 12 cores. To simulate an environment with multiple machines (or parties), we arrange an extra thread for each party to schedule its k threads and to support communication with (the threads of) the other parties. We use MPI to implement the communication scheme. The data are partitioned vertically and randomly into q non-overlapping parts with nearly equal numbers of features. The number of threads within each party, i.e., k, is set to m. We use the provided training set, or randomly select 80% of the samples, as the training data, and the provided testing set, or the remaining samples, as the testing data. An optimal learning rate γ is chosen from {5e−1, 1e−1, 5e−2, 1e−2, · · · }, with regularization coefficient λ = 1e−4 for all experiments.

Datasets: We use the four classification datasets summarized in Table 1 for evaluation. In particular, D1 (UCICreditCard) and D2 (GiveMeSomeCredit) are real financial datasets from the Kaggle website∗, which we use to demonstrate the ability to address real-world tasks; D3 (news20) and D4 (webspam) are large-scale datasets from the LIBSVM (Chang and Lin 2011) website†. Note that we apply one-hot encoding to the categorical features of D1 and D2, so their numbers of features become 90 and 92, respectively.

Table 1: Dataset descriptions.

                 Financial              Large-scale
              D1         D2          D3            D4
#Samples    24,000     96,257      17,996       175,000
#Features      90         92     1,355,191    16,609,143

Problems: We consider the ℓ₂-norm regularized logistic regression problem for the µ-strongly convex case,
$$\min_{w \in \mathbb{R}^d} f(w) := \frac{1}{n}\sum_{i=1}^{n}\log\big(1 + e^{-y_i w^\top x_i}\big) + \frac{\lambda}{2}\|w\|^2, \tag{13}$$
and the nonconvex logistic regression problem
$$\min_{w \in \mathbb{R}^d} f(w) := \frac{1}{n}\sum_{i=1}^{n}\log\big(1 + e^{-y_i w^\top x_i}\big) + \frac{\lambda}{2}\sum_{i=1}^{d}\frac{w_i^2}{1 + w_i^2}.$$

Evaluations of Asynchronous Efficiency and Scalability

To demonstrate the asynchronous efficiency, we introduce the synchronous counterparts of our algorithms (i.e., synchronous VFL algorithms with BUM, denoted as VFB) for comparison. When implementing the synchronous algorithms, we include a synthetic straggler party that is 30% to 50% slower than the fastest party, to simulate a real application scenario with unbalanced computational resources.

Asynchronous Efficiency: In these experiments, we set q = 8 and m = 3, and fix γ for algorithms of the same SGD type but different parallel fashions. As shown in Figs. 3 and 4, the loss vs. run time curves demonstrate that our algorithms consistently outperform their synchronous counterparts in efficiency.

Moreover, from the perspective of loss vs. epoch number, the SVRG- and SAGA-based algorithms have better convergence rates than the SGD-based algorithms, which is consistent with the theoretical results.

Asynchronous Scalability: We also consider the asynchronous speedup scalability in terms of the total number of parties q. Given a fixed m, the q-parties speedup is defined as
$$\text{q-parties speedup} = \frac{\text{run time of using 1 party}}{\text{run time of using } q \text{ parties}}. \tag{14}$$

∗ https://www.kaggle.com/datasets
† https://www.csie.ntu.edu.tw/cjlin/libsvmtools/datasets/
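For reference, a sketch of the full-batch gradients of the two objectives above (a straightforward transcription of Problem (13) and its nonconvex variant, not the authors' released code):

import numpy as np

def logistic_gradient(w, X, y, lam=1e-4, nonconvex=False):
    """Gradient of (1/n) sum_i log(1 + exp(-y_i w^T x_i)) plus either the
    l2 regularizer (lam/2)||w||^2 or the nonconvex (lam/2) sum_i w_i^2/(1+w_i^2)."""
    theta = -y / (1.0 + np.exp(y * (X @ w)))      # per-sample dL/dz, as in the BUM
    data_grad = X.T @ theta / len(y)
    if nonconvex:
        reg_grad = lam * w / (1.0 + w ** 2) ** 2  # d/dw of (lam/2) w^2/(1+w^2)
    else:
        reg_grad = lam * w
    return data_grad + reg_grad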

                                                                 10901
[Figure: four panels, (a) Data: D1, (b) Data: D2, (c) Data: D3, (d) Data: D4, each plotting sub-optimality (log scale) against Time/s for VFB2-SAGA, VFB2-SVRG, and VFB2-SGD (ours) versus VFB-SAGA, VFB-SVRG, and VFB-SGD.]

Figure 3: Results for solving µ-strongly convex VFL models (Problem 13), where the number of epochs (points) denotes how many passes over the dataset the algorithm makes.
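
To make the evaluation protocol concrete, the sketch below shows one way curves like those in Figures 3 and 4 can be produced. This is our own illustration, not the authors' released code: the function name, the trace format, and the way f(w*) is estimated (e.g., the lowest objective reached by a long reference run) are all assumptions.

```python
# Hypothetical sketch: plotting sub-optimality vs. wall-clock time, as in Figs. 3-4.
import matplotlib.pyplot as plt

def plot_suboptimality(traces, f_star):
    """traces: dict mapping an algorithm name to a [(time_s, f_value), ...] trace.
    f_star: an estimate of the optimal objective value f(w*)."""
    for name, trace in traces.items():
        times = [t for t, _ in trace]
        subopt = [f - f_star for _, f in trace]  # sub-optimality f(w_t) - f(w*)
        plt.semilogy(times, subopt, label=name)  # log-scale y-axis, as in the figures
    plt.xlabel("Time/s")
    plt.ylabel("Sub-optimality")
    plt.legend()
    plt.show()
```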

[Figure: four panels, (a) Data: D1, (b) Data: D2, (c) Data: D3, (d) Data: D4, each plotting sub-optimality (log scale) against Time/s for VFB2-SAGA, VFB2-SVRG, and VFB2-SGD (ours) versus VFB-SAGA, VFB-SVRG, and VFB-SGD.]

Figure 4: Results for solving nonconvex VFL models (Problem 14), where the number of epochs (points) denotes how many passes over the dataset the algorithm makes.

                 Algorithm      D1              D2              D3              D4
Problem (13)     NonF           81.96%±0.25%    93.56%±0.19%    98.29%±0.21%    92.17%±0.12%
                 AFSVRG-VP      79.35%±0.19%    93.35%±0.18%    97.24%±0.11%    89.17%±0.10%
                 Ours           81.96%±0.22%    93.56%±0.20%    98.29%±0.20%    92.17%±0.13%
Problem (14)     NonF           82.03%±0.32%    93.56%±0.25%    98.45%±0.29%    92.71%±0.24%
                 AFSVRG-VP      79.36%±0.24%    93.35%±0.22%    97.59%±0.13%    89.98%±0.14%
                 Ours           82.03%±0.34%    93.56%±0.24%    98.45%±0.33%    92.71%±0.27%
                                Table 2: Accuracy of different algorithms to evaluate the losslessness of our algorithms (10 trials).
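
Each cell of Table 2 reports mean ± standard deviation of test accuracy over the 10 trials. The sketch below shows one way such cells can be aggregated; the helper name and the accuracy values are hypothetical placeholders, not the paper's actual numbers or code.

```python
# Hypothetical sketch: aggregating repeated-trial accuracies into "mean%±std%"
# cells in the style of Table 2.
import statistics

def format_accuracy(trial_accuracies):
    """trial_accuracies: per-trial test accuracies in [0, 1] (here, 10 trials)."""
    mean = statistics.mean(trial_accuracies)
    std = statistics.stdev(trial_accuracies)  # sample standard deviation
    return f"{100 * mean:.2f}%\u00b1{100 * std:.2f}%"

accuracies = [0.8192, 0.8201, 0.8195, 0.8190, 0.8199,
              0.8203, 0.8188, 0.8197, 0.8194, 0.8201]  # made-up placeholder values
print(format_accuracy(accuracies))  # e.g., "81.96%±0.05%"
```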

where run time is defined as the time spent reaching a certain precision of sub-optimality, i.e., 1e−3 for D4. We run this experiment for Problem (14); the results are shown in Fig. 2. As depicted in Fig. 2, our asynchronous algorithms have much better q-party speedup scalability than the synchronous ones and can achieve near-linear speedup.
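
This measurement is straightforward to reproduce: given a recorded (time, sub-optimality) trace per configuration, the run time is the first moment the trace crosses the target precision, and the speedup is a ratio of run times. The sketch below is our own illustration under those assumptions; the function names and trace format are hypothetical.

```python
# Hypothetical sketch: run time to a target precision and the resulting speedup.
def time_to_precision(trace, target=1e-3):
    """Return the first recorded time at which sub-optimality drops to `target`.
    trace: list of (elapsed_seconds, suboptimality) pairs, in time order."""
    for elapsed, subopt in trace:
        if subopt <= target:
            return elapsed
    raise ValueError("target precision never reached")

def speedup(trace_baseline, trace_q, target=1e-3):
    """Speedup of a q-party run relative to the baseline run."""
    return time_to_precision(trace_baseline, target) / time_to_precision(trace_q, target)
```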
Evaluation of Losslessness
To demonstrate the losslessness of our algorithms, we compare VFB2-SVRG with its non-federated (NonF) counterpart (where all data are integrated together for modeling) and with an ERCR based algorithm without the backward updating mechanism, i.e., AFSVRG-VP proposed in (Gu et al. 2020b). Specifically, AFSVRG-VP also uses the distributed SGD method but cannot optimize the parameters corresponding to the passive parties because it lacks labels. When implementing AFSVRG-VP, we assume that only half of the parties have labels, i.e., the parameters corresponding to the features held by the other parties are not optimized. Each comparison is repeated 10 times with m = 3, q = 8, and the same stopping criterion, e.g., 1e−5 for D1. As shown in Table 2, the accuracies of our algorithms are the same as those of the NonF algorithms and much better than those of AFSVRG-VP, which is consistent with our claims.

Conclusion
In this paper, we proposed a novel backward updating mechanism for the real VFL system where only one or some of the parties hold labels for training models. Our new algorithms enable all parties, rather than only the active parties, to collaboratively update the model while also guaranteeing convergence, a property that does not hold for other recently proposed ERCR based VFL methods under this real-world setting. Moreover, we proposed a bilevel asynchronous parallel architecture that makes ERCR based algorithms with backward updating more efficient in real-world tasks. Three practical SGD-type algorithms were also proposed with theoretical guarantees.

Acknowledgments
Q.S. Zhang and C. Deng were supported in part by the National Natural Science Foundation of China under Grant 62071361, the National Key R&D Program of China under Grant 2017YFE0104100, and the China Research Project under Grant 6141B07270429.

References
Bottou, L. 2010. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, 177–186. Springer.
Chang, C.-C.; and Lin, C.-J. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST) 2(3): 27.
Cheng, K.; Fan, T.; Jin, Y.; Liu, Y.; Chen, T.; and Yang, Q. 2019. SecureBoost: A Lossless Federated Learning Framework. arXiv preprint arXiv:1901.08755.
Conroy, B.; and Sajda, P. 2012. Fast, exact model selection and permutation testing for l2-regularized logistic regression. In Artificial Intelligence and Statistics, 246–254.
Dang, Z.; Li, X.; Gu, B.; Deng, C.; and Huang, H. 2020. Large-Scale Nonlinear AUC Maximization via Triply Stochastic Gradients. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Defazio, A.; Bach, F.; and Lacoste-Julien, S. 2014. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in NIPS, 1646–1654.
Gascón, A.; Schoppmann, P.; Balle, B.; Raykova, M.; Doerner, J.; Zahur, S.; and Evans, D. 2016. Secure Linear Regression on Vertically Partitioned Datasets. IACR Cryptology ePrint Archive 2016: 892.
Gong, Y.; Fang, Y.; and Guo, Y. 2016. Private data analytics on biomedical sensing data via distributed computation. IEEE/ACM Transactions on Computational Biology and Bioinformatics 13(3): 431–444.
Gu, B.; Dang, Z.; Li, X.; and Huang, H. 2020a. Federated Doubly Stochastic Kernel Learning for Vertically Partitioned Data. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2483–2493.
Gu, B.; Xu, A.; Deng, C.; and Huang, H. 2020b. Privacy-Preserving Asynchronous Federated Learning Algorithms for Multi-Party Vertically Collaborative Learning. arXiv preprint arXiv:2008.06233.
Hardy, S.; Henecka, W.; Ivey-Law, H.; Nock, R.; Patrini, G.; Smith, G.; and Thorne, B. 2017. Private federated learning on vertically partitioned data via entity resolution and additively homomorphic encryption. arXiv preprint arXiv:1711.10677.
Hu, Y.; Niu, D.; Yang, J.; and Zhou, S. 2019. FDML: A collaborative machine learning framework for distributed features. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2232–2240.
Huang, F.; Chen, S.; and Huang, H. 2019. Faster Stochastic Alternating Direction Method of Multipliers for Nonconvex Optimization. In ICML, 2839–2848.
Huang, F.; Gao, S.; Pei, J.; and Huang, H. 2020. Accelerated zeroth-order momentum methods from mini to minimax optimization. arXiv preprint arXiv:2008.08170.
Huo, Z.; and Huang, H. 2017. Asynchronous mini-batch gradient descent with variance reduction for non-convex optimization. In Thirty-First AAAI Conference on Artificial Intelligence.
Johnson, R.; and Zhang, T. 2013. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in NIPS, 315–323.
Kairouz, P.; McMahan, H. B.; Avent, B.; Bellet, A.; Bennis, M.; Bhagoji, A. N.; Bonawitz, K.; Charles, Z.; Cormode, G.; Cummings, R.; et al. 2019. Advances and Open Problems in Federated Learning. arXiv preprint arXiv:1912.04977.
Kim, M.; Song, Y.; Wang, S.; Xia, Y.; and Jiang, X. 2018. Secure logistic regression based on homomorphic encryption: Design and evaluation. JMIR Medical Informatics 6(2): e19.
Leblond, R.; Pedregosa, F.; and Lacoste-Julien, S. 2017. Asaga: Asynchronous Parallel Saga. In 20th International Conference on Artificial Intelligence and Statistics (AISTATS) 2017.
Li, M.; Deng, C.; Li, T.; Yan, J.; Gao, X.; and Huang, H. 2020. Towards Transferable Targeted Attack. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 641–649.
Lian, X.; Huang, Y.; Li, Y.; and Liu, J. 2015. Asynchronous parallel stochastic gradient for nonconvex optimization. In Advances in Neural Information Processing Systems, 2737–2745.
Liu, F.; Ng, W. K.; and Zhang, W. 2015. Encrypted gradient descent protocol for outsourced data mining. In 2015 IEEE 29th International Conference on Advanced Information Networking and Applications, 339–346. IEEE.
Liu, Y.; Liu, Y.; Liu, Z.; Zhang, J.; Meng, C.; and Zheng, Y. 2019. Federated Forest. arXiv preprint arXiv:1905.10053.
McMahan, H. B.; Moore, E.; Ramage, D.; Hampson, S.; et al. 2016. Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629.
Shen, X.; Alam, M.; Fikse, F.; and Rönnegård, L. 2013. A novel generalized ridge regression method for quantitative genetics. Genetics 193(4): 1255–1268.
Smith, V.; Chiang, C.-K.; Sanjabi, M.; and Talwalkar, A. S. 2017. Federated multi-task learning. In Advances in Neural Information Processing Systems, 4424–4434.
Wang, X.; Ma, S.; Goldfarb, D.; and Liu, W. 2017. Stochastic quasi-Newton methods for nonconvex stochastic optimization. SIAM Journal on Optimization 27(2): 927–956.
Wang, Z.; Ji, K.; Zhou, Y.; Liang, Y.; and Tarokh, V. 2019. SpiderBoost and Momentum: Faster Variance Reduction Algorithms. In Advances in NIPS, 2403–2413.

Wei, K.; Yang, M.; Wang, H.; Deng, C.; and Liu, X. 2019. Adversarial Fine-Grained Composition Learning for Unseen Attribute-Object Recognition. In Proceedings of the IEEE International Conference on Computer Vision, 3741–3749.
Xu, R.; Baracaldo, N.; Zhou, Y.; Anwar, A.; and Ludwig, H. 2019. HybridAlpha: An efficient approach for privacy-preserving federated learning. In Proceedings of the 12th ACM Workshop on Artificial Intelligence and Security, 13–23.
Yang, K.; Fan, T.; Chen, T.; Shi, Y.; and Yang, Q. 2019a. A Quasi-Newton Method Based Vertical Federated Learning Framework for Logistic Regression. arXiv preprint arXiv:1912.00513.
Yang, M.; Deng, C.; Yan, J.; Liu, X.; and Tao, D. 2020a. Learning Unseen Concepts via Hierarchical Decomposition and Composition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10248–10256.
Yang, Q.; Liu, Y.; Chen, T.; and Tong, Y. 2019b. Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST) 10(2): 12.
Yang, X.; Deng, C.; Wei, K.; Yan, J.; and Liu, W. 2020b. Adversarial Learning for Robust Deep Clustering. Advances in Neural Information Processing Systems 33.
Zhang, G.-D.; Zhao, S.-Y.; Gao, H.; and Li, W.-J. 2018. Feature-Distributed SVRG for High-Dimensional Linear Classification. arXiv preprint arXiv:1802.03604.
Zhang, Q.; Huang, F.; Deng, C.; and Huang, H. 2020. Faster Stochastic Quasi-Newton Methods. arXiv preprint arXiv:2004.06479.
