Communication-Efficient Frank-Wolfe Algorithm for Nonconvex Decentralized Distributed Learning
The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21)
Wenhan Xian¹, Feihu Huang¹, Heng Huang¹,²
¹ Department of Electrical and Computer Engineering, University of Pittsburgh, Pittsburgh, USA
² JD Finance America Corporation, Mountain View, CA, USA
wex37@pitt.edu, huangfeihu2018@gmail.com, heng.huang@pitt.edu

Abstract

Recently, decentralized optimization has attracted much attention in machine learning because it is more communication-efficient than the centralized fashion. Quantization is a promising method to reduce the communication cost by cutting down the budget of each single communication via gradient compression. To further improve communication efficiency, several quantized decentralized algorithms have recently been studied. However, quantized decentralized algorithms for nonconvex constrained machine learning problems are still limited. The Frank-Wolfe (a.k.a. conditional gradient or projection-free) method is very efficient for many constrained optimization tasks, such as training low-rank or sparsity-constrained models. In this paper, to fill the gap of decentralized quantized constrained optimization, we propose a novel communication-efficient Decentralized Quantized Stochastic Frank-Wolfe (DQSFW) algorithm for non-convex constrained learning models. We first design a new counterexample to show that the vanilla decentralized quantized stochastic Frank-Wolfe algorithm usually diverges. Thus, we propose the DQSFW algorithm with the gradient tracking technique to guarantee that the method safely converges to a stationary point of the non-convex problem. In our theoretical analysis, we prove that to reach a stationary point our DQSFW algorithm achieves the same gradient complexity as the standard stochastic Frank-Wolfe and centralized Frank-Wolfe algorithms, but with much less communication cost. Experiments on matrix completion and model compression applications demonstrate the efficiency of our new algorithm.

Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Introduction

Nowadays, many machine learning tasks are deployed on distributed systems that enable computation in parallel, especially for large-scale models such as deep neural networks (DNNs). The centralized distribution structure has been widely used, but the centralized learning scheme has a key communication bottleneck: the communication burden of the central server grows with the number of nodes. For example, when the system has M workers, the central server suffers a communication complexity of O(M). Thus, the decentralized distribution structure has recently attracted much attention in machine learning due to its communication efficiency compared with the centralized fashion. Specifically, decentralized optimization adopts a pattern in which each node maintains its own local data and model and communicates only with its neighbors. In fact, the per-iteration communication complexity of a decentralized system depends on the degree of the graph topology (usually not on the number of nodes), and data locality is allowed.

Recently, decentralized learning has become a popular research topic in machine learning and has been widely studied. It was first studied for computing aggregates among clients, or for the sake of data locality and privacy (Ram, Nedić, and Veeravalli 2010; Yan et al. 2012), where a centralized structure is not allowed. A general decentralized algorithm can be traced back to (Nedić and Ozdaglar 2009), which combines the gradient descent method with a Gossip-type consensus step. Subsequently, a degenerate case of decentralization that has achieved huge success is federated optimization (Konečný et al. 2016), which adopts a star topology but allows data to be drawn from non-iid distributions. Besides, many other fully decentralized works based on general network topologies, such as (Lian et al. 2017; Tang et al. 2018), have shown that decentralized methods can achieve more efficient communication without sacrificing the training result, indicating that decentralization is becoming competitive and advantageous in distributed learning, rather than just an alternative to centralization when centralization is not allowed. Lian et al. (2017) presented an important decentralized optimization work verifying that the decentralized method can outperform its centralized counterpart. They proposed an algorithm named Decentralized Parallel Stochastic Gradient Descent (D-PSGD) that directly computes the average over nodes with exact communication, and it has the same convergence rate as centralized SGD in nonconvex optimization with non-identical data distributions.

To reduce the communication cost in distributed systems, gradient quantization (Seide et al. 2014) is another effective method. Recently, many quantized gradient algorithms, such as QSGD (Alistarh et al. 2017), signSGD (Bernstein et al. 2018a) and its variant (Bernstein et al. 2018b), were developed and showed excellent performance. In these algorithms, the number of bits transmitted in each communication round is reduced by packing and unpacking gradients. Alistarh et al. (2017) proposed an unbiased quantization scheme and proved that it converges under both convex and non-convex conditions. However, for other quantization methods like 1-bit quantization or signSGD, the unbiasedness assumption is not always satisfied. Karimireddy et al. (2019) proved that when applying signSGD with a scalar factor and the error-feedback technique, the algorithm is guaranteed to converge in non-convex optimization. More recently, to further improve communication efficiency, multiple quantized decentralized algorithms (Doan, Maguluri, and Romberg 2018; Reisizadeh et al. 2019a,b; Tang et al. 2019; Koloskova et al. 2020) have been introduced. However, to the best of our knowledge, existing quantized decentralized algorithms for constrained problems are still very limited. In fact, large-scale constrained optimization problems are popular in many machine learning applications, such as matrix completion and deep neural network compression.

To address this challenging issue, in this paper we focus on the quantized decentralized algorithm for solving the following constrained optimization problem:

    min_{x∈Ω} (1/M) Σ_{i=1}^{M} f_i(x),                          (1)

where f_i(x) is a nonconvex smooth loss function, Ω is a convex and compact constraint set, and M is the number of worker nodes. f_i(x) is the objective function on node i and may have the stochastic expectation or finite-sum formulation:

    f_i(x) = E_{ξ∼D_i} F^{(i)}(x; ξ)                (stochastic)
    f_i(x) = (1/n_i) Σ_{j=1}^{n_i} F_j^{(i)}(x)     (finite-sum)   (2)

where D_i is the data distribution on the i-th node. Apparently, the finite-sum objective is a particular case of the stochastic problem in which D_i consists of finitely many samples. We allow the distributions D_i to be non-identical, which is more adaptive to general machine learning tasks and is assumed in many previous decentralized analyses (Lian et al. 2017; Tang et al. 2018; Lian et al. 2018).

To solve the above constrained optimization, the Frank-Wolfe (a.k.a. conditional gradient or projection-free) method is one of the most efficient and popular algorithms, because it only requires computing a linear oracle instead of the expensive projection operator used in proximal gradient methods (Ghadimi, Lan, and Zhang 2016) and the alternating direction method of multipliers (Huang, Chen, and Huang 2019). In this paper, we thus focus on designing a communication-efficient quantized decentralized Frank-Wolfe algorithm to solve problem (1). It is nontrivial to design such an algorithm. We first provide a counterexample to show that the vanilla quantized decentralized Frank-Wolfe algorithm usually diverges (please see the Counterexample section). Thus, there is an important research problem to be addressed:

Can we design a communication-efficient quantized decentralized Frank-Wolfe algorithm with a convergence guarantee for non-convex optimization?

In this paper, we answer the above challenging question positively and propose a novel Decentralized Quantized Stochastic Frank-Wolfe (DQSFW) algorithm to solve problem (1) using the gradient tracking technique, which guarantees that DQSFW safely and quickly converges to a stationary point in non-convex optimization. Specifically, our DQSFW algorithm uses a 1-bit gradient quantization scheme. In summary, the main contributions of this paper are as follows:

(1) We propose a novel efficient Decentralized Quantized Stochastic Frank-Wolfe (DQSFW) method to solve problem (1) with less communication cost but still good convergence speed.

(2) We provide a rigorous theoretical analysis of our DQSFW algorithm, and prove that it has the same gradient complexity O(ε⁻⁴) as SFW (Reddi et al. 2016) (the sequential algorithm) and QFW (Zhang et al. 2019) (the centralized algorithm), but with much less communication cost.

(3) We provide a new intuitive counterexample to show that decentralized optimization involving a non-linear projection of the gradient can diverge, a problem that also arises when other non-Frank-Wolfe methods are generalized to decentralized algorithms. To tackle this challenge, we utilize the gradient tracking technique to guarantee the convergence of our decentralized quantized Frank-Wolfe algorithm.

Notations

‖·‖₁ denotes the ℓ₁ norm of a vector. ‖·‖₂ denotes the spectral norm of a matrix. ‖·‖_F denotes the Frobenius norm of a matrix. ‖·‖_* denotes the trace norm of a matrix. Let 1 be the column vector whose entries are all one. Given a network with M nodes, we define a mixing matrix W = (w_ij) ∈ R^{M×M} that represents the weights of neighbors in the communication round. For example, in D-PSGD (Lian et al. 2017), the consensus step on the i-th node is formulated as

    x_t^{(i)} = Σ_{j=1}^{M} w_ij x_t^{(j)}.

Generally, W is a symmetric doubly stochastic matrix that satisfies Wᵀ = W and W1 = 1. In the experiment section, we will consider a uniformly weighted ring-based network, whose mixing matrix is shown in Eq. (3):

    W = [ 1/3  1/3   0   ···   0   1/3
          1/3  1/3  1/3   0   ···   0
           0   1/3  1/3  1/3   0   ···
           ⋮         ⋱    ⋱    ⋱
           0   ···   0   1/3  1/3  1/3
          1/3   0   ···   0   1/3  1/3 ]                         (3)

Related Works

Decentralized Frank-Wolfe

The decentralized Frank-Wolfe algorithm (DeFW) (Wai et al. 2017) was recently proposed to apply the deterministic Frank-Wolfe method in a decentralized structure. It is guaranteed to converge in both convex and non-convex problems. The authors compute network averages of both parameters and gradients.
    Algorithm       DeFW                QFW                  DQSFW
    Decentralized   √                   ×                    √
    Stochastic      ×                   √                    √
    Quantized       ×                   √                    √
    Reference       (Wai et al. 2017)   (Zhang et al. 2019)  Ours

                Table 1: Comparison of related works.
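To make the exact-communication consensus step from the Notations section concrete, the following pure-Python sketch (a toy with one scalar parameter per node; not from the paper) builds the uniformly weighted ring mixing matrix of Eq. (3) and runs repeated gossip rounds:

```python
def ring_mixing_matrix(M):
    """Uniformly weighted ring: each node averages itself and its two neighbors."""
    W = [[0.0] * M for _ in range(M)]
    for i in range(M):
        for j in (i - 1, i, i + 1):
            W[i][j % M] = 1.0 / 3.0
    return W

def gossip_step(W, x):
    """One D-PSGD-style consensus round: x_i <- sum_j w_ij * x_j."""
    M = len(x)
    return [sum(W[i][j] * x[j] for j in range(M)) for i in range(M)]

M = 5
W = ring_mixing_matrix(M)
# W is symmetric and doubly stochastic: every row and column sums to 1.
assert all(abs(sum(row) - 1.0) < 1e-12 for row in W)
assert all(abs(W[i][j] - W[j][i]) < 1e-12 for i in range(M) for j in range(M))

x = [4.0, -1.0, 0.5, 2.5, -1.0]      # local parameters x^{(i)}
avg = sum(x) / M                     # doubly stochastic W preserves this mean
for _ in range(100):
    x = gossip_step(W, x)
print(all(abs(xi - avg) < 1e-9 for xi in x))  # repeated gossip reaches consensus
```

Because W is doubly stochastic, each gossip round preserves the network average, and the spectral gap of the ring drives every node's value to that average geometrically.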

In previous decentralized work such as (Lian et al. 2017), there is no need to average the gradients. Compared with DeFW, our DQSFW turns the deterministic algorithm into a stochastic one, which is better suited to large-scale machine learning tasks: when the number of samples is very large, the full gradient is too expensive to compute. Besides, we also take advantage of gradient quantization, which further reduces the communication cost.

Quantized Frank-Wolfe

The Quantized Frank-Wolfe algorithm (QFW) (Zhang et al. 2019) was recently proposed to solve centralized distributed problems. It uses the following momentum scheme as the gradient estimator:

    ḡ_t = (1 − ρ_t) ḡ_{t−1} + ρ_t g_t                             (4)

which is also used in (Locatello et al. 2019) as a way to decrease the noise of the gradient. In our algorithm, we combine this momentum scheme with the Gossip method. For gradient compression, QFW adopts the s-Partition Encoding Scheme, which encodes the i-th coordinate g_i into an element of the set {±1, ±(s−1)/s, ···, ±1/s, 0}. It requires log₂(s+1) bits to transfer each coordinate of the gradient. A scalar factor ‖g‖_∞ is also transmitted, so the total number of bits for a quantized gradient is 32 + d·log₂(s+1), where d is the dimension of the gradient. When s is large, the variance of this compressor becomes small (Zhang et al. 2019), which means the quantized gradient is more precise, but costs more bits. In this paper, we use the 1-bit signSGD scheme with a scalar factor (Karimireddy et al. 2019; Koloskova, Stich, and Jaggi 2019):

    C(x) = (‖x‖₁ / d) · sign(x)                                   (5)

where d is the dimension of x. Notice that signSGD is only one representative feasible compressor. We can use any other compressor as long as it satisfies the Compressor Assumption in Section 4, an assumption common to many theoretical analyses of gradient quantization. For example, we can also use top-k SGD, a gradient sparsification method that automatically satisfies the Compressor Assumption. We consider signSGD because it is efficient and convenient to implement.

The comparison between DeFW, QFW and our DQSFW is summarized in Table 1. Our DQSFW algorithm is the first to incorporate stochastic gradients and gradient quantization in a decentralized Frank-Wolfe type algorithm.

Figure 1: A graph to demonstrate the counterexample.

Counterexample

In this section, we provide an intuitive counterexample showing the divergence trap when the Frank-Wolfe method is naively generalized to a decentralized algorithm without reaching consensus on the gradient while data at different nodes are drawn from non-identical distributions.

Given f(x) = f₁(x) + f₂(x), where x ∈ Ω = {(x, y) | x² + y² ≤ 1}, f₁(x) = x and f₂(x) = √3·y (see Figure 1), the gradients are v₁ = (1, 0), v₂ = (0, √3) and v = (1, √3). Since the gradient is never zero, according to the Frank-Wolfe gap (see Eq. (9)), the only stationary point is (−1/2, −√3/2) (blue point), where the tangent to the unit circle is perpendicular to the direction (1, √3). However, if we update x by the Frank-Wolfe algorithm on each node separately, the linear oracle yields d₁ = (−1, 0) and d₂ = (0, −1). Making consensus on x then gives the iteration x_{i+1} = (1 − γ)x_i + γ(−1/2, −1/2). The sequence x_i eventually converges to the point (−1/2, −1/2) (red point), which is not a stationary point.

It is reasonable to attribute the divergence to the fact that the linear oracle does not commute with addition. SGD-based decentralized learning algorithms converge well precisely because addition does commute with averaging. This divergence problem is also likely to occur for other SGD variants that involve a non-linear map of the gradients in a decentralized system, not just Frank-Wolfe type methods. For example, adaptive gradient methods are a family of algorithms that adjust the learning rate according to the magnitude of the gradient. The Decentralized ADAM algorithm (DADAM) (Nazari, Tarzanagh, and Michailidis 2019) was proved to converge under a criterion named local regret. Nonetheless, local regret may lead to each node converging to its own local stationary point, without the nodes cooperating well enough as an entire system. This phenomenon reminds us that whenever we generalize an algorithm whose steps do not commute with addition, a similar divergence problem is likely to arise.

In DeFW (Wai et al. 2017), the gradients can be averaged directly by gradient tracking, a technique to accelerate consensus in distributed optimization (Xu et al. 2015; Nedić, Olshevsky, and Shi 2017). The DIGing algorithm (Nedić, Olshevsky, and Shi 2017) also considers the increment of the gradient when averaging the non-quantized full gradient. However, in this paper we have to face the variance
of stochastic gradient and the noise of quantization. These                   Algorithm 1 Decentralized Quantized Stochastic Frank-
issues do not occur in DeFW. Therefore, we have to use a                      Wolfe (DQSFW)
new strategy to let them make consensus gradually.                            Input: restricted domain Ω, matrix W , initial point
                                                                              X̂0 = X0 ∈ Ω
    New Decentralized Quantized Stochastic                                    Parameter: ηt , γt , βt , αt , T
           Frank-Wolfe Algorithm                                              Output: x̄t̂ , where t̂ is chosen uniformly from
In this section, we propose a novel efficient Decentralized                   {0, 1, . . . , T }
Quantized Stochastic Frank-Wolfe (DQSFW) algorithm to                          1: On i-th node:
solve the problem (1) by using the gradient tracking tech-                     2: for t = 0, 1, . . . , T − 1 do
                                                                                                                             (i)
nique (Xu et al. 2015; Wai et al. 2017). The DQSFW algo-                       3:     Compute an estimation of the gradient gt
rithm is given in Algorithm 1.                                                 4:     Update vt
                                                                                                 (i)                   (i)       (i)
                                                                                                          = (1 − βt )vt−1 + βt gt +
                     (i)
   In Algorithm 1, xt is a column vector that denote the                                    P        (j)     (i)
                                                                                      αt j wij (v̂t−1 − v̂t−1 )
model parameter on i-th node at iteration t. We use upper
                                                                                                       (i)                (i)         (i)
case Xt to represent the matrix                                                5:        Compute qt = C(vt − v̂t−1 ) and communicate
                          (1)      (2)                (M )
                                                                                         with neighbors
                 Xt = [xt , xt , . . . , xt                  ]                 6:
                                                                                                          (j)      (j)     (j)
                                                                                         Update replica v̂t = v̂t−1 + qt for neighbor j
Inspired by Choco-Gossip algorithm (Koloskova, Stich, and                                                              (i)          (i)
                                                                               7:        Calculate linear oracle dt such that dt          =
                                            (i)     (i)                                                       (i)
Jaggi 2019), we also define a replicated x̂t of xt on each                               arg maxd∈Ω hd, −vt i
node. The reason is when we apply gossip update, the ex-                                           (i)                         (i)    (i)
                                                                               8:        Update xt+1        = (1 − ηt )xt + ηt dt +
act value of model parameter on other nodes are unknown                                    P         (j)       (i)
since there exists quantization and the communication is in-                             γt j wij (x̂t − x̂t )
                     (i)                      (i)                                                      (i)                (i)           (i)
exact. The replica x̂t is an estimation of xt , which is also                  9:   Compute zt = C(xt+1 − x̂t ) and communicate
updated at each iteration. And the consensus step is formu-                         with neighbors
lated as line 8 in Algorithm 1. According to the update of                    10:
                                                                                                     (j)     (j)  (j)
                                                                                    Update replica x̂t+1 = x̂t + zt for neighbor j
  (i)
x̂t (line 9 and line 10 in Algorithm 1), on all neighbors of                  11: end for
the i-th node, the replica is added by an indentical transmit-
               (i)                              (i)
ted message zt , which implies the values x̂t on all neigh-                    (i)                    (i)           (i)                (i)       (i)
                                                         (i)                  xt should be xt+1 = xt + ηt (dt − xt ). For conve-
bors of the i-th node are the same. Therefore, replica x̂t is
                                                                              nience, we define matrices
well-defined. Similar to Xt , we also define matrices
                                                                                                                   (1)      (2)               (M )
                            (1)      (2)               (M )                                     Vt          =   [vt , vt , . . . , vt                ],
              X̂t   =    [x̂t , x̂t , . . . , x̂t                ],                                                (1) (2)             (M )
                                                                                                V̂t         =   [v̂t , v̂t , . . . , v̂t ],
              Xt    =    [x̄t , x̄t , . . . , x̄t ]
                                                PM (i)                                         Vt           =   [v̄t , v̄t , . . . , v̄t ],
                                             1
where x̄t represent the mean value: x̄t = M         i=1 xt .                                                       (1)      (2)               (M )
     (i)
                                                                                               Dt           =
                                                                                                       [dt , dt , . . . , dt ],
   gt is a stochastic gradient on i-th node calculated by
                           (i)                                                                Dt = [d¯t , d¯t , . . . , d¯t ]
selected samples and vt is our key estimation of the gra-
dient on i-th node which is defined as Eq. (6) with a kind                    where v̄t and d¯t are mean values
of momentum scheme. Here $\hat{v}_t^{(i)}$ is the replica of $v_t^{(i)}$ (see the similar concept of $\hat{x}_t^{(i)}$). For initialization, we set $\hat{v}_{-1}^{(i)} = 0$ and $v_{-1}^{(i)} = g_0^{(i)}$. The definition and role of $\beta_t$ will be discussed later in Remark 2. Our convergence analysis shows that although our gradient estimator in line 4 is biased, the gradients on all nodes get uniformly close to the full gradient. To reach consensus on the gradient and the parameter, we adopt the gossip update (Koloskova, Stich, and Jaggi 2019) in line 4 and line 8, respectively:

$$v_t^{(i)} = (1-\beta_t)v_{t-1}^{(i)} + \beta_t g_t^{(i)} + \alpha_t \sum_j w_{ij}\big(\hat{v}_{t-1}^{(j)} - \hat{v}_{t-1}^{(i)}\big) \quad (6)$$

In line 5 and line 9 of Algorithm 1, we apply a gradient quantization method that satisfies the Compressor Assumption. As mentioned previously, the quantization scheme is not limited to the signSGD used in this paper. Line 7 is the typical linear oracle in the Frank-Wolfe method, which yields a direction $d_t^{(i)}$. In the vanilla Frank-Wolfe algorithm, the update of the parameter is the convex combination $x_{t+1} = (1-\eta_t)x_t + \eta_t d_t$. Define the averages

$$\bar{v}_t = \frac{1}{M}\sum_{i=1}^M v_t^{(i)}, \qquad \bar{d}_t = \frac{1}{M}\sum_{i=1}^M d_t^{(i)}.$$

By the doubly stochastic property of $W$, we have

$$\bar{X}_{t+1} = (1-\eta_t)\bar{X}_t + \eta_t\bar{D}_t \quad (7)$$

It is easy to verify that when $x_0 \in \Omega$, we have $\bar{x}_t \in \Omega$ for all $t$, hence the constraint is always satisfied. Note that we do not have to store all the replicas in practice. We can regard $\sum_j w_{ij}(\hat{x}_t^{(j)} - \hat{x}_t^{(i)})$ as a single term and obtain the iteration formula

$$\sum_j w_{ij}\big(\hat{x}_{t+1}^{(j)} - \hat{x}_{t+1}^{(i)}\big) = \sum_j w_{ij}\big(\hat{x}_t^{(j)} - \hat{x}_t^{(i)}\big) + \sum_j w_{ij}\big(z_t^{(j)} - z_t^{(i)}\big). \quad (8)$$

Therefore, we only need one buffer of the size of $x_t$ to compute this term, and the same holds for $\sum_j w_{ij}(\hat{v}_t^{(j)} - \hat{v}_t^{(i)})$. We will use Eq. (8) to save memory in our experiments.

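For intuition about the linear oracle in line 7, here is a sketch for the trace-norm ball used later in our matrix completion experiment (the function name is ours): the minimizer of $\langle D, \nabla f\rangle$ over $\|D\|_* \le C$ is the rank-one matrix $-C\,uv^{\top}$ built from the leading singular vectors of the gradient, so only the leading pair is needed.

```python
import numpy as np

def trace_norm_linear_oracle(grad, C):
    # argmin_{||D||_* <= C} <D, grad> is attained at D = -C * u v^T,
    # where u, v are the singular vectors of grad for its largest
    # singular value (the "leading vectors" of the SVD).
    U, s, Vt = np.linalg.svd(grad, full_matrices=False)
    return -C * np.outer(U[:, 0], Vt[0, :])
```

Since $\langle -C\,uv^{\top}, \nabla f\rangle = -C\,\sigma_{\max}(\nabla f)$, the same computation also gives the Frank-Wolfe gap at essentially no extra cost; in practice the leading pair can be obtained by power iteration or a truncated SVD instead of a full decomposition.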
Convergence Analysis

In this section, we study the convergence properties of our DQSFW algorithm. All proofs can be found in the Supplementary Material. We begin by introducing some mild assumptions and the definition of the Frank-Wolfe gap (Jaggi 2013):

$$G(x) = \max_{v\in\Omega}\,\langle v - x, -\nabla f(x)\rangle \quad (9)$$

The convergence criterion is $\mathbb{E}[G(x)] \le \epsilon$.

Assumption 1. (Lipschitz Gradient) There is a constant $L$ such that for all $i \in \{1, 2, \ldots, M\}$ we have
$$\|\nabla f_i(x) - \nabla f_i(y)\| \le L\|x - y\|. \quad (10)$$

Assumption 2. (Compact Domain) The domain $\Omega$ has diameter $D$.

Assumption 3. (Lower Bound) The function $f(x)$ has the lower bound $\inf_{x\in\Omega} f(x) = f^- > -\infty$.

Assumption 4. (Spectral Gap) Given the symmetric doubly stochastic matrix $W$, let $\lambda_1, \lambda_2, \ldots, \lambda_M$ be its eigenvalues in descending order. Then $\max\{|\lambda_2|, |\lambda_M|\} < 1$. Let $\rho = \max\{|\lambda_2|, |\lambda_M|\}$ and $\zeta = 1 - \lambda_M$.

Assumption 5. (Compressor Assumption) The compressor $C(\cdot)$ satisfies $\|C(x) - x\|^2 \le (1-\delta)\|x\|^2$, where $0 < \delta \le 1$.

Assumption 6. (Bounded Gradient and Bounded Variance) The gradient estimator $g_t^{(i)}$ satisfies $\mathbb{E}[g_t^{(i)}] = \nabla f_i(x_t^{(i)})$, $\mathbb{E}\|g_t^{(i)} - \nabla f_i(x_t^{(i)})\|^2 \le \sigma^2$, and $\|\nabla F^{(i)}(x_t^{(i)}; \xi)\| \le G$.

Based on the above assumptions, we present three lemmas for our DQSFW algorithm. Lemma 1 and Lemma 2 estimate the consensus of the model parameter $X_t$ and the gradient $V_t$, respectively. Lemma 3 implies that our gradient estimator Eq. (6) on different nodes approaches the value of the full gradient uniformly and gradually. The detailed proofs of the lemmas can be found in supplementary material section A.

Lemma 1. Let $\delta_0 = \frac{1-\sqrt{1-\delta}}{2}$, $\delta_1 = 1-(1-\delta_0^2)^2$, $\alpha_t = \gamma_t = \gamma = \min\{1, \frac{\delta_0}{\zeta}, \frac{(1-\rho)\delta_1}{2\zeta^2}\}$, $c_1 = (1-\rho)\gamma$, $c_2 = \delta$, $c_3 = \min\{\frac{(1-\rho)\gamma}{2}, \frac{\delta_1}{2}\}$, $A = (1+c_1)(1-(1-\rho)\gamma)^2 + (1-\delta)(1+\frac{1}{c_2})\gamma^2\zeta^2$, $B = (1+\frac{1}{c_1})\gamma^2\zeta^2 + (1-\delta)(1+c_2)(1+\gamma\zeta)^2$, and $Q = 8(1+\frac{1}{c_3})(A+B)$. Set $\eta_0 = \frac{c_3^3}{16(1+c_3)(A+B)}$ and $\eta_t = \frac{\eta_0}{(t+1)^\theta}$ with $0 < \theta < 1$. Then there exists a constant $R_1$ satisfying

$$\|X_t - \bar{X}_t\|_F^2 + \|X_t - \hat{X}_t\|_F^2 \le \frac{QR_1MD^2}{(t+1)^{2\theta}}$$

Lemma 2. Let $c_4 = \frac{c_3}{2}$, $\beta_0 = \frac{Ac_4}{B(1+c_4)}$ and $\beta_t = \frac{\beta_0}{(t+1)^{2\theta/3}}$. Then there exists a constant $R_2$ such that

$$\|V_t - \bar{V}_t\|_F^2 + \|V_t - \hat{V}_t\|_F^2 \le \frac{QR_2MG^2}{(t+1)^{2\theta/3}}$$

Lemma 3. Denote $\bar{v}_t = \frac{1}{M}\sum_{i=1}^M v_t^{(i)}$. There exists a constant $S$ such that

$$\mathbb{E}\Big\|\bar{v}_t - \frac{1}{M}\sum_{i=1}^M \nabla f_i(x_t^{(i)})\Big\|^2 \le \frac{S}{(t+1)^{2\theta/3}}$$

Next, we present the main theorem of our convergence analysis. Please see the detailed proof in supplementary material section B.

Theorem 1. Let $Q$, $R_1$, $R_2$ and $S$ be the constants defined in Lemma 1 to Lemma 3, and let the step size $\eta_t$ be set as in Lemma 1. Then we have

$$\mathbb{E}[G(\bar{x}_t)] \le \mathbb{E}\Big[\frac{f(\bar{x}_t) - f(\bar{x}_{t+1})}{\eta_t}\Big] + \frac{D\sqrt{2(S + QR_1L^2D^2)}}{(t+1)^{\theta/3}} + \frac{D\sqrt{QR_2}\,G}{(t+1)^{\theta/3}} + \frac{\eta_t L^2D^2}{2}.$$

Theorem 2. Suppose $T$ iterations have been completed. Let $\hat{t}$ be chosen uniformly at random from $\{0, 1, \ldots, T\}$. Set $\theta = \frac{3}{4}$. Then by Theorem 1 we obtain

$$\mathbb{E}[G(\bar{x}_{\hat{t}})] = O\Big(\frac{1}{T^{1/4}}\Big).$$

Remark 1. Theorem 2 shows that our DQSFW algorithm reaches a gradient complexity of $O(\epsilon^{-4})$ to achieve an $\epsilon$-accurate stationary point. Moreover, the Frank-Wolfe gap is asymptotic to 0, rather than to a neighborhood whose size depends on $\epsilon$. This is because all parameters in our algorithm are independent of $\epsilon$, while in SFW the step size and the number of iterations are functions of $\epsilon$.

Remark 2. $\theta = \frac{3}{4}$ gives the best trade-off between consensus and step size. If $\beta_t$ is too large, the noise of quantization and the variance of the stochastic gradient cause poor consensus and thus hurt convergence. If $\beta_t$ is too small, the step size must also be small; otherwise the averaged gradient cannot keep up with the changes of $x$, which causes slow convergence. This trade-off is the main challenge, and the intuition behind defining our gradient estimator as in Eq. (6).

Remark 3. From the proof in the supplementary material, the theoretical framework works not only for signSGD but for all compressors that satisfy Assumption 5.

Experimental Results

To validate the efficiency of our new DQSFW algorithm, we conduct experiments on two constrained machine learning applications: matrix completion and model compression.

Decentralized Low-Rank Matrix Completion

Low-rank matrix completion is a model for a broad range of learning tasks, such as collaborative filtering (Koren, Bell, and Volinsky 2009) and multi-label learning (Xu, Jin, and Zhou 2013). The loss function of the low-rank matrix completion problem has the following form:

$$\min_{X\in\mathbb{R}^{M\times N}} \sum_{(i,j)\in\Omega} \phi(X_{ij} - Y_{ij}), \quad \text{s.t. } \|X\|_* \le C$$

where $\phi: \mathbb{R} \to \mathbb{R}$ is the potentially non-convex empirical loss function, $Y$ is the target matrix, and $\Omega$ is the set of observed entries. (Wai et al. 2017) also conducts this experiment with the MSE loss function and the robust Gaussian loss function. In our

Figure 2: The experimental results of decentralized low-rank matrix completion on the MovieLens 100k and MovieLens 1m datasets. Figures (a), (b) and (c) show the training loss with respect to bits transferred on MovieLens 100k with 20, 40 and 60 workers respectively. Figures (d), (e) and (f) show the training loss with respect to bits transferred on MovieLens 1m with 20, 40 and 60 workers respectively.

experiment, to verify that our algorithm works well for non-convex objective functions, we adopt the robust Gaussian loss function

$$\phi(z) = \frac{\sigma_0}{2}\Big(1 - \exp\big(-\frac{z^2}{\sigma_0}\big)\Big) \quad (11)$$

In our experiment, the parameter $\sigma_0$ is fixed to 2. We run our experiment on two benchmark datasets, MovieLens 100k and MovieLens 1m (Harper and Konstan 2015). Both datasets are records of movie ratings from many users and are commonly used to train recommendation systems. The descriptions of these two datasets are shown in Table 2. MovieLens 100k has 943 users, 1682 movies, and 100000 rating records. MovieLens 1m has 6040 users, 3952 movies, and 1000209 rating records. All ratings vary from 0 to 5, and we scale them to the interval [0, 1]. The rating records can be converted into a matrix whose rows represent user ids and whose columns represent movie ids. Each record serves as an observation. As our purpose is to verify the performance of the optimization algorithm, we take all data for training.

Name            Users  Movies  Records
MovieLens 100k  943    1682    100000
MovieLens 1m    6040   3952    1000209

Table 2: Descriptions of the MovieLens datasets.

For both datasets, we deploy our experiment on M = 20, 40, 60 MPI worker nodes respectively via mpi4py. Each node is an Intel Xeon E5-2660 machine within an InfiniBand network. We assign 1/M of the rating records to each worker. For MovieLens 100k we set C = 2000, while for MovieLens 1m we set C = 5000.

In this task, the linear oracle can be obtained by singular value decomposition (SVD). Let the SVD of $v_t^{(i)}$ be $U \cdot S \cdot V^T$. Then the linear oracle is $d = -C \cdot u \cdot v^T$, where $u$ and $v$ are the singular vectors corresponding to the largest singular value (also called the leading vectors of the SVD). In practice, we only need to compute the leading vectors, while in projected algorithms we would have to perform the complete SVD.

We choose two other projection-free methods, DeFW (Wai et al. 2017) and QFW (Zhang et al. 2019), as baseline methods. For the decentralized algorithms, we use a ring-based topology as the communication network because it is convenient to implement and achieves linear speedup in communication (Lian et al. 2017). For QFW, the parameter s of the s-partition encoding is set to 1. For all three algorithms, we set the step size $\eta_t = t^{-0.75}$. The results of low-rank matrix completion on MovieLens 100k and MovieLens 1m are shown in Figure 2. As many previous quantization works did (including QFW), we analyze the experimental results with respect to bits transferred, which means the number of bits sent or received on the busiest node. For decentralized algorithms it can be any node, and for centralized algorithms it is the master node. The number is divided by the size of x since it is always proportional to the size of x.

From the experimental results we can see that our DQSFW algorithm achieves the best performance on both datasets. Moreover, our algorithm becomes more competitive as the number of workers grows, which verifies the scalability of our method. With more workers, more sample gradients can be computed, while the number of gradients computed by DeFW stays the same.

Model Compression

Deep neural networks (DNNs) have achieved remarkable performance in many fields in recent years, but one of their shortcomings is the high cost of a large number of parameters. Thus, many attempts have been made to reduce the number of parameters in DNNs, such as dropout and pruning. Among these methods, one solution is to add a constraint on the parameters to enforce sparsity (Liu et al. 2015; Bietti et al. 2019). In (Bietti et al. 2019), one popular method was proposed via adding the spectral norm constraint

$$\|W_l\|_2 \le \tau \quad (12)$$

for each layer $l$. Because model compression is attracting increasing attention in both machine learning research and applications, in this experiment we solve the model compression problem in the decentralized learning setting and validate the performance of different decentralized quantized algorithms on this task. Here the linear oracle is $d = -\tau \cdot U \cdot V^T$, where $U \cdot S \cdot V^T$ is the SVD of $v_t^{(i)}$. This result follows easily from the fact that the trace norm and the spectral norm are dual norms.

In our experiment, we run this task to compress a VGG11 network on the CIFAR-10 dataset, which has 50000 training samples and 10 labels, with constraint (12). We conduct this task in a decentralized setting where data are distributed on different nodes. Following (Bietti et al. 2019), we use the cross-entropy loss as the criterion and set $\tau = 0.8$. The experiment is implemented on 8 GTX 1080 GPUs via PyTorch. Each GPU is treated as a single worker. The communication is based on NVIDIA NCCL.

We consider DeFW and QFW as baseline methods and a ring-based topology as the communication network. For DeFW and our DQSFW, the decentralized system is a uniform-weight ring network. For QFW, s is set to 1. For all three algorithms, the step size is chosen as $\eta_t = \frac{1}{2}t^{-0.75}$. Because of the limitation of CUDA memory, we cannot compute the full gradient for DeFW; we calculate 1/5 of the full gradient instead. This issue also indicates the limitation of the DeFW algorithm.

The experimental results are visualized in Figure 3. To validate the efficiency of our algorithm, we compare the loss with respect to the bits transmitted. Similar to the matrix completion experiment, the number of bits transferred is divided by the size of the parameter. For decentralized algorithms, the number is counted on any node, while for the centralized algorithm it is counted on the master node. According to the results, DeFW is almost infeasible for this task. In terms of time, QFW and DQSFW have similar performance. In terms of bits transferred, our DQSFW has the best performance among the three algorithms, which verifies the superior performance of our new algorithm.

Figure 3: The experimental results of decentralized model compression for the VGG11 neural network on the CIFAR-10 dataset. Figure (a) shows the training loss against time. Figure (b) shows the training loss against the number of bits transferred.

Conclusion

In this paper, we proposed a new Decentralized Quantized Stochastic Frank-Wolfe (DQSFW) algorithm to solve non-convex constrained optimization problems. We revealed a potential divergence problem that is likely to occur in general decentralized training, not just for Frank-Wolfe type methods, and provided a solution by making consensus on the gradient. We derived a new theoretical analysis to prove that our algorithm achieves the same gradient complexity $O(\epsilon^{-4})$ as the Stochastic Frank-Wolfe (SFW) method with much less communication cost, and that the Frank-Wolfe gap is asymptotic to 0. The experimental results on two machine learning applications, matrix completion and deep neural network compression, validate the superior performance of our new algorithm.

Acknowledgements

We thank the anonymous reviewers for their helpful comments. We also thank the IT Help Desk at the University of Pittsburgh. This work was partially supported by NSF IIS 1845666, 1852606, 1838627, 1837956, 1956002, 2040588.

References

Alistarh, D.; Grubic, D.; Li, J. Z.; Tomioka, R.; and Vojnovic, M. 2017. QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding. Neural Information Processing Systems.

Bernstein, J.; Wang, Y.; Azizzadenesheli, K.; and Anandkumar, A. 2018a. signSGD: Compressed Optimization for Non-Convex Problems. arXiv:1802.04434.

Bernstein, J.; Zhao, J.; Azizzadenesheli, K.; and Anandkumar, A. 2018b. signSGD with Majority Vote is Communication Efficient And Fault Tolerant. arXiv:1810.05291.

Bietti, A.; Mialon, G.; Chen, D.; and Mairal, J. 2019. A Kernel Perspective for Regularizing Deep Neural Networks. International Conference on Machine Learning.

Doan, T. T.; Maguluri, S. T.; and Romberg, J. 2018. Fast Convergence Rates of Distributed Subgradient Methods with Adaptive Quantization. arXiv:1810.13245.

Ghadimi, S.; Lan, G.; and Zhang, H. 2016. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming 155(1-2): 267–305.

Harper, F. M.; and Konstan, J. A. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems 5(4): 1–19.

Huang, F.; Chen, S.; and Huang, H. 2019. Faster Stochastic Alternating Direction Method of Multipliers for Nonconvex Optimization. In International Conference on Machine Learning, 2839–2848.

Jaggi, M. 2013. Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization. In International Conference on Machine Learning, 427–435.

Karimireddy, S. P.; Rebjock, Q.; Stich, S. U.; and Jaggi, M. 2019. Error Feedback Fixes SignSGD and other Gradient Compression Schemes. International Conference on Machine Learning.

Koloskova, A.; Lin, T.; Stich, S. U.; and Jaggi, M. 2020. Decentralized Deep Learning with Arbitrary Communication Compression. International Conference on Learning Representations.

Koloskova, A.; Stich, S. U.; and Jaggi, M. 2019. Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication. International Conference on Machine Learning.

Konečný, J.; McMahan, H. B.; Ramage, D.; and Richtárik, P. 2016. Federated Optimization: Distributed Machine Learning for On-Device Intelligence. arXiv:1610.02527.

Koren, Y.; Bell, R.; and Volinsky, C. 2009. Matrix factorization techniques for recommender systems. Computer 42(8): 30–37.

Lian, X.; Zhang, C.; Zhang, H.; Hsieh, C.-J.; Zhang, W.; and Liu, J. 2017. Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent. Neural Information Processing Systems.

Lian, X.; Zhang, W.; Zhang, C.; and Liu, J. 2018. Asynchronous Decentralized Parallel Stochastic Gradient Descent. International Conference on Machine Learning.

Liu, B.; Wang, M.; Foroosh, H.; Tappen, M.; and Penksy, M. 2015. Sparse Convolutional Neural Networks. Conference on Computer Vision and Pattern Recognition.

Locatello, F.; Yurtsever, A.; Fercoq, O.; and Cevher, V. 2019. Stochastic Frank-Wolfe for Composite Convex Minimization. In Advances in Neural Information Processing Systems, 14269–14279.

Nazari, P.; Tarzanagh, D. A.; and Michailidis, G. 2019. DADAM: A Consensus-based Distributed Adaptive Gradient Method for Online Optimization. arXiv:1901.09109.

Nedić, A.; Olshevsky, A.; and Shi, W. 2017. Achieving Geometric Convergence for Distributed Optimization over Time-varying Graphs. SIAM Journal on Optimization 27(4): 2597–2633.

Nedić, A.; and Ozdaglar, A. 2009. Distributed Subgradient Methods for Multi-Agent Optimization. IEEE Transactions on Automatic Control 54(1): 48–61.

Ram, S. S.; Nedić, A.; and Veeravalli, V. V. 2010. Distributed stochastic subgradient projection algorithms for convex optimization. Journal of Optimization Theory and Applications 147(3): 516–545.

Reddi, S. J.; Sra, S.; Póczós, B.; and Smola, A. 2016. Stochastic Frank-Wolfe Methods for Nonconvex Optimization. arXiv:1607.08254.

Reisizadeh, A.; Mokhtari, A.; Hassani, H.; and Pedarsani, R. 2019a. An Exact Quantized Decentralized Gradient Descent Algorithm. IEEE Transactions on Signal Processing 67(19): 4934–4947.

Reisizadeh, A.; Taheri, H.; Mokhtari, A.; Hassani, H.; and Pedarsani, R. 2019b. Robust and Communication-Efficient Collaborative Learning. Neural Information Processing Systems.

Seide, F.; Fu, H.; Droppo, J.; Li, G.; and Yu, D. 2014. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association.

Tang, H.; Lian, X.; Qiu, S.; Yuan, L.; Zhang, C.; Zhang, T.; and Liu, J. 2019. DeepSqueeze: Decentralization Meets Error-Compensated Compression. arXiv:1907.07346.

Tang, H.; Lian, X.; Yan, M.; Zhang, C.; and Liu, J. 2018. D²: Decentralized Training over Decentralized Data. International Conference on Machine Learning.

Wai, H.-T.; Lafond, J.; Scaglione, A.; and Moulines, E. 2017. Decentralized Frank-Wolfe algorithm for convex and nonconvex problems. IEEE Transactions on Automatic Control 62(11): 5522–5537.

Xu, J.; Zhu, S.; Soh, Y. C.; and Xie, L. 2015. Augmented distributed gradient methods for multi-agent optimization under uncoordinated constant stepsizes. In 2015 54th IEEE Conference on Decision and Control (CDC), 2055–2060. IEEE.

Xu, M.; Jin, R.; and Zhou, Z.-H. 2013. Speedup matrix completion with side information: Application to multi-label learning. In Advances in Neural Information Processing Systems, 2301–2309.

Yan, F.; Sundaram, S.; Vishwanathan, S.; and Qi, Y. 2012. Distributed autonomous online learning: Regrets and intrinsic privacy-preserving properties. IEEE Transactions on Knowledge and Data Engineering 25(11): 2483–2493.

Zhang, M.; Chen, L.; Mokhtari, A.; Hassani, H.; and Karbasi, A. 2019. Quantized Frank-Wolfe: Faster Optimization, Lower Communication, and Projection Free. arXiv:1902.06332.
