Communication-Efficient Frank-Wolfe Algorithm for Nonconvex Decentralized Distributed Learning
The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21)

Communication-Efficient Frank-Wolfe Algorithm for Nonconvex Decentralized Distributed Learning

Wenhan Xian 1, Feihu Huang 1, Heng Huang 1,2
1 Department of Electrical and Computer Engineering, University of Pittsburgh, Pittsburgh, USA
2 JD Finance America Corporation, Mountain View, CA, USA
wex37@pitt.edu, huangfeihu2018@gmail.com, heng.huang@pitt.edu

Abstract

Recently decentralized optimization has attracted much attention in machine learning because it is more communication-efficient than the centralized fashion. Quantization is a promising method to reduce the communication cost by cutting down the budget of each single communication via gradient compression. To further improve the communication efficiency, some quantized decentralized algorithms have recently been studied. However, quantized decentralized algorithms for nonconvex constrained machine learning problems are still limited. The Frank-Wolfe (a.k.a., conditional gradient or projection-free) method is very efficient for many constrained optimization tasks, such as training low-rank or sparsity-constrained models. In this paper, to fill the gap in decentralized quantized constrained optimization, we propose a novel communication-efficient Decentralized Quantized Stochastic Frank-Wolfe (DQSFW) algorithm for non-convex constrained learning models. We first design a new counterexample to show that the vanilla decentralized quantized stochastic Frank-Wolfe algorithm usually diverges. Thus, we propose the DQSFW algorithm with the gradient tracking technique to guarantee that the method safely converges to a stationary point of the non-convex optimization. In our theoretical analysis, we prove that to achieve a stationary point our DQSFW algorithm attains the same gradient complexity as the standard stochastic Frank-Wolfe and centralized Frank-Wolfe algorithms, but with much less communication cost. Experiments on matrix completion and model compression applications demonstrate the efficiency of our new algorithm.

Introduction

Nowadays, many machine learning tasks have been deployed on distributed systems that enable computations in parallel, especially for large-scale models such as deep neural networks (DNNs). Recently, the centralized distribution structure has often been used. However, the centralized learning scheme has a key communication bottleneck: the communication burden of the central server grows as the number of nodes grows. For example, when the system has M workers, the central server suffers from a communication complexity of O(M). Thus, the decentralized distribution structure has recently attracted much attention in machine learning due to its communication efficiency compared with the centralized fashion. Specifically, decentralized optimization adopts a pattern in which each node maintains its own local data and model and only communicates with its neighbors. In fact, the communication complexity of a decentralized system at each iteration depends on the degree of the graph topology (usually not on the number of nodes), and data locality is allowed.

Recently, decentralized learning has become a popular research topic in machine learning and has been widely studied. For example, it was first studied to solve problems of computing aggregates among clients, or used for the sake of data locality and privacy (Ram, Nedić, and Veeravalli 2010; Yan et al. 2012), where a centralized structure is not allowed. A general decentralized algorithm can be traced back to (Nedić and Ozdaglar 2009), which combines the gradient descent method with a Gossip-type consensus step. Subsequently, a degenerate case of decentralization that has achieved huge success is federated optimization (Konečný et al. 2016), which adopts a star topology but enables data to be drawn from non-iid distributions. Besides, many other fully decentralized works based on general network topologies, such as (Lian et al. 2017; Tang et al. 2018), have shown that decentralized methods can achieve more efficient communication without sacrificing the training result, indicating that decentralization is becoming competitive and advantageous in distributed learning, rather than just an alternative to centralization when centralization is not allowed. Lian et al. (2017) presented an important decentralized optimization work to verify that the decentralized method can outperform its centralized counterpart. They proposed an algorithm named Decentralized Parallel Stochastic Gradient Descent (D-PSGD) to directly compute the average among the nodes with exact communication, which has the same convergence rate as centralized SGD in nonconvex optimization with non-identical data distributions.

To reduce the communication cost in distributed systems, gradient quantization (Seide et al. 2014) is another effective method. Recently, many quantized gradient algorithms, such as QSGD (Alistarh et al. 2017), signSGD (Bernstein et al. 2018a) and its variant (Bernstein et al. 2018b), were developed and showed excellent performance. In these algorithms, the number of bits transmitted in each communication round is reduced by packing and unpacking gradients.

Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Alistarh et al. (2017) proposed an unbiased quantization scheme and proved that it is capable of converging under convex and non-convex conditions. However, for other quantization methods like 1-bit quantization or signSGD, the unbiasedness assumption is not always satisfied. Karimireddy et al. (2019) proved that when applying signSGD with a scalar factor and the error-feedback technique, the algorithm is guaranteed to converge in non-convex optimization. More recently, to further improve communication efficiency, multiple quantized decentralized algorithms (Doan, Maguluri, and Romberg 2018; Reisizadeh et al. 2019a,b; Tang et al. 2019; Koloskova et al. 2020) have been introduced. However, to the best of our knowledge, quantized decentralized algorithms for constrained problems are still very limited. In fact, large-scale constrained optimization problems are popular in many machine learning applications, such as matrix completion and deep neural network compression.

To address this challenging issue, in this paper we focus on studying a quantized decentralized algorithm for solving the following constrained optimization problem:

    min_{x∈Ω} (1/M) Σ_{i=1}^{M} f_i(x),    (1)

where f_i(x) is a nonconvex smooth loss function, Ω is a convex and compact constraint set, and M is the number of worker nodes. f_i(x) is the objective function on node i and can have the stochastic expectation or finite-sum formulation:

    f_i(x) = E_{ξ∼D_i} F^{(i)}(x; ξ)            (stochastic)
    f_i(x) = (1/n_i) Σ_{j=1}^{n_i} F_j^{(i)}(x)  (finite-sum)    (2)

where D_i is the data distribution on the i-th node. Apparently, the finite-sum objective is a particular case of the stochastic problem where D_i consists of finitely many samples. We allow the distributions D_i to be non-identical, which is more adaptive to general machine learning tasks and is assumed in many previous decentralized analyses (Lian et al. 2017; Tang et al. 2018; Lian et al. 2018).

To solve the above constrained optimization problem, the Frank-Wolfe (a.k.a. conditional gradient or projection-free) method is one of the most efficient and popular algorithms, because it only requires computing a linear oracle instead of the expensive projection operator used in proximal gradient methods (Ghadimi, Lan, and Zhang 2016) and the alternating direction method of multipliers (Huang, Chen, and Huang 2019). In this paper, we therefore focus on designing a communication-efficient quantized decentralized Frank-Wolfe algorithm to solve the above problem (1). It is nontrivial to design such an algorithm. We first provide a counterexample to show that the vanilla quantized decentralized Frank-Wolfe algorithm usually diverges (please see the following Counterexample section). Thus, there exists an important research problem to be addressed:

Can we design a communication-efficient quantized decentralized Frank-Wolfe algorithm with a convergence guarantee for non-convex optimization?

In this paper, we answer the above challenging question positively and propose a novel Decentralized Quantized Stochastic Frank-Wolfe (DQSFW) algorithm to solve problem (1) using the gradient tracking technique, which guarantees that DQSFW converges safely and fast to a stationary point in non-convex optimization. Specifically, our DQSFW algorithm uses a 1-bit gradient quantization scheme. In summary, the main contributions of this paper are as follows:

(1) We propose a novel efficient Decentralized Quantized Stochastic Frank-Wolfe (DQSFW) method to solve problem (1) with less communication cost but still good convergence speed.

(2) We derive a rigorous theoretical analysis for our DQSFW algorithm, and prove that it has the same gradient complexity O(ε^{-4}) as SFW (Reddi et al. 2016) (the sequential algorithm) and QFW (Zhang et al. 2019) (the centralized algorithm), but with much less communication cost.

(3) We provide a new intuitive counterexample to show that decentralized optimization involving a non-linear projection of the gradient can lead to a potential divergence problem, which also arises in many cases where other non-Frank-Wolfe methods are generalized to decentralized algorithms. To tackle this challenge, we utilize the gradient tracking technique to guarantee the convergence of our decentralized quantized Frank-Wolfe algorithm.

Notations

‖·‖_1 denotes the l1 norm of a vector. ‖·‖_2 denotes the spectral norm of a matrix. ‖·‖_F denotes the Frobenius norm of a matrix. ‖·‖_* denotes the trace norm of a matrix. Let 1 be the column vector whose entries are all one. Given a network with M nodes, we define a mixing matrix W = (w_ij) ∈ R^{M×M} that represents the weights of neighbors in the communication round. For example, in D-PSGD (Lian et al. 2017), the consensus step on the i-th node is formulated as

    x_t^{(i)} = Σ_{j=1}^{M} w_ij x_t^{(j)}.

Generally, W is a symmetric doubly stochastic matrix that satisfies W^T = W and W1 = 1. In the experiment section, we will consider a uniformly weighted ring-based network, whose mixing matrix is shown in Eq. (3):

        [ 1/3  1/3   0   ...   0   1/3 ]
        [ 1/3  1/3  1/3   0   ...   0  ]
    W = [  0   1/3  1/3  1/3   0  ...  ]    (3)
        [  :    ..    ..   ..          ]
        [  0   ...   0   1/3  1/3  1/3 ]
        [ 1/3   0   ...   0   1/3  1/3 ]

Related Works

Decentralized Frank-Wolfe

The Decentralized Frank-Wolfe algorithm (DeFW) (Wai et al. 2017) was recently proposed to apply the deterministic Frank-Wolfe method in a decentralized structure. It is guaranteed to converge in both convex and non-convex problems. The authors compute network averaging on parameters and gradients.
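The uniformly weighted ring matrix of Eq. (3) and the D-PSGD-style consensus step above can be sketched as follows. This is a minimal illustration with numpy; the network size M and the test vectors are arbitrary choices, not from the paper:

```python
import numpy as np

def ring_mixing_matrix(M):
    """Uniformly weighted ring: each node averages itself and its two neighbors."""
    W = np.zeros((M, M))
    for i in range(M):
        W[i, i] = W[i, (i - 1) % M] = W[i, (i + 1) % M] = 1.0 / 3
    return W

M = 6
W = ring_mixing_matrix(M)

# W is symmetric and doubly stochastic: W^T = W and W @ 1 = 1.
assert np.allclose(W, W.T)
assert np.allclose(W.sum(axis=1), 1.0)

# One consensus step x_i <- sum_j w_ij x_j is X <- X @ W
# for X = [x^(1), ..., x^(M)] stored column-wise.
X = np.random.randn(4, M)
X = X @ W

# Repeated gossip steps drive all columns to the global average, because
# max{|lambda_2|, |lambda_M|} < 1 (the spectral gap of Assumption 4).
for _ in range(200):
    X = X @ W
assert np.allclose(X, X.mean(axis=1, keepdims=True), atol=1e-6)
```

The spectral gap here is what makes plain gossip contract toward consensus; the quantized algorithm below replaces the exact neighbor values with quantized replicas.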
    Algorithm        DeFW               QFW                  DQSFW
    Decentralized    yes                no                   yes
    Stochastic       no                 yes                  yes
    Quantized        no                 yes                  yes
    Reference        (Wai et al. 2017)  (Zhang et al. 2019)  Ours

Table 1: Comparison of related works.

In previous work on decentralization such as (Lian et al. 2017), there is no need to compute the average of gradients. Compared with DeFW, our DQSFW changes the deterministic algorithm to a stochastic one, which is more adaptive to large-scale machine learning tasks: when the number of samples is very large, the full gradient is too expensive to calculate. Besides, we also take advantage of gradient quantization, which further reduces the communication cost.

Quantized Frank-Wolfe

The Quantized Frank-Wolfe algorithm (QFW) (Zhang et al. 2019) was recently proposed to solve centralized distributed problems. It uses the following momentum scheme as a gradient estimator:

    ḡ_t = (1 − ρ_t)ḡ_{t−1} + ρ_t g_t    (4)

which is also used in (Locatello et al. 2019) as a way to decrease the noise of the gradient. In our algorithm, we combine this momentum scheme with the Gossip method. For gradient compression, they adopt the s-Partition Encoding Scheme, which encodes the i-th coordinate g_i into an element of the set {±1, ±(s−1)/s, ..., ±1/s, 0}. It requires log2(s + 1) bits to transfer each coordinate of the gradient. A scalar factor ‖g‖_∞ is also transmitted, thus the total number of bits of the quantized gradient is 32 + d·log2(s + 1), where d is the dimension of the gradient. When s is large, the variance of this compressor becomes small (Zhang et al. 2019), which means the quantized gradient is more precise, but costs more bits. In this paper, we use the 1-bit signSGD scheme with a scalar factor (Karimireddy et al. 2019; Koloskova, Stich, and Jaggi 2019), shown as follows:

    C(x) = (‖x‖_1 / d) · sign(x)    (5)

where d is the dimension of x. Notice that here signSGD is only a representative of feasible compressors. We can also use any other compressor as long as it satisfies the Compressor Assumption in section 4, which is an important assumption in many theoretical analyses of related work on gradient quantization. For example, we can also use top-k SGD, a gradient sparsification method that automatically satisfies the Compressor Assumption. We consider signSGD because it is efficient and convenient to implement.

The comparisons between DeFW, QFW and our DQSFW are summarized in Table 1. We can see that our DQSFW algorithm is the first work to incorporate stochastic gradient descent and gradient quantization in a decentralized Frank-Wolfe type algorithm.

Figure 1: A graph to demonstrate the counterexample.

Counterexample

In this section, we provide an intuitive counterexample that illustrates the divergence trap when the Frank-Wolfe method is simply generalized to a decentralized algorithm without making consensus on the gradient, and data at different nodes are drawn from non-identical distributions.

Given f(x) = f1(x) + f2(x), where x ∈ Ω = {(x, y) | x² + y² ≤ 1}, f1(x) = x and f2(x) = √3·y (see Figure 1). We can calculate the gradients v1 = (1, 0), v2 = (0, √3), v = (1, √3). Since the gradient is never equal to 0, according to the Frank-Wolfe gap (see Eq. (9)), the only stationary point is (−1/2, −√3/2) (the blue point in Figure 1), where the tangent of the unit ball is perpendicular to the direction (1, √3). However, if we update x by the Frank-Wolfe algorithm on each node separately, the linear oracle yields d1 = (−1, 0) and d2 = (0, −1). After making consensus on x, we get the iteration formula x_{t+1} = (1 − γ)x_t + γ(−1/2, −1/2). The sequence x_t eventually converges to the point (−1/2, −1/2) (the red point in Figure 1), which is not a stationary point.

It is reasonable to attribute the divergence to the non-commutativity between the linear oracle and addition. SGD-based decentralized learning algorithms converge well because of the commutative property of addition. The above divergence problem is also likely to occur in other variants of SGD that involve a non-linear map of gradients in a decentralized system, not just Frank-Wolfe type methods. For example, adaptive gradient methods are a family of algorithms that adjust the learning rate according to the magnitude of the gradient. The Decentralized ADAM algorithm (DADAM) (Nazari, Tarzanagh, and Michailidis 2019) was proved to converge under a criterion named local regret. Nonetheless, local regret probably leads to a result in which each node converges to its own local stationary point, while the nodes do not cooperate well enough as an entire system. This phenomenon reminds us that when we generalize an algorithm with steps that are not commutative with addition, a similar divergence problem is likely to arise.

In DeFW (Wai et al. 2017), the gradients can be averaged directly by gradient tracking, a technique to accelerate consensus in distributed optimization (Xu et al. 2015; Nedić, Olshevsky, and Shi 2017). DIGing also considers the increment of the gradient when averaging the non-quantized full gradient. However, in this paper we have to face the variance of the stochastic gradient and the noise of quantization. These issues do not occur in DeFW. Therefore, we use a new strategy to let the nodes reach consensus gradually.
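The counterexample above is easy to reproduce numerically. The sketch below (numpy; the step-size schedule γ_t = 1/(t + 2) is an illustrative choice, not prescribed by the paper) runs the vanilla scheme — a local linear oracle on each node followed by exact averaging — and shows the iterates settling at (−1/2, −1/2) rather than at the stationary point (−1/2, −√3/2):

```python
import numpy as np

def lmo_unit_ball(g):
    """Linear oracle over the l2 unit ball: argmax_{||d|| <= 1} <d, -g> = -g/||g||."""
    return -g / np.linalg.norm(g)

grads = [np.array([1.0, 0.0]),            # node 1: grad f1 = (1, 0)
         np.array([0.0, np.sqrt(3.0)])]   # node 2: grad f2 = (0, sqrt(3))

x = np.zeros(2)  # both nodes start at the same consensus point
for t in range(2000):
    gamma = 1.0 / (t + 2)
    # each node takes a local Frank-Wolfe step, then the nodes average exactly
    local = [(1 - gamma) * x + gamma * lmo_unit_ball(g) for g in grads]
    x = sum(local) / 2

print(x)  # approx (-0.5, -0.5): the red point, which is NOT stationary
true_stationary = -np.array([1.0, np.sqrt(3.0)]) / 2  # (-1/2, -sqrt(3)/2)
assert np.allclose(x, [-0.5, -0.5], atol=1e-2)
assert not np.allclose(x, true_stationary, atol=1e-1)
```

Because each node's gradient is constant, the local oracles always return (−1, 0) and (0, −1), so averaging pulls the iterates to their midpoint scaled onto (−1/2, −1/2) regardless of the step-size schedule.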
New Decentralized Quantized Stochastic Frank-Wolfe Algorithm

In this section, we propose a novel efficient Decentralized Quantized Stochastic Frank-Wolfe (DQSFW) algorithm to solve problem (1) by using the gradient tracking technique (Xu et al. 2015; Wai et al. 2017). The DQSFW algorithm is given in Algorithm 1.

Algorithm 1 Decentralized Quantized Stochastic Frank-Wolfe (DQSFW)
Input: restricted domain Ω, mixing matrix W, initial point X̂_0 = X_0 ∈ Ω
Parameters: η_t, γ_t, β_t, α_t, T
Output: x̄_t̂, where t̂ is chosen uniformly from {0, 1, ..., T}
 1: On the i-th node:
 2: for t = 0, 1, ..., T − 1 do
 3:   Compute an estimate g_t^{(i)} of the gradient
 4:   Update v_t^{(i)} = (1 − β_t)v_{t−1}^{(i)} + β_t g_t^{(i)} + α_t Σ_j w_ij (v̂_{t−1}^{(j)} − v̂_{t−1}^{(i)})
 5:   Compute q_t^{(i)} = C(v_t^{(i)} − v̂_{t−1}^{(i)}) and communicate it with neighbors
 6:   Update the replica v̂_t^{(j)} = v̂_{t−1}^{(j)} + q_t^{(j)} for each neighbor j
 7:   Compute the linear oracle d_t^{(i)} = argmax_{d∈Ω} ⟨d, −v_t^{(i)}⟩
 8:   Update x_{t+1}^{(i)} = (1 − η_t)x_t^{(i)} + η_t d_t^{(i)} + γ_t Σ_j w_ij (x̂_t^{(j)} − x̂_t^{(i)})
 9:   Compute z_t^{(i)} = C(x_{t+1}^{(i)} − x̂_t^{(i)}) and communicate it with neighbors
10:   Update the replica x̂_{t+1}^{(j)} = x̂_t^{(j)} + z_t^{(j)} for each neighbor j
11: end for

In Algorithm 1, x_t^{(i)} is a column vector that denotes the model parameter on the i-th node at iteration t. We use the upper case X_t to represent the matrix

    X_t = [x_t^{(1)}, x_t^{(2)}, ..., x_t^{(M)}].

Inspired by the Choco-Gossip algorithm (Koloskova, Stich, and Jaggi 2019), we also define a replica x̂_t^{(i)} of x_t^{(i)} on each node. The reason is that when we apply the gossip update, the exact values of the model parameters on other nodes are unknown, since there is quantization and the communication is inexact. The replica x̂_t^{(i)} is an estimate of x_t^{(i)}, which is also updated at each iteration, and the consensus step is formulated as line 8 in Algorithm 1. According to the update of x̂_t^{(i)} (lines 9 and 10 in Algorithm 1), on all neighbors of the i-th node the replica is incremented by an identical transmitted message z_t^{(i)}, which implies that the values of x̂_t^{(i)} on all neighbors of the i-th node are the same. Therefore, the replica x̂_t^{(i)} is well-defined. Similar to X_t, we also define the matrices

    X̂_t = [x̂_t^{(1)}, x̂_t^{(2)}, ..., x̂_t^{(M)}],
    X̄_t = [x̄_t, x̄_t, ..., x̄_t],

where x̄_t represents the mean value x̄_t = (1/M) Σ_{i=1}^{M} x_t^{(i)}.

g_t^{(i)} is a stochastic gradient on the i-th node calculated from selected samples, and v_t^{(i)} is our key estimate of the gradient on the i-th node, defined in Eq. (6) with a kind of momentum scheme. Here v̂_t^{(i)} is the replica of v_t^{(i)} (see the similar concept x̂_t^{(i)}). For initialization, we set v̂_{−1}^{(i)} = 0 and v_{−1}^{(i)} = g_0^{(i)}. The definition and role of β_t will be discussed later in Remark 2. Our convergence analysis shows that although the gradient estimator in line 4 is biased, the gradients on all nodes uniformly approach the full gradient. To make consensus on the gradient and the parameter, we adopt the gossip update (Koloskova, Stich, and Jaggi 2019) in lines 4 and 8 respectively:

    v_t^{(i)} = (1 − β_t)v_{t−1}^{(i)} + β_t g_t^{(i)} + α_t Σ_j w_ij (v̂_{t−1}^{(j)} − v̂_{t−1}^{(i)})    (6)

In lines 5 and 9 of Algorithm 1, we apply a gradient quantization method that satisfies the Compressor Assumption. As mentioned previously, the quantization scheme is not limited to the signSGD compressor used in this paper. Line 7 is the typical linear oracle in the Frank-Wolfe method, yielding a direction d_t^{(i)}. In the vanilla Frank-Wolfe algorithm, the update of x_t^{(i)} would be x_{t+1}^{(i)} = x_t^{(i)} + η_t(d_t^{(i)} − x_t^{(i)}). For convenience, we define the matrices

    V_t = [v_t^{(1)}, v_t^{(2)}, ..., v_t^{(M)}],
    V̂_t = [v̂_t^{(1)}, v̂_t^{(2)}, ..., v̂_t^{(M)}],
    V̄_t = [v̄_t, v̄_t, ..., v̄_t],
    D_t = [d_t^{(1)}, d_t^{(2)}, ..., d_t^{(M)}],
    D̄_t = [d̄_t, d̄_t, ..., d̄_t],

where v̄_t and d̄_t are the mean values

    v̄_t = (1/M) Σ_{i=1}^{M} v_t^{(i)},  d̄_t = (1/M) Σ_{i=1}^{M} d_t^{(i)}.

By the doubly stochastic property of W, we have

    X̄_{t+1} = (1 − η_t)X̄_t + η_t D̄_t    (7)

It is easy to verify that if x_0 ∈ Ω, then x̄_t ∈ Ω for all t. Hence the constraint is always satisfied. Note that we do not have to store all the replicas in practice. We can regard Σ_j w_ij (x̂_t^{(j)} − x̂_t^{(i)}) as a single term and obtain the iteration formula

    Σ_j w_ij (x̂_{t+1}^{(j)} − x̂_{t+1}^{(i)}) = Σ_j w_ij (x̂_t^{(j)} − x̂_t^{(i)}) + Σ_j w_ij (z_t^{(j)} − z_t^{(i)}).    (8)

Therefore, we only need one buffer of the size of x_t to compute this term, and similarly for Σ_j w_ij (v̂_t^{(j)} − v̂_t^{(i)}). We will use Eq. (8) to save memory in our experiments.
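A minimal single-machine simulation of one DQSFW iteration (lines 3–10 of Algorithm 1) might look as follows. This is a sketch under simplifying assumptions: the linear oracle is instantiated for an l2 ball rather than a general Ω, the compressor is the scaled sign operator of Eq. (5), and all M local states are held in one process as rows of arrays:

```python
import numpy as np

def compressor(x):
    """Scaled sign compressor of Eq. (5): C(x) = (||x||_1 / d) * sign(x)."""
    d = x.size
    return (np.abs(x).sum() / d) * np.sign(x)

def lmo(v, radius=1.0):
    """Linear oracle over an l2 ball: argmax_{||d|| <= r} <d, -v>."""
    n = np.linalg.norm(v)
    return -radius * v / n if n > 0 else np.zeros_like(v)

def dqsfw_step(x, x_hat, v, v_hat, grads, W, eta, gamma, beta, alpha):
    """One iteration of Algorithm 1 for all M nodes at once.
    x, v: (M, dim) local states; x_hat, v_hat: (M, dim) quantized replicas;
    grads: (M, dim) stochastic gradients g_t^(i); W: (M, M) mixing matrix."""
    M = x.shape[0]
    # line 4: gradient tracking with gossip on the v-replicas;
    # row i of (W @ v_hat - v_hat) is sum_j w_ij (v_hat^(j) - v_hat^(i))
    v_new = (1 - beta) * v + beta * grads + alpha * (W @ v_hat - v_hat)
    # lines 5-6: quantize the innovation, update replicas everywhere
    q = np.array([compressor(v_new[i] - v_hat[i]) for i in range(M)])
    v_hat_new = v_hat + q
    # lines 7-8: linear oracle and gossip step on the parameters
    d = np.array([lmo(v_new[i]) for i in range(M)])
    x_new = (1 - eta) * x + eta * d + gamma * (W @ x_hat - x_hat)
    # lines 9-10: quantize the parameter innovation, update replicas
    z = np.array([compressor(x_new[i] - x_hat[i]) for i in range(M)])
    x_hat_new = x_hat + z
    return x_new, x_hat_new, v_new, v_hat_new
```

Because W is doubly stochastic, the gossip terms have zero column-mean, so the averaged iterate follows exactly the recursion x̄_{t+1} = (1 − η_t)x̄_t + η_t d̄_t of Eq. (7).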
Convergence Analysis

In this section, we study the convergence properties of our DQSFW algorithm. All proofs can be found in the Supplementary Material. We begin by introducing some mild assumptions and the definition of the Frank-Wolfe gap (Jaggi 2013):

    G(x) = max_{v∈Ω} ⟨v − x, −∇f(x)⟩    (9)

The convergence criterion is E[G(x)] ≤ ε.

Assumption 1. (Lipschitz Gradient) There is a constant L such that for all i ∈ {1, 2, ..., M},

    ‖∇f_i(x) − ∇f_i(y)‖ ≤ L‖x − y‖.    (10)

Assumption 2. (Compact Domain) The domain Ω has diameter D.

Assumption 3. (Lower Bound) The function f(x) has a lower bound inf_{x∈Ω} f(x) = f^− > −∞.

Assumption 4. (Spectral Gap) Given the symmetric doubly stochastic matrix W, let λ_1, λ_2, ..., λ_M be its eigenvalues in descending order. Then max{|λ_2|, |λ_M|} < 1. Let ρ = max{|λ_2|, |λ_M|} and ζ = 1 − λ_M.

Assumption 5. (Compressor Assumption) The compressor C(·) satisfies ‖C(x) − x‖² ≤ (1 − δ)‖x‖², where 0 < δ ≤ 1.

Assumption 6. (Bounded Gradient and Bounded Variance) The stochastic gradient estimator g_t^{(i)} satisfies E[g_t^{(i)}] = ∇f_i(x_t^{(i)}), E‖g_t^{(i)} − ∇f_i(x_t^{(i)})‖² ≤ σ², and ‖∇F^{(i)}(x_t^{(i)}; ξ)‖ ≤ G.

Based on the above assumptions, we present three lemmas for our DQSFW algorithm. Lemmas 1 and 2 estimate the consensus of the model parameter X_t and the gradient V_t, respectively. Lemma 3 implies that our gradient estimator Eq. (6) on different nodes approaches the value of the full gradient uniformly and gradually. The detailed proofs of the lemmas can be found in supplementary material section A.

Lemma 1. Let δ_0 = 1 − √(1 − δ²) and δ_1 = 1 − (1 − δ_0²)². Set α_t = γ_t = γ = min{1, δ_0/ζ, (1 − ρ)δ_1/(2ζ²)}, c_1 = (1 − ρ)γ, c_2 = δ, c_3 = min{(1 − ρ)γ/2, δ_1/2}, A = (1 + c_1)(1 − (1 − ρ)γ)² + (1 − δ)(1 + 1/c_2)γ²ζ², B = (1 + 1/c_1)γ²ζ² + (1 − δ)(1 + c_2)(1 + γζ)², and Q = 8(1 + 1/c_3)(A + B). Set η_0 = c_3³ / (16(1 + c_3)(A + B)) and η_t = η_0/(t + 1)^θ with 0 < θ < 1. Then there exists a constant R_1 satisfying

    ‖X_t − X̄_t‖_F² + ‖X_t − X̂_t‖_F² ≤ Q R_1 M D² / (t + 1)^{2θ}.

Lemma 2. Let c_4 = c_3/2, β_0 = A c_4 / (B(1 + c_4)), and β_t = β_0/(t + 1)^{2θ/3}. Then there exists a constant R_2 such that

    ‖V_t − V̄_t‖_F² + ‖V_t − V̂_t‖_F² ≤ Q R_2 M G² / (t + 1)^{2θ/3}.

Lemma 3. Denote v̄_t = (1/M) Σ_{i=1}^{M} v_t^{(i)}. There exists a constant S such that

    (1/M) Σ_{i=1}^{M} E‖v̄_t − ∇f_i(x_t^{(i)})‖² ≤ S / (t + 1)^{2θ/3}.

Next, we present the main theorems of our convergence analysis. Please check the detailed proofs in supplementary material section B.

Theorem 1. Let Q, R_1, R_2 and S be the constants defined in Lemmas 1 to 3, and let the step size η_t be set as in Lemma 1. Then we have

    E[G(x̄_t)] ≤ E[(f(x̄_t) − f(x̄_{t+1}))/η_t] + D√(2(S + Q R_1 L² D²)) / (t + 1)^{θ/3}
                + D√(Q R_2) G / (t + 1)^{θ/3} + η_t L² D² / 2.

Theorem 2. Suppose T iterations have been completed, and let t̂ be chosen uniformly at random from {0, 1, ..., T}. Set θ = 3/4. Then by Theorem 1 we obtain

    E[G(x̄_t̂)] = O(1 / T^{1/4}).

Remark 1. Theorem 2 shows that our DQSFW algorithm reaches a gradient complexity of O(ε^{-4}) to achieve an ε-accurate stationary point. Moreover, the Frank-Wolfe gap is asymptotic to 0, rather than to a neighborhood whose size depends on ε. This is because all parameters in our algorithm are independent of ε, while in SFW the step size and the number of iterations are functions of ε.

Remark 2. θ = 3/4 is the best trade-off between consensus and step size. If β_t is too large, the noise of quantization and the variance of the stochastic gradient cause bad consensus and thus affect convergence. If β_t is too small, the step size must also be small; otherwise the averaged gradient cannot catch up with the changes of x, which causes slow convergence. This trade-off is the challenge and the intuition behind defining our gradient estimator as Eq. (6).

Remark 3. From the proof in the supplementary material, we can see that the theoretical framework works not only for signSGD but for all compressors that satisfy Assumption 5.

Experimental Results

To validate the efficiency of our new DQSFW algorithm, we conduct experiments on two constrained machine learning applications: matrix completion and model compression.

Decentralized Low-Rank Matrix Completion

Low-rank matrix completion is a model for solving a broad range of learning tasks, such as collaborative filtering (Koren, Bell, and Volinsky 2009) and multi-label learning (Xu, Jin, and Zhou 2013). The loss function of the low-rank matrix completion problem has the following form:

    min_{X∈R^{M×N}} Σ_{(i,j)∈Ω} φ(X_ij − Y_ij),  s.t. ‖X‖_* ≤ C,

where φ: R → R is a potentially non-convex empirical loss function, Y is the target matrix, and Ω is the set of observed entries. (Wai et al. 2017) also conducts this experiment with the MSE loss function and a robust Gaussian loss function.
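For a norm-ball constraint, the Frank-Wolfe gap of Eq. (9) has a closed form, since max over the ball of ⟨v, −g⟩ is the ball radius times the dual norm of g. A small sketch for the l2 ball, using the gradients from the earlier counterexample to check that the gap vanishes exactly at the stationary point (the radius and test points are illustrative):

```python
import numpy as np

def fw_gap_l2_ball(x, grad, radius=1.0):
    """Frank-Wolfe gap G(x) = max_{v in Omega} <v - x, -grad> for Omega an l2 ball.
    The maximizer is v* = -radius * grad / ||grad||, so
    G(x) = radius * ||grad|| + <x, grad>."""
    return radius * np.linalg.norm(grad) + x @ grad

# Counterexample gradient direction (1, sqrt(3)); the stationary point of
# f(x, y) = x + sqrt(3) y on the unit ball is x* = -(1, sqrt(3))/2.
g = np.array([1.0, np.sqrt(3.0)])
x_star = -g / 2                      # on the boundary, gap should be 0
print(fw_gap_l2_ball(x_star, g))     # 0.0 (up to floating point)

# At the vanilla scheme's limit (-1/2, -1/2), the gap stays strictly positive.
x_bad = np.array([-0.5, -0.5])
print(fw_gap_l2_ball(x_bad, g))      # positive: not stationary
```

This is the quantity that Theorem 2 drives to zero in expectation at the rate O(1/T^{1/4}).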
Figure 2: The experimental results of decentralized low-rank matrix completion for the datasets MovieLens 100k and MovieLens 1m. Figures (a), (b) and (c) show the training loss with respect to bits transferred on MovieLens 100k with 20, 40 and 60 workers respectively. Figures (d), (e) and (f) show the training loss with respect to bits transferred on MovieLens 1m with 20, 40 and 60 workers respectively.

    Name            User   Movie  Record
    MovieLens 100k  943    1682   100000
    MovieLens 1m    6040   3952   1000209

Table 2: Descriptions of the MovieLens datasets.

In our experiment, to verify that our algorithm works well for non-convex objective functions, we adopt the robust Gaussian loss function

    φ(z) = (σ_0/2)(1 − exp(−z²/σ_0))    (11)

In our experiment, the parameter σ_0 is fixed to 2. We run our experiment on two benchmark datasets, MovieLens 100k and MovieLens 1m (Harper and Konstan 2015). Both datasets are records of movie ratings from many users, and they are usually used to train recommendation systems. The descriptions of these two datasets are shown in Table 2. MovieLens 100k has 943 users, 1682 movies, and 100000 rating records. MovieLens 1m has 6040 users, 3952 movies, and 1000209 rating records. All ratings vary from 0 to 5; we scale them to the interval [0, 1]. The rating records can be converted into a matrix, where the row represents the user id and the column represents the movie id. Each record serves as an observation. As our purpose is to verify the performance of the optimization algorithm, we use all the data for training.

For both datasets, we deploy our experiment on M = 20, 40, 60 MPI worker nodes respectively via mpi4py. Each node is an Intel Xeon E5-2660 machine within an infiniband network. We assign 1/M of the rating records to each worker. For MovieLens 100k we set C = 2000, while for MovieLens 1m we set C = 5000.

In this task, the linear oracle can be obtained by singular value decomposition (SVD). Let the SVD of v_t^{(i)} be U·S·V^T. Then the linear oracle is d = −C·u·v^T, where u and v are the singular vectors corresponding to the largest singular value (also called the leading vectors of the SVD). In practice, we only need to compute the leading vectors, while in projection-based algorithms we would have to compute the complete SVD.

We choose two other projection-free methods, DeFW (Wai et al. 2017) and QFW (Zhang et al. 2019), as baseline methods. For decentralized algorithms, we use a ring-based topology as the communication network because it is convenient to implement and achieves linear speedup in communication (Lian et al. 2017). For QFW, the s of the s-partition encoding is set to 1. For all three algorithms, we set the step size η_t = t^{−0.75}. The results of low-rank matrix completion on MovieLens 100k and MovieLens 1m are shown in Figure 2. As in much previous quantization work (including QFW), we analyze the experimental results with respect to bits transferred, which means the number of bits sent or received on the busiest node. For decentralized algorithms it can be any node, and for the centralized algorithm it is the master node. The number of bits is divided by the size of x, since it is always proportional to the size of x.
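The trace-norm linear oracle described above needs only the leading singular pair, which can be obtained by a few power iterations rather than a full SVD. A sketch with numpy (the iteration count, seeds, and matrix sizes are arbitrary choices for illustration):

```python
import numpy as np

def lmo_trace_norm(V, C, iters=200):
    """Linear oracle for {X : ||X||_* <= C}: argmax_X <X, -V> = -C * u v^T,
    where (u, v) are the leading singular vectors of V.
    Power iteration on V^T V finds v; then u = V v / ||V v||."""
    rng = np.random.default_rng(0)
    v = rng.standard_normal(V.shape[1])
    for _ in range(iters):
        v = V.T @ (V @ v)
        v /= np.linalg.norm(v)
    u = V @ v
    u /= np.linalg.norm(u)
    return -C * np.outer(u, v)

# Sanity check against the full SVD on a small random matrix.
rng = np.random.default_rng(1)
V = rng.standard_normal((8, 5))
D = lmo_trace_norm(V, C=2.0)
U, S, Vt = np.linalg.svd(V)
D_svd = -2.0 * np.outer(U[:, 0], Vt[0])
assert np.allclose(np.abs(D), np.abs(D_svd), atol=1e-4)
# D attains the maximum of <X, -V> over the trace-norm ball: C * sigma_max(V).
assert np.isclose(np.sum(D * -V), 2.0 * S[0], atol=1e-4)
```

This mirrors the observation in the text: the Frank-Wolfe step only needs the top singular pair, whereas a projection onto the trace-norm ball would require the complete SVD.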
[Figure 3: The experimental result of decentralized model compression for the VGG11 neural network on the CIFAR 10 dataset. Figure (a) shows the training loss against time. Figure (b) shows the training loss against the number of bits transferred per node.]

The number is divided by the size of x since it is always proportional to the size of x. From the experimental results we can see that our DQSFW algorithm achieves the best performance on both datasets. Moreover, our algorithm becomes more competitive as the number of workers grows, which verifies the scalability of our method. With more workers, more sample gradients can be computed, but the number of gradients computed by DeFW stays the same.

Model Compression

Deep neural networks (DNNs) have achieved remarkable performance in many fields in recent years, but one of their drawbacks is the high cost of a large number of parameters. Thus, many attempts have been made to reduce the number of parameters in DNNs, such as dropout and pruning. Among these methods, one solution is to add a constraint on the parameters to force them to be sparse (Liu et al. 2015; Bietti et al. 2019). In (Bietti et al. 2019), one popular method was proposed via adding the spectral norm constraint

‖W_l‖_2 ≤ τ    (12)

for each layer l. Because model compression is attracting increasing attention in both machine learning research and applications, in this experiment we solve the model compression problem in the decentralized learning setting and validate the performance of different decentralized quantized algorithms on this task. Here the linear oracle is d = −τ · U · Vᵀ, where U · S · Vᵀ is the SVD of v_t^{(i)}. This result follows easily from the fact that the trace norm and the spectral norm are dual norms.

In our experiment, we run this task to compress the VGG11 network on the CIFAR 10 dataset, which has 50000 training samples and 10 labels, under constraint (12). We conduct this task in a decentralized setting where the data are distributed on different nodes to verify our algorithms. Following (Bietti et al. 2019), we use the cross-entropy loss function as the criterion and set τ = 0.8. The experiment is implemented on 8 GTX1080 GPUs with PyTorch. Each GPU is treated as a single worker, and the communication is based on NVIDIA NCCL. We consider DeFW and QFW as baseline methods and a ring-based topology as the communication network. For DeFW and our DQSFW, the decentralized system is a uniform-weight ring network. For QFW, s is set to 1. For all three algorithms, the step size is chosen as η_t = (1/2) · t^{−0.75}. Because of the limitation of CUDA memory, we cannot compute the full gradient for DeFW; we calculate 1/5 of the full gradient instead. This issue also indicates a limitation of the DeFW algorithm.

The experimental results are visualized in Figure 3. To validate the efficiency of our algorithm, we compare the loss with respect to the bits transmitted. Similar to the matrix completion experiment, the number of bits transferred is divided by the size of the parameter. For decentralized algorithms, the number is counted on any node, while for the centralized algorithm it is counted on the master node. According to the results, DeFW is almost infeasible for this task. In terms of time, QFW and DQSFW have similar performance. In terms of bits transferred, our DQSFW performs best among the three algorithms, which verifies the superior performance of our new algorithm.

Conclusion

In this paper, we proposed a new Decentralized Quantized Stochastic Frank-Wolfe (DQSFW) algorithm to solve the nonconvex constrained optimization problem. We revealed a potential divergence problem that is likely to occur in general decentralized training, not just for Frank-Wolfe type methods, and provided a solution by reaching consensus on the gradient. We derived a new theoretical analysis to prove that our algorithm achieves the same gradient complexity O(ε^{−4}) as the Stochastic Frank-Wolfe (SFW) method with much less communication cost, and that the Frank-Wolfe gap converges asymptotically to 0. The experimental results on two machine learning applications, matrix completion and deep neural network compression, validate the superior performance of our new algorithm.
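The spectral-norm linear oracle used in the model compression experiment above can be sketched in a few lines of NumPy. This is an illustrative sketch under our own assumptions, not the paper's implementation; the function name `spectral_ball_lmo` is hypothetical.

```python
import numpy as np


def spectral_ball_lmo(grad, tau):
    """Linear minimization oracle for the spectral-norm ball {D : ||D||_2 <= tau}.

    Solves min_D <grad, D> subject to ||D||_2 <= tau. Because the trace
    (nuclear) norm is the dual of the spectral norm, the minimizer is
    d = -tau * U @ V^T, where U S V^T is the SVD of grad.
    Note: function name is an illustrative choice, not from the paper.
    """
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return -tau * (u @ vt)
```

In the VGG11 setting, `grad` would play the role of the local gradient estimate v_t^{(i)} for one layer's weight matrix and `tau` the constraint level 0.8; the returned direction attains the inner-product value −τ‖grad‖_* since ⟨U S Vᵀ, U Vᵀ⟩ = tr(S).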
Acknowledgements

We thank the anonymous reviewers for their helpful comments. We also thank the IT Help Desk at the University of Pittsburgh. This work was partially supported by NSF IIS 1845666, 1852606, 1838627, 1837956, 1956002, 2040588.

References

Alistarh, D.; Grubic, D.; Li, J. Z.; Tomioka, R.; and Vojnovic, M. 2017. QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding. Neural Information Processing Systems.

Bernstein, J.; Wang, Y.; Azizzadenesheli, K.; and Anandkumar, A. 2018a. signSGD: Compressed Optimization for Non-Convex Problems. arXiv:1802.04434.

Bernstein, J.; Zhao, J.; Azizzadenesheli, K.; and Anandkumar, A. 2018b. signSGD with Majority Vote is Communication Efficient And Fault Tolerant. arXiv:1810.05291.

Bietti, A.; Mialon, G.; Chen, D.; and Mairal, J. 2019. A Kernel Perspective for Regularizing Deep Neural Networks. International Conference on Machine Learning.

Doan, T. T.; Maguluri, S. T.; and Romberg, J. 2018. Fast Convergence Rates of Distributed Subgradient Methods with Adaptive Quantization. arXiv:1810.13245.

Ghadimi, S.; Lan, G.; and Zhang, H. 2016. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming 155(1-2): 267–305.

Harper, F. M.; and Konstan, J. A. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems 5(4): 1–19.

Huang, F.; Chen, S.; and Huang, H. 2019. Faster Stochastic Alternating Direction Method of Multipliers for Nonconvex Optimization. In International Conference on Machine Learning, 2839–2848.

Jaggi, M. 2013. Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization. In International Conference on Machine Learning, 427–435.

Karimireddy, S. P.; Rebjock, Q.; Stich, S. U.; and Jaggi, M. 2019. Error Feedback Fixes SignSGD and other Gradient Compression Schemes. International Conference on Machine Learning.

Koloskova, A.; Lin, T.; Stich, S. U.; and Jaggi, M. 2020. Decentralized Deep Learning with Arbitrary Communication Compression. International Conference on Learning Representations.

Koloskova, A.; Stich, S. U.; and Jaggi, M. 2019. Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication. International Conference on Machine Learning.

Konečný, J.; McMahan, H. B.; Ramage, D.; and Richtárik, P. 2016. Federated Optimization: Distributed Machine Learning for On-Device Intelligence. arXiv:1610.02527.

Koren, Y.; Bell, R.; and Volinsky, C. 2009. Matrix factorization techniques for recommender systems. Computer 42(8): 30–37.

Lian, X.; Zhang, C.; Zhang, H.; Hsieh, C.-J.; Zhang, W.; and Liu, J. 2017. Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent. Neural Information Processing Systems.

Lian, X.; Zhang, W.; Zhang, C.; and Liu, J. 2018. Asynchronous Decentralized Parallel Stochastic Gradient Descent. International Conference on Machine Learning.

Liu, B.; Wang, M.; Foroosh, H.; Tappen, M.; and Penksy, M. 2015. Sparse Convolutional Neural Networks. Conference on Computer Vision and Pattern Recognition.

Locatello, F.; Yurtsever, A.; Fercoq, O.; and Cevher, V. 2019. Stochastic Frank-Wolfe for Composite Convex Minimization. In Advances in Neural Information Processing Systems, 14269–14279.

Nazari, P.; Tarzanagh, D. A.; and Michailidis, G. 2019. DADAM: A Consensus-based Distributed Adaptive Gradient Method for Online Optimization. arXiv:1901.09109.

Nedić, A.; Olshevsky, A.; and Shi, W. 2017. Achieving Geometric Convergence for Distributed Optimization over Time-varying Graphs. SIAM Journal on Optimization 27(4): 2597–2633.

Nedić, A.; and Ozdaglar, A. 2009. Distributed Subgradient Methods for Multi-Agent Optimization. IEEE Transactions on Automatic Control 54(1): 48–61.

Ram, S. S.; Nedić, A.; and Veeravalli, V. V. 2010. Distributed stochastic subgradient projection algorithms for convex optimization. Journal of Optimization Theory and Applications 147(3): 516–545.

Reddi, S. J.; Sra, S.; Póczós, B.; and Smola, A. 2016. Stochastic Frank-Wolfe Methods for Nonconvex Optimization. arXiv:1607.08254.

Reisizadeh, A.; Mokhtari, A.; Hassani, H.; and Pedarsani, R. 2019a. An Exact Quantized Decentralized Gradient Descent Algorithm. IEEE Transactions on Signal Processing 67(19): 4934–4947.

Reisizadeh, A.; Taheri, H.; Mokhtari, A.; Hassani, H.; and Pedarsani, R. 2019b. Robust and Communication-Efficient Collaborative Learning. Neural Information Processing Systems.

Seide, F.; Fu, H.; Droppo, J.; Li, G.; and Yu, D. 2014. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association.

Tang, H.; Lian, X.; Qiu, S.; Yuan, L.; Zhang, C.; Zhang, T.; and Liu, J. 2019. DeepSqueeze: Decentralization Meets Error-Compensated Compression. arXiv:1907.07346.

Tang, H.; Lian, X.; Yan, M.; Zhang, C.; and Liu, J. 2018. D2: Decentralized Training over Decentralized Data. International Conference on Machine Learning.
Wai, H.-T.; Lafond, J.; Scaglione, A.; and Moulines, E. 2017. Decentralized Frank–Wolfe algorithm for convex and nonconvex problems. IEEE Transactions on Automatic Control 62(11): 5522–5537.

Xu, J.; Zhu, S.; Soh, Y. C.; and Xie, L. 2015. Augmented distributed gradient methods for multi-agent optimization under uncoordinated constant stepsizes. In 2015 54th IEEE Conference on Decision and Control (CDC), 2055–2060. IEEE.

Xu, M.; Jin, R.; and Zhou, Z.-H. 2013. Speedup matrix completion with side information: Application to multi-label learning. In Advances in Neural Information Processing Systems, 2301–2309.

Yan, F.; Sundaram, S.; Vishwanathan, S.; and Qi, Y. 2012. Distributed autonomous online learning: Regrets and intrinsic privacy-preserving properties. IEEE Transactions on Knowledge and Data Engineering 25(11): 2483–2493.

Zhang, M.; Chen, L.; Mokhtari, A.; Hassani, H.; and Karbasi, A. 2019. Quantized Frank-Wolfe: Faster Optimization, Lower Communication, and Projection Free. arXiv:1902.06332.