Towards Heterogeneous Clients with Elastic Federated Learning


Zichen Ma 1,2, Yu Lu 1,2, Zihan Lu 3, Wenye Li 1, Jinfeng Yi 2 and Shuguang Cui 1
1 The Chinese University of Hong Kong, Shenzhen
2 JD AI Lab
3 Ping An Technology

arXiv:2106.09433v1 [cs.LG] 17 Jun 2021

Abstract

Federated learning involves training machine learning models over devices or data silos, such as edge processors or data warehouses, while keeping the data local. Training in heterogeneous and potentially massive networks introduces bias into the system, which originates from the non-IID data and the low participation rate encountered in reality. In this paper, we propose Elastic Federated Learning (EFL), an unbiased algorithm to tackle the heterogeneity in the system, which makes the most informative parameters less volatile during training and utilizes the incomplete local updates. It is an efficient and effective algorithm that compresses both upstream and downstream communications. Theoretically, the algorithm has a convergence guarantee when training on non-IID data at a low participation rate. Empirical experiments corroborate the competitive performance of the EFL framework in terms of robustness and efficiency.

1   Introduction

Federated learning (FL) has been an attractive distributed machine learning paradigm where participants jointly learn a global model without data sharing [McMahan et al., 2017a]. It embodies the principles of focused collection and data minimization, and can mitigate many of the systemic privacy risks and costs resulting from traditional, centralized machine learning [Kairouz et al., 2019]. While there are plenty of works on federated optimization, bias in the system still remains a key challenge. The bias originates from (i) statistical heterogeneity, i.e., data that are not independent and identically distributed (IID) across clients, and (ii) the low participation rate caused by limited computing and communication resources, e.g., network condition, battery, processors, etc.

Existing FL methods let participants perform several local updates and let the server abandon straggling clients, which alleviates the communication burden. The popular algorithm, FedAvg [McMahan et al., 2017a], first allows clients to perform a small number of epochs of local stochastic gradient descent (SGD); the clients that complete their work then communicate their model updates back to the server, and stragglers are abandoned.

While there are many variants of FedAvg that have shown empirical success in non-IID settings, these algorithms do not fully address bias in the system. The solutions are sub-optimal, as they either employ a small shared global subset of data [Zhao et al., 2018] or a greater number of models with increased communication costs [Karimireddy et al., 2020b; Li et al., 2018; Li et al., 2019]. Moreover, to the best of our knowledge, previous models do not consider the low participation rate, which may restrict the potential availability of training datasets and weaken the applicability of the system.

In this paper, we develop Elastic Federated Learning (EFL), an unbiased algorithm that aims to tackle the statistical heterogeneity and the low participation rate challenge.

The contributions of the paper are as follows. Firstly, EFL is robust to the non-IID data setting. It incorporates an elastic term into the local objective to improve the stability of the algorithm, and makes the most informative parameters, identified by the Fisher information matrix, less volatile. Theoretically, we provide convergence guarantees for the algorithm.

Secondly, even when the system runs at a low participation rate, i.e., many clients may be inactive or return incomplete updates, EFL still converges. It utilizes the partial information by scaling the corresponding aggregation coefficient. We show that the low participation rate does not prevent convergence, but the tolerance to it diminishes as learning proceeds.

Thirdly, the proposed EFL is a communication-efficient algorithm that compresses both upstream and downstream communications. We provide a convergence analysis of the compressed algorithm as well as extensive empirical results on different datasets. The algorithm requires both fewer gradient evaluations and fewer communicated bits to converge.

2   Related Work

Federated Optimization. Recently we have witnessed significant progress in developing novel methods that address different challenges in FL; see [Kairouz et al., 2019; Li et al., 2020a]. In particular, there have been several works on various aspects of FL, including preserving the privacy of users [Duchi et al., 2014; McMahan et al., 2017b; Agarwal et al., 2018; Zhu et al., 2020] and lowering communication cost
[Reisizadeh et al., 2020; Dai et al., 2019; Basu et al., 2019; Li et al., 2020b]. Several works develop algorithms for the homogeneous setting, where the data samples of all users are sampled from the same probability distribution [Stich, 2018; Wang and Joshi, 2018; Zhou and Cong, 2017; Lin et al., 2018]. More related to our paper, several works study statistical heterogeneity of users' data samples in FL [Zhao et al., 2018; Sahu et al., 2018; Karimireddy et al., 2020b; Haddadpour and Mahdavi, 2019; Li et al., 2019; Khaled et al., 2020], but the solutions are not optimal, as they either violate privacy requirements or increase the communication burden.

Lifelong Learning. The problem is defined as learning separate tasks sequentially using a single model without forgetting the previously learned tasks. In this context, several popular approaches have been proposed, such as data distillation [Parisi et al., 2018], model expansion [Rusu et al., 2016; Draelos et al., 2017], and memory consolidation [Soltoggio, 2015; Shin et al., 2017]. A particularly successful one is EWC [Kirkpatrick et al., 2017], a method to aid the sequential learning of tasks.

To draw an analogy between federated learning and lifelong learning, we consider the problem of learning a model on each client in the non-IID setting as a separate learning problem. In this sense, it is natural to use similar tools to alleviate the bias challenge. While the two paradigms share a common main challenge in some contexts, learning tasks in lifelong learning are carried out serially rather than in parallel, and each task is seen only once, whereas there is no such limitation in federated learning.

Communication-efficient Distributed Learning. A wide variety of methods have been proposed to reduce the amount of communication in distributed machine learning. The substantial existing research focuses on (i) communication delay, which reduces the communication frequency by performing local optimization [Konečný et al., 2016; McMahan et al., 2017a]; (ii) sparsification, which reduces the entropy of updates by restricting changes to only a small subset of parameters [Aji and Heafield, 2017; Tsuzuku et al., 2018]; and (iii) dense quantization, which reduces the entropy of the weight updates by restricting all updates to a reduced set of values [Alistarh et al., 2017; Bernstein et al., 2018].

Out of all the above-listed methods, only FedAvg and signSGD compress both upstream and downstream communications. All other methods are of limited utility in the FL setting, as they leave communications from the server to clients uncompressed.

3   Elastic Federated Learning

3.1   Problem Formulation

EFL is designed to mitigate the heterogeneity in the system, which originates from the non-IID data across clients and the low participation rate. In particular, the aim is to minimize

    \min_{\omega} F(\omega) = \sum_{k=1}^{N} p^k \tilde{F}_k(\omega),          (1)

where N is the number of participants and p^k denotes the probability that the k-th client is selected. Here ω represents the parameters of the model, and F̃_k(ω) is the local objective of the k-th client.

Assume there are at most T rounds. In the τ-th round, the clients are connected via a central aggregating server and seek to optimize the following objective locally:

    \tilde{F}_{\tau,k}(\omega) = f_k(\omega) + \frac{\lambda}{2} \sum_{i=1}^{N} (\omega - \omega^i_{\tau-1})^{\top} \mathrm{diag}(I_{\tau-1,i}) (\omega - \omega^i_{\tau-1}),          (2)

where f_k(ω) is the local empirical risk over all available samples at the k-th client, and ω^i_{τ−1} is the model parameters of the i-th client in the (τ−1)-th round. I_{τ−1,i} = I(ω^i_{τ−1}) is the Fisher information matrix, i.e., the negative expected Hessian of the log-likelihood function, and diag(I_{τ−1,i}) is the matrix that preserves only the diagonal of the Fisher information matrix; the term penalizes the parts of the parameters that are too volatile within a round.

Algorithm 1 EFL. N clients are indexed by k; B is the local mini-batch size, p^k is the probability that the client is selected, η_τ is the learning rate, E is the maximum number of time steps each round has, and 0 ≤ s^k_τ ≤ E is the number of local updates the client completes in the τ-th round.

Server executes:
    initialize ω_0; each client k is selected with probability p^k; R^G_0, R^k_0 ← 0
    for each round τ = 1, 2, ... do
        S_τ ← (random subset of N clients)
        for each client k ∈ S_τ in parallel do
            Δω^k_{τE}, u^k_τ, v^k_τ ← ClientUpdate(k, ST(Δω^G_{(τ−1)E}), Σ_k u^k_{τ−1}, Σ_k v^k_{τ−1})
            Δω^G_{τE} = R^G_{(τ−1)E} + Σ_k p^k_τ Δω^k_{τE}
            R^G_{τE} = Δω^G_{τE} − ST(Δω^G_{τE})
            ω^G_{(τ+1)E} = ω^G_{τE} + Δω^G_{τE}
        end for
        return ST(Δω^G_{τE}), Σ_k u^k_τ, Σ_k v^k_τ to participants
    end for

ClientUpdate(k, ST(Δω^G_{(τ−1)E}), Σ_k u^k_{τ−1}, Σ_k v^k_{τ−1}):
    ξ ← split local data into batches of size B
    for batch ξ ∈ B do
        ω^k_{τE} = ω^k_{(τ−1)E} + Δω^G_{(τ−1)E}
        for j = 0, ..., s^k_τ − 1 do
            ω^k_{τE+j+1} = ω^k_{τE+j} − η_τ g^k_{τE+j}
        end for
        Δω^k_{τE} = R^k_{(τ−1)E} + ω^k_{τE+s^k_τ} − ω^k_{τE}
        R^k_{τE} = Δω^k_{τE} − ST(Δω^k_{τE})
        u^k_τ = diag(I_{τ,k})
        v^k_τ = diag(I_{τ,k}) ω^k_{τE+s^k_τ}
    end for
    return Δω^k_{τE}, u^k_τ, v^k_τ to the server
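The per-client quantities u^k_τ = diag(I_{τ,k}) and v^k_τ returned by ClientUpdate require only the diagonal of the Fisher information. The snippet below is a minimal numpy sketch of that computation for a binary logistic regression model, chosen for self-containment; for a neural network the same diagonal is typically estimated with autograd as the mean squared gradient of the log-likelihood. The function name and the toy data are ours, not the paper's code.

    import numpy as np

    def fisher_diag_logreg(w, X, y):
        """Empirical diagonal Fisher information for binary logistic regression.

        Estimates diag(I(w)) as the average of the squared per-sample gradients
        of the log-likelihood, i.e. E[(d log p(y|x;w) / dw_i)^2].
        """
        p = 1.0 / (1.0 + np.exp(-X @ w))          # predicted P(y = 1 | x)
        per_sample_grad = (y - p)[:, None] * X    # d log p(y|x;w)/dw, one row per sample
        return np.mean(per_sample_grad ** 2, axis=0)

    # toy usage on synthetic client data
    rng = np.random.default_rng(0)
    X = rng.normal(size=(256, 10))
    w_true = rng.normal(size=10)
    y = (rng.random(256) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)

    w_local = 0.5 * w_true                    # stand-in for the client's current weights
    u_k = fisher_diag_logreg(w_local, X, y)   # plays the role of u^k_tau = diag(I_{tau,k})
    v_k = u_k * w_local                       # plays the role of v^k_tau = diag(I_{tau,k}) * omega^k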
We propose to add the elastic term (the second term of Equation (2)) to the local subproblem to restrict the change of the most informative parameters. It alleviates the bias that originates from the non-IID data and stabilizes the training. Equation (2) can be further rearranged as

    \tilde{F}_{\tau,k}(\omega) = f_k(\omega) + \frac{\lambda}{2} \omega^{\top} \sum_{i=1}^{N} \mathrm{diag}(I_{\tau-1,i}) \omega - \lambda \omega^{\top} \sum_{i=1}^{N} \mathrm{diag}(I_{\tau-1,i}) \omega^i_{\tau-1} + Z,          (3)

where Z is a constant. Let u^k_{τ−1} = diag(I_{τ−1,k}) and v^k_{τ−1} = diag(I_{τ−1,k}) ω^k_{τ−1}, so that only the sums Σ_k u^k_{τ−1} and Σ_k v^k_{τ−1} need to be broadcast in Algorithm 1.
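With the rearrangement in Equation (3), a client only needs the broadcast sums Σ_i u^i_{τ−1} and Σ_i v^i_{τ−1} to evaluate its elastic term. The sketch below shows how those sums could enter the local objective and its gradient; the function names and the stand-in local_loss are illustrative, not the paper's implementation.

    import numpy as np

    def elastic_penalty(w, u_sum, v_sum, lam):
        """Elastic term of Eq. (3), dropping the constant Z.

        u_sum plays the role of sum_i diag(I_{tau-1,i}) and v_sum that of
        sum_i diag(I_{tau-1,i}) * omega^i_{tau-1}, both received from the server.
        """
        return 0.5 * lam * np.dot(w, u_sum * w) - lam * np.dot(w, v_sum)

    def elastic_penalty_grad(w, u_sum, v_sum, lam):
        """Gradient of the elastic term, added to the local gradient in each SGD step."""
        return lam * (u_sum * w - v_sum)

    def local_objective(w, local_loss, u_sum, v_sum, lam):
        """F~_{tau,k}(w) = f_k(w) + elastic penalty (up to the constant Z)."""
        return local_loss(w) + elastic_penalty(w, u_sum, v_sum, lam)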
Suppose ω* is the minimizer of the global objective F, and denote by F̃*_k the optimal value of F̃_k. We further quantify the degree to which the data at the k-th client are distributed differently from those at the other clients as D_k = F̃_k(ω*) − F̃*_k, and let D = Σ_{k=1}^N p^k D_k. We consider discrete time steps t = 0, 1, .... Model weights are aggregated and synchronized when t is a multiple of E, i.e., each round consists of E time steps. In the τ-th round, EFL, presented in Algorithm 1, executes the following steps.

Firstly, the server broadcasts the compressed latest global weight update ST(Δω^G_{(τ−1)E}) together with Σ_k u^k_{τ−1} and Σ_k v^k_{τ−1} to the participants. Each client then updates its local weights: ω^k_{τE} = ω^k_{(τ−1)E} + Δω^G_{(τ−1)E}.

Secondly, each client runs SGD on its local objective F̃_k for j = 0, ..., s^k_τ − 1:

    \omega^k_{\tau E+j+1} = \omega^k_{\tau E+j} - \eta_\tau g^k_{\tau E+j},          (4)

where η_τ is a learning rate that decays with τ, 0 ≤ s^k_τ ≤ E is the number of local updates the client completes in the τ-th round, g^k_t = ∇F̃_k(ω^k_t, ξ^k_t) is the stochastic gradient of the k-th client, and ξ^k_t is a mini-batch sampled from client k's local data. ḡ^k_t = ∇F̃_k(ω^k_t) is the full-batch gradient at client k, so that ḡ^k_t = E_{ξ^k_t}[g^k_t]. The local update is Δω^k_{τE} = R^k_{(τ−1)E} + ω^k_{τE+s^k_τ} − ω^k_{τE}, where each client computes the residual as

    R^k_{\tau E} = \Delta\omega^k_{\tau E} - ST(\Delta\omega^k_{\tau E}).          (5)

ST(·) is the compression method presented in Algorithm 2. The client sends the compressed local update ST(Δω^k_{τE}), u^k_τ, and v^k_τ back to the coordinator.
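The client-side round of Equations (4)-(5) can be summarized in a few lines. The sketch below is ours: grad_fn stands in for a stochastic gradient oracle of F̃_k, compress stands in for ST(·), and the weights are numpy arrays; it only illustrates the error-feedback residual bookkeeping, not a full training loop.

    import numpy as np

    def client_round(w_prev, global_delta, residual, grad_fn, lr, num_local_steps, compress):
        """One client round: apply the broadcast update, run s^k_tau local SGD steps,
        then upload the compressed update and keep the compression error locally."""
        w = w_prev + global_delta             # omega^k_{tauE} = omega^k_{(tau-1)E} + Delta omega^G_{(tau-1)E}
        w_start = w.copy()
        for _ in range(num_local_steps):      # j = 0, ..., s^k_tau - 1
            w = w - lr * grad_fn(w)           # Eq. (4): one local SGD step on the elastic objective
        delta = residual + (w - w_start)      # Delta omega^k_{tauE}, with the previous residual folded in
        sent = compress(delta)                # ST(Delta omega^k_{tauE}): what is actually uploaded
        new_residual = delta - sent           # Eq. (5): error kept on the device for the next round
        return sent, new_residual, w

    # toy usage with a quadratic local objective; the identity stands in for ST here
    w0 = np.zeros(4)
    sent, resid, w_new = client_round(
        w_prev=w0, global_delta=np.zeros(4), residual=np.zeros(4),
        grad_fn=lambda w: w - np.ones(4), lr=0.1, num_local_steps=3,
        compress=lambda t: t)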
Thirdly, the server aggregates the next global weights as

    \omega^G_{(\tau+1)E} = \omega^G_{\tau E} + \Delta\omega^G_{\tau E}
                         = \omega^G_{\tau E} + R^G_{(\tau-1)E} + \sum_{k=1}^{N} p^k_\tau \Delta\omega^k_{\tau E}
                         = \omega^G_{\tau E} + R^G_{(\tau-1)E} - \sum_{k=1}^{N} p^k_\tau \sum_{j=0}^{s^k_\tau - 1} \eta_\tau g^k_{\tau E+j},          (6)

where R^G_{τE} = Δω^G_{τE} − ST(Δω^G_{τE}).
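The server mirrors the same error-feedback trick: it keeps the full, uncompressed global update for its own state and only broadcasts the compressed version. A minimal sketch under those assumptions, with compress again standing in for ST(·) and client_deltas being the compressed updates received in Equation (6):

    def server_round(w_global, server_residual, client_deltas, coeffs, compress):
        """One server aggregation step following Eq. (6).

        client_deltas[k] is the compressed update received from client k and
        coeffs[k] its aggregation coefficient p^k_tau; all values are numpy arrays.
        """
        delta_g = server_residual + sum(p * d for p, d in zip(coeffs, client_deltas))
        w_global = w_global + delta_g          # omega^G_{(tau+1)E} = omega^G_{tauE} + Delta omega^G_{tauE}
        broadcast = compress(delta_g)          # ST(Delta omega^G_{tauE}) goes back to the clients
        new_residual = delta_g - broadcast     # R^G_{tauE}: the part the compression dropped
        return w_global, new_residual, broadcast

The design choice to keep the residual on the server (rather than discarding the compression error) is what lets the accumulated updates remain unbiased over rounds.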
As mentioned in Section 1, a low participation rate of clients is common in a real federated learning system. EFL mainly focuses on two situations that lead to the low participation rate and that have not yet been well discussed previously: (i) incomplete clients that can only submit partially complete updates, and (ii) inactive clients that cannot respond to the server.

The client k is inactive in the τ-th round if s^k_τ = 0, i.e., it does not perform any local training, and the client k is incomplete if 0 < s^k_τ < E. s^k_τ is a random variable that can follow an arbitrary distribution, and it can generally be time-varying, i.e., it may follow different distributions at different time steps. EFL also allows the aggregation coefficient p^k_τ to vary with τ, and in the next subsection we explore different schemes of choosing p^k_τ and their impacts on the model convergence.

EFL also incorporates sparsification and quantization to compress both the upstream (from clients to the server) and the downstream (from the server to clients) communications. It is not economical to communicate only the fraction of largest elements at full precision, as regular top-k sparsification does [Aji and Heafield, 2017]. As a result, EFL quantizes the remaining top-k elements of the sparsified update to the mean population magnitude, leaving a ternary tensor containing {−µ, 0, µ}, which is summarized in Algorithm 2.

Algorithm 2 Compression Method ST. q is the sparsity, tensor T ∈ R^n, T̂ ∈ {−µ, 0, µ}^n.

ST(T):
    k = max(nq, 1); e = top_k(|T|)
    mask = (|T| ≥ e) ∈ {0, 1}^n; T^mask = mask × T
    µ = (1/k) Σ_{i=1}^n |T^mask_i|
    T̂ = µ × sign(T^mask)
    return T̂
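For concreteness, here is a small numpy sketch of Algorithm 2 under the definitions above; the helper name st_compress and the toy usage are ours. Ties at the threshold are kept, exactly as the mask |T| ≥ e prescribes.

    import numpy as np

    def st_compress(t, q=0.01):
        """Sketch of Algorithm 2: keep the top-k entries of t by magnitude,
        then quantize them to a ternary tensor {-mu, 0, mu}."""
        flat = t.ravel()
        k = max(int(t.size * q), 1)
        threshold = np.sort(np.abs(flat))[-k]          # e = top_k(|T|): the k-th largest magnitude
        mask = (np.abs(flat) >= threshold).astype(flat.dtype)
        t_mask = mask * flat                           # T^mask = mask * T
        mu = np.sum(np.abs(t_mask)) / k                # mu = (1/k) * sum_i |T^mask_i|
        return (mu * np.sign(t_mask)).reshape(t.shape) # T_hat = mu * sign(T^mask)

    # toy usage: the output only takes values in {-mu, 0, mu}
    rng = np.random.default_rng(0)
    update = rng.normal(size=1000)
    compressed = st_compress(update, q=0.05)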
3.2   Convergence Analysis

Five assumptions are made to help analyze the convergence behavior of the EFL algorithm.

Assumption 1. (L-smoothness) F̃_1, ..., F̃_N are L-smooth, and F is also L-smooth.

Assumption 2. (Strong convexity) F̃_1, ..., F̃_N are µ-strongly convex, and F is also µ-strongly convex.

Assumption 3. (Bounded variance) The variance of the stochastic gradients is bounded by E_ξ ||g^k_t − ḡ^k_t||² ≤ σ²_k.

Assumption 4. (Bounded gradient) The expected squared norm of the stochastic gradients at each client is bounded by E_ξ ||g^k_t||² ≤ G².

Assumption 5. (Bounded aggregation coefficient) The aggregation coefficient has an upper bound, which is given by p^k_τ ≤ θ p^k.

Assume that E[p^k_τ], E[p^k_τ s^k_τ], E[(p^k_τ)² s^k_τ], and E[Σ_{k=1}^N p^k_τ − 2 + Σ_{k=1}^N p^k_τ s^k_τ] exist for all rounds τ and clients k, and that E[Σ_{k=1}^N p^k_τ s^k_τ] ≠ 0. The convergence bound can then be derived as follows.
Theorem 1. Under Assumptions 1 to 5, for the learning rate

    \eta_\tau = \frac{16E}{\mu((\tau+1)E+\gamma)\,\mathbb{E}[\sum_{k=1}^{N} p^k_\tau s^k_\tau]},

EFL satisfies

    \mathbb{E}\|\omega^G_{\tau E} - \omega^*\|^2 \le \frac{C_\tau}{(\tau E+\gamma)^2} + \frac{H_\tau J}{\tau E+\gamma},          (7)

where

    \gamma = \max\Big\{ \frac{4E\theta}{\min_\tau \mathbb{E}[\sum_{k=1}^{N} p^k_\tau s^k_\tau]^2},\; \frac{32E(1+\theta)L}{\mu \min_\tau \mathbb{E}[\sum_{k=1}^{N} p^k_\tau s^k_\tau]} \Big\},

H_τ = Σ_{t=0}^{τ−1} E[r_t], with r_t ∈ {0, 1} indicating whether the ratio E[p^k_t s^k_t]/p^k has the same value for all k,

    C_\tau = \max\Big\{ \gamma^2 \mathbb{E}\|\omega^G_0 - \omega^*\|^2,\; \Big(\frac{16E}{\mu}\Big)^2 \sum_{t=0}^{\tau-1} \frac{\mathbb{E}[B_t]}{(\mathbb{E}[\sum_{k=1}^{N} p^k_t s^k_t])^2} \Big\},

    B_t = \sum_{k=1}^{N} (p^k_t)^2 s^k_t \sigma_k^2 + 2(2+\theta)L \sum_{k=1}^{N} p^k_t s^k_t D_k + \Big(2+\frac{\mu}{2(1+\theta)L}\Big)E(E-1)G^2 \Big(\sum_{k=1}^{N} p^k_t s^k_t + \theta\Big(\sum_{k=1}^{N} p^k_t - 2\Big) + \sum_{k=1}^{N} p^k_t s^k_t\Big) + 2EG^2 \sum_{k=1}^{N} \frac{(p^k_t)^2}{p^k} s^k_t,

and

    J = \max_\tau \Big\{ \frac{32E \sum_{k=1}^{N} \mathbb{E}[p^k_\tau s^k_\tau] D_k}{\mu\, \mathbb{E}[\sum_{k=1}^{N} p^k_\tau s^k_\tau]} \Big\}.

Based on Theorem 1, C_τ = O(τ), which means that ω^G_{τE} finally converges to the global optimum as τ → ∞ if H_τ increases sub-linearly with τ. Table 1 summarizes the required number of communication rounds of SCAFFOLD [Karimireddy et al., 2020b], MIME [Karimireddy et al., 2020a], VRL-SGD [Liang et al., 2019], FedAMP [Huang et al., 2021], and EFL. The proposed EFL algorithm achieves a tighter bound compared with the methods that assume µ-strong convexity. The proof of Theorem 1 is given in the supplementary material.

Table 1: Number of communication rounds required to reach ε-accuracy. SC refers to strongly convex, NC refers to non-convex, and δ in MIME bounds the Hessian dissimilarity. EFL preserves the optimal statistical rate (the first term in SCAFFOLD) while improving the optimization terms.

    Algorithm   Bounded gradient   Convexity   # Com. rounds
    SCAFFOLD    ✓                  µ-SC        G²/(µSε) + G/(µ√ε) + L/µ
    MIME        ✓                  µ-SC        G²/(µSε) + δ/µ
    VRL-SGD     ✗                  NC          Nσ²/(Sε²) + N/ε
    FedAMP      ✓                  NC          G²/(LSε) + G/ε^{3/2} + L²/ε
    EFL         ✓                  µ-SC        G²/(µSε) + L/(µ√ε)

4   Impacts of Irregular Clients

In this section, we investigate the impacts of clients' different behaviors, including being inactive, being incomplete, new client arrival, and client departure.

4.1   Inactive Client

If there exist inactive clients, the convergence rate changes to O(Σ_{t=0}^{τ−1} y_t J / (τE + E²)), where y_t indicates whether there are inactive clients in the t-th round. The term converges to zero if Σ_{t=0}^{τ−1} y_t grows sub-linearly with τ, which means that a mild degree of inactivity does not prevent convergence. In reality, a client can frequently become inactive due to limited resources, and permanently removing such a client may improve the model performance. Specifically, we remove the client if the system without this client reaches a smaller training loss when training terminates at the deadline T.

Suppose a client a is inactive with probability 0 < y^a < 1 in each round, let f_0(τ) be the convergence bound if we keep the client, and let f_1(τ) be the bound if it is abandoned at τ_0. For f_0, with sufficiently many steps the first term in Equation (7) shrinks to zero and the second term converges to y^a J; thus f_0 ≈ y^a J. We can also obtain f_1(τ) = C̃_τ / (E(τ − τ_0) + γ̃)² for some C̃_τ and γ̃. Thus we have

Corollary 1. An inactive client a should be abandoned if

    y^a J > f_1(T).          (8)

Assuming C_τ ≈ C̃_τ = τC and γ ≈ γ̃, i.e., the removed client does not significantly affect the overall SGD variance and the degree of non-IID data, Equation (8) can be formulated as

    y^a > O\Big(\frac{C}{TEJ}\Big).          (9)

From Corollary 1, the more epochs a client trains locally, the more sensitive it is to inactivity.

4.2   Incomplete Client

Based on Theorem 1, the convergence bound is controlled by the expectation of p^k_τ and its functions. EFL allows clients to upload partial work with the adaptive weight p^k_τ = E p^k / s^k_τ. It assigns a greater aggregation coefficient to clients that complete fewer local epochs, and this turns out to guarantee convergence in the non-IID setting. The resulting convergence bound is

    O\Big(\frac{E^5 \sum_{t=0}^{\tau-1}\sum_{k} p^k \mathbb{E}[1/s^k_t] + E^2 \sum_{t=0}^{\tau-1}\sum_{k} (p^k \sigma_k)^2 \mathbb{E}[1/s^k_t]}{(\tau E + E^2)^2}\Big).

The reason for enlarging the aggregation coefficient lies in Equation (6): increasing p^k_τ is equivalent to increasing the learning rate of client k. By assigning a greater aggregation coefficient to clients that complete fewer epochs, these clients effectively run further in each local step, compensating for the missing epochs.
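As an illustration of this weighting rule (not the authors' code), the sketch below computes p^k_τ = E p^k / s^k_τ for a round in which some clients finished only part of their local work:

    import numpy as np

    def aggregation_coefficients(p, s, E):
        """Sketch of the adaptive weights p^k_tau = E * p^k / s^k_tau from Section 4.2.

        p[k] is the client's base probability p^k and s[k] the number of local updates it
        actually finished this round; s[k] = 0 marks an inactive client, which simply
        contributes nothing.
        """
        p = np.asarray(p, dtype=float)
        s = np.asarray(s, dtype=float)
        coeffs = np.zeros_like(p)
        active = s > 0
        coeffs[active] = E * p[active] / s[active]   # fewer completed epochs -> larger coefficient
        return coeffs

    # toy usage: three clients, one straggler, one inactive
    print(aggregation_coefficients(p=[0.4, 0.4, 0.2], s=[5, 2, 0], E=5))
    # the straggler (s = 2) gets the largest coefficient; the inactive client gets 0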
4.3   Client Departure

If the k-th client quits at τ_0 < T, no more updates will be received from it, and s^k_τ = 0 for all τ > τ_0. As a result, the value of the ratio E[p^k_τ s^k_τ]/p^k differs across clients k, and r_τ = 1 for all τ > τ_0. According to Theorem 1, ω^G_{τE} then cannot converge to the global optimum ω*, since H_T ≥ T − τ_0. Intuitively, a client should contribute sufficiently many updates in order
for its features to be captured by the trained model in the non-IID setting. After a client leaves, the remaining training steps retain less and less memory of it as more rounds are run. Thus, the model may not be applicable to the leaving client, especially when it leaves early in the training (τ_0 ≪ T). This indicates that we may discard a departing client if we cannot guarantee that the trained model performs well on it, and the earlier a client leaves, the more likely it should be discarded.

However, removing the departing client (client a) from the training pushes the original learning objective F = Σ_{k=1}^N p^k F̃_k towards the new one F̂ = Σ_{k=1, k≠a}^N p^k F̃_k, and the optimal weight ω* also shifts to some ω̂* that minimizes F̂. There exists a gap between these two optima, which adds an additional term to the convergence bound obtained in Theorem 1, and a sufficient number of updates is required for ω^G_{τE} to converge to the new optimum ω̂*.

4.4   Client Arrival

The same argument holds when a new client joins the training, which requires changing the original global objective to include the loss on the new client's data. The learning rate also needs to be increased when the objective changes. Intuitively, if the shift happens at a large time τ_0, where ω^G_{τE} approaches the old optimum ω* and η_{τ_0} is close to zero, reducing the remaining difference ||ω^G_{τE} − ω̂*||² ≈ ||ω* − ω̂*||² with such a small learning rate is inapplicable. Thus, a greater learning rate should be adopted, which is equivalent to initiating a fresh start after the shift, and more updating rounds are still needed to fully address the new client.

We also bound the additional term caused by the objective shift as follows.

Theorem 2. For the global objective shift F → F̂ and ω* → ω̂*, let D̂_k = F_k(ω̂*) − F*_k quantify the degree of non-IID data with respect to the new objective. If the client a quits the system,

    \|\omega^* - \hat{\omega}^*\|^2 \le \frac{8L n_a^2 \hat{D}_a}{\mu^2 n^2}.          (10)

If the client a joins the system,

    \|\omega^* - \hat{\omega}^*\|^2 \le \frac{8L n_a^2 \hat{D}_a}{\mu^2 (n+n_a)^2},          (11)

where n is the total number of samples before the shift.

It can be concluded that the bound decreases when the data become more IID, and when the changed client owns fewer data samples.
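To make the dependence on the changed client's size explicit, one can write N_a for the total number of samples in the configuration that contains client a (N_a = n for a departure and N_a = n + n_a for an arrival); this is only a rearrangement of Equations (10) and (11), not a new result:

    \|\omega^* - \hat{\omega}^*\|^2 \;\le\; \frac{8L\hat{D}_a}{\mu^2}\left(\frac{n_a}{N_a}\right)^2 .

Both bounds therefore scale with the square of the changed client's sample fraction, which is why the shift of the optimum is small when that client owns only a few samples.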
5   Experiments

In this section, we first demonstrate the effectiveness and efficiency of EFL in the non-IID data setting and compare it with several baseline algorithms. We then show the robustness of EFL to the low participation rate challenge.

5.1   Experimental Settings

Both convex and non-convex models are evaluated on a number of benchmark datasets of federated learning. Specifically, we adopt the MNIST [LeCun et al., 1998] and EMNIST [Cohen et al., 2017] datasets with Resnet50 [He et al., 2016], the CIFAR100 dataset [Krizhevsky et al., 2009] with the VGG11 network [Simonyan and Zisserman, 2014], the Shakespeare dataset with an LSTM [McMahan et al., 2017a] to predict the next character, the Sentiment140 dataset [Go et al., 2009] with an LSTM to classify sentiment, and a synthetic dataset with a linear regression classifier.

Our experiments are conducted on the TensorFlow platform running on a Linux server. For reference, the statistics of the datasets, the implementation details, and the anonymized code are summarized in the supplementary material.
5.2   Effects of Non-IID Data

We run experiments with a simplified version of the well-studied 11-layer VGG11 network, which we train on the CIFAR100 dataset in a federated learning setup using 100 clients. For the IID setting, we split the training data randomly into equally sized shards and assign one shard to every client. For the non-IID(m) setting, we assign every client samples from exactly m classes of the dataset. We also perform experiments with Resnet50, which we train on the EMNIST dataset under the same federated learning environment. Both models are trained using SGD.
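For reference, the non-IID(m) split described above can be produced along the following lines; this is a generic sketch of such a partition, not necessarily the authors' exact script, and the function name is ours.

    import numpy as np

    def partition_noniid(labels, num_clients, m, seed=0):
        """Sketch of the non-IID(m) split: every client receives samples drawn from
        exactly m classes. The class-to-client assignment below is one simple choice."""
        rng = np.random.default_rng(seed)
        classes = np.unique(labels)
        client_indices = [[] for _ in range(num_clients)]
        # give each client m randomly chosen classes
        client_classes = [rng.choice(classes, size=m, replace=False) for _ in range(num_clients)]
        for c in classes:
            idx = np.flatnonzero(labels == c)
            rng.shuffle(idx)
            owners = [k for k in range(num_clients) if c in client_classes[k]]
            if not owners:                    # class assigned to nobody: skip it in this sketch
                continue
            for k, shard in zip(owners, np.array_split(idx, len(owners))):
                client_indices[k].extend(shard.tolist())
        return client_indices

    # toy usage: 100 clients, 2 classes each, CIFAR100-style labels
    labels = np.random.default_rng(0).integers(0, 100, size=50_000)
    splits = partition_noniid(labels, num_clients=100, m=2)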
Figure 1 shows the convergence comparison in terms of gradient evaluations for the two models using different algorithms. FedProx [Li et al., 2018] incorporates a proximal term into the local objective to improve the model performance on non-IID data, SCAFFOLD [Karimireddy et al., 2020b] adopts control variates to alleviate the effects of data heterogeneity, and APFL [Deng et al., 2020] learns personalized local models to mitigate heterogeneous data on clients.

We observe that while all methods achieve comparably fast convergence in terms of gradient evaluations on IID data, they suffer considerably in the non-IID setting. From left to right, as the data become more non-IID, convergence becomes worse for FedProx, and it can diverge in some cases. SCAFFOLD and APFL exhibit some ability to alleviate the data heterogeneity but are not stable during training. As this trend can also be observed for Resnet50 on EMNIST, we conclude that the performance loss originating from the non-IID data is not unique to a particular model.

Figure 1: Testing Accuracy vs. Communication Rounds for VGG11 on CIFAR100 and Resnet50 on EMNIST in a distributed setting for IID and non-IID data. In the non-IID cases, every client only holds examples from exactly m classes of the dataset. All methods suffer from degraded convergence speed in the non-IID situation, but EFL is affected by far the least.

To better illustrate the effectiveness of the proposed algorithm, we further evaluate and compare EFL with the state-of-the-art algorithms FedGATE [Haddadpour et al., 2021], VRL-SGD [Liang et al., 2019], APFL [Deng et al., 2020], and FedAMP [Huang et al., 2021] on the MNIST, CIFAR100, Sentiment140, and Shakespeare datasets. The performance of all methods is evaluated by the best mean testing accuracy (BMTA) in percentage, where the mean testing accuracy is the average of the testing accuracy over all participants. For each dataset, we apply a non-IID data setting.

Table 2 shows the BMTA of all methods under the non-IID data setting, which is not easy for the vanilla FedAvg algorithm. On the challenging CIFAR100 dataset, VRL-SGD is unstable and performs catastrophically because the models are destroyed to the point that the customized gradient updates in the method cannot recover them. APFL and FedAMP train personalized models to alleviate the non-IID data; however, the performance of APFL is still damaged by unstable training. FedGATE, FedAMP, and EFL achieve comparably good performance on all datasets.

Table 2: BMTA (%) in the non-IID data setting.

    Methods    MNIST    CIFAR100    Sent140    Shakes.
    FedAvg     98.30     2.27       59.14      51.35
    FedGATE    99.15    80.94       68.84      54.71
    VRL-SGD    98.86     2.81       68.62      52.33
    APFL       98.49    77.19       68.81      55.27
    FedAMP     99.06    81.17       69.01      58.42
    EFL        99.10    81.38       68.95      60.49

5.3   Effects of the Elastic Term

EFL utilizes the incomplete local updates, which means that clients may perform different amounts of local work s^k_τ; this parameter, together with the elastic term scaled by λ, affects the performance of the algorithm. However, s^k_τ is determined by each client's own constraints, i.e., it is a client-specific parameter, so EFL can only set the maximum number of local epochs to prevent local models from drifting too far away from the global model, and tune the best λ. Intuitively, a proper λ restricts the optimization trajectory by limiting the change of the most informative parameters, and guarantees convergence.

We explore the impact of the elastic term by setting different values of λ, and investigate whether the maximum number of local epochs influences the convergence behavior of the
algorithm. Figure 2 shows the performance comparison on different datasets using different models. We compare EFL with λ = 0 against EFL with the best λ. For all datasets, it can be observed that an appropriate λ increases the stability of unstable configurations and can force divergent ones to converge, and it also increases the accuracy in most cases. As a result, setting λ > 0 is particularly useful in the non-IID setting, which indicates that EFL benefits practical federated settings.

Figure 2: The first row shows the Testing Accuracy vs. Communication Rounds comparison and the second row shows the Training Loss vs. Communication Rounds comparison in the non-IID setting. EFL with the elastic term stabilizes and improves the convergence of the algorithm.

5.4   Robustness of EFL

Finally, in Figure 3, we demonstrate that EFL is robust to the low participation rate. In particular, we track the convergence speed of the LSTM trained on the Sentiment140 and Shakespeare datasets. It can be observed that reducing the participation rate has negative effects on all methods. The causes of these negative effects, however, are different. In FedAvg, the actual participation rate is determined by the number of clients that finish the complete training process, because it does not include the incomplete updates. This can steer the optimization process away from the minimum and might even cause catastrophic forgetting. On the other hand, a low participation rate reduces the convergence speed of EFL by causing the clients' residuals to go out of sync and increasing the gradient staleness. The more rounds a client has to wait before it is selected to participate in training again, the more outdated its accumulated gradients become.

Figure 3: Testing Accuracy vs. Communication Rounds comparison among different algorithms at a low participation rate. EFL utilizes incomplete updates from stragglers and is robust to the low participation rate.

6   Conclusion

In this paper, we propose EFL, an unbiased FL algorithm that adapts to the statistical diversity issue by making the most informative parameters less volatile. EFL can be understood as an alternative paradigm for fair FL, which tackles the bias that originates from the non-IID data and the low participation rate. Theoretically, we provide convergence guarantees for EFL when training on non-IID data at a low participation rate. Empirically, experiments support the competitive performance of the algorithm in terms of robustness and efficiency.
References

[Agarwal et al., 2018] Naman Agarwal, Ananda Theertha Suresh, Felix Yu, Sanjiv Kumar, and H Brendan McMahan. cpSGD: Communication-efficient and differentially-private distributed SGD. arXiv preprint arXiv:1805.10559, 2018.

[Aji and Heafield, 2017] Alham Fikri Aji and Kenneth Heafield. Sparse communication for distributed gradient descent. arXiv preprint arXiv:1704.05021, 2017.

[Alistarh et al., 2017] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pages 1709-1720, 2017.

[Basu et al., 2019] Debraj Basu, Deepesh Data, Can Karakus, and Suhas Diggavi. Qsparse-local-SGD: Distributed SGD with quantization, sparsification, and local computations. arXiv preprint arXiv:1906.02367, 2019.

[Bernstein et al., 2018] Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Anima Anandkumar. signSGD: Compressed optimisation for non-convex problems. arXiv preprint arXiv:1802.04434, 2018.

[Cohen et al., 2017] Gregory Cohen, Saeed Afshar, Jonathan Tapson, and Andre Van Schaik. EMNIST: Extending MNIST to handwritten letters. In 2017 International Joint Conference on Neural Networks (IJCNN), pages 2921-2926. IEEE, 2017.

[Dai et al., 2019] Xinyan Dai, Xiao Yan, Kaiwen Zhou, Han Yang, Kelvin KW Ng, James Cheng, and Yu Fan. Hyper-sphere quantization: Communication-efficient SGD for federated learning. arXiv preprint arXiv:1911.04655, 2019.

[Deng et al., 2020] Yuyang Deng, Mohammad Mahdi Kamani, and Mehrdad Mahdavi. Adaptive personalized federated learning. arXiv preprint arXiv:2003.13461, 2020.

[Draelos et al., 2017] Timothy J Draelos, Nadine E Miner, Christopher C Lamb, Jonathan A Cox, Craig M Vineyard, Kristofor D Carlson, William M Severa, Conrad D James, and James B Aimone. Neurogenesis deep learning: Extending deep networks to accommodate new classes. In 2017 International Joint Conference on Neural Networks (IJCNN), pages 526-533. IEEE, 2017.

[Duchi et al., 2014] John C Duchi, Michael I Jordan, and Martin J Wainwright. Privacy aware learning. Journal of the ACM (JACM), 61(6):1-57, 2014.

[Go et al., 2009] Alec Go, Richa Bhayani, and Lei Huang. Twitter sentiment classification using distant supervision. CS224N project report, Stanford, 1(12):2009, 2009.

[Haddadpour and Mahdavi, 2019] Farzin Haddadpour and Mehrdad Mahdavi. On the convergence of local descent methods in federated learning. arXiv preprint arXiv:1910.14425, 2019.

[Haddadpour et al., 2021] Farzin Haddadpour, Mohammad Mahdi Kamani, Aryan Mokhtari, and Mehrdad Mahdavi. Federated learning with compression: Unified analysis and sharp guarantees. In International Conference on Artificial Intelligence and Statistics, pages 2350-2358. PMLR, 2021.

[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.

[Huang et al., 2021] Yutao Huang, Lingyang Chu, Zirui Zhou, Lanjun Wang, Jiangchuan Liu, Jian Pei, and Yong Zhang. Personalized cross-silo federated learning on non-IID data. In Proceedings of the AAAI Conference on Artificial Intelligence, 2021.

[Kairouz et al., 2019] Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Keith Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019.

[Karimireddy et al., 2020a] Sai Praneeth Karimireddy, Martin Jaggi, Satyen Kale, Mehryar Mohri, Sashank J Reddi, Sebastian U Stich, and Ananda Theertha Suresh. Mime: Mimicking centralized stochastic algorithms in federated learning. arXiv preprint arXiv:2008.03606, 2020.

[Karimireddy et al., 2020b] Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. SCAFFOLD: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning, pages 5132-5143. PMLR, 2020.

[Khaled et al., 2020] Ahmed Khaled, Konstantin Mishchenko, and Peter Richtárik. Tighter theory for local SGD on identical and heterogeneous data. In International Conference on Artificial Intelligence and Statistics, pages 4519-4529. PMLR, 2020.

[Kirkpatrick et al., 2017] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521-3526, 2017.

[Konečný et al., 2016] Jakub Konečný, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.

[Krizhevsky et al., 2009] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

[LeCun et al., 1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

[Li et al., 2018] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. arXiv preprint arXiv:1812.06127, 2018.

[Li et al., 2019] Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang. On the convergence of FedAvg on non-IID data. arXiv preprint arXiv:1907.02189, 2019.

[Li et al., 2020a] Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine, 37(3):50-60, 2020.

[Li et al., 2020b] Zhize Li, Dmitry Kovalev, Xun Qian, and Peter Richtárik. Acceleration for compressed gradient descent in distributed and federated optimization. arXiv preprint arXiv:2002.11364, 2020.

[Liang et al., 2019] Xianfeng Liang, Shuheng Shen, Jingchang Liu, Zhen Pan, Enhong Chen, and Yifei Cheng. Variance reduced local SGD with lower communication complexity. arXiv preprint arXiv:1912.12844, 2019.

[Lin et al., 2018] Tao Lin, Sebastian U Stich, Kumar Kshitij Patel, and Martin Jaggi. Don't use large mini-batches, use local SGD. arXiv preprint arXiv:1808.07217, 2018.

[McMahan et al., 2017a] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273-1282. PMLR, 2017.

[McMahan et al., 2017b] H Brendan McMahan, Daniel Ramage, Kunal Talwar, and Li Zhang. Learning differentially private recurrent language models. arXiv preprint arXiv:1710.06963, 2017.

[Parisi et al., 2018] German I Parisi, Jun Tani, Cornelius Weber, and Stefan Wermter. Lifelong learning of spatiotemporal representations with dual-memory recurrent self-organization. Frontiers in Neurorobotics, 12:78, 2018.

[Reisizadeh et al., 2020] Amirhossein Reisizadeh, Aryan Mokhtari, Hamed Hassani, Ali Jadbabaie, and Ramtin Pedarsani. FedPAQ: A communication-efficient federated learning method with periodic averaging and quantization. In International Conference on Artificial Intelligence and Statistics, pages 2021-2031. PMLR, 2020.

[Rusu et al., 2016] Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.

[Sahu et al., 2018] Anit Kumar Sahu, Tian Li, Maziar Sanjabi, Manzil Zaheer, Ameet Talwalkar, and Virginia Smith. On the convergence of federated optimization in heterogeneous networks. arXiv preprint arXiv:1812.06127, 3, 2018.

[Shin et al., 2017] Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay. In Advances in Neural Information Processing Systems, pages 2990-2999, 2017.

[Simonyan and Zisserman, 2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[Soltoggio, 2015] Andrea Soltoggio. Short-term plasticity as cause-effect hypothesis testing in distal reward learning. Biological Cybernetics, 109(1):75-94, 2015.

[Stich, 2018] Sebastian U Stich. Local SGD converges fast and communicates little. arXiv preprint arXiv:1805.09767, 2018.

[Tsuzuku et al., 2018] Yusuke Tsuzuku, Hiroto Imachi, and Takuya Akiba. Variance-based gradient compression for efficient distributed deep learning. arXiv preprint arXiv:1802.06058, 2018.

[Wang and Joshi, 2018] Jianyu Wang and Gauri Joshi. Cooperative SGD: A unified framework for the design and analysis of communication-efficient SGD algorithms. arXiv preprint arXiv:1808.07576, 2018.

[Zhao et al., 2018] Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas Chandra. Federated learning with non-IID data. arXiv preprint arXiv:1806.00582, 2018.

[Zhou and Cong, 2017] Fan Zhou and Guojing Cong. On the convergence properties of a k-step averaging stochastic gradient descent algorithm for nonconvex optimization. arXiv preprint arXiv:1708.01012, 2017.

[Zhu et al., 2020] Wennan Zhu, Peter Kairouz, Brendan McMahan, Haicheng Sun, and Wei Li. Federated heavy hitters discovery with differential privacy. In International Conference on Artificial Intelligence and Statistics, pages 3837-3847. PMLR, 2020.