Impact of Barren Plateaus on the Hessian and Higher Order Derivatives

M. Cerezo¹,² and Patrick J. Coles¹
¹ Theoretical Division, Los Alamos National Laboratory, Los Alamos, NM 87545, USA
² Center for Nonlinear Studies, Los Alamos National Laboratory, Los Alamos, NM, USA

arXiv:2008.07454v1 [quant-ph], 17 Aug 2020

The Barren Plateau (BP) phenomenon is an issue for certain quantum neural networks and variational quantum algorithms, whereby the gradient vanishes exponentially in the system size n. The question of whether high-order derivative information such as the Hessian could help escape a BP was recently posed in the literature. Here we show that the elements of the Hessian are exponentially suppressed in a BP, so estimating the Hessian in this situation would require a precision that scales exponentially with n. Hence, Hessian-based approaches do not circumvent the exponential scaling associated with BPs. We also show the exponential suppression of higher order derivatives. Hence, BPs will impact optimization strategies that go beyond (first-order) gradient descent.

I. Introduction

Training parameterized quantum circuits is a promising approach for quantum computing in the Noisy Intermediate-Scale Quantum (NISQ) era. This strategy is employed in two closely related paradigms: Variational Quantum Algorithms (VQAs) for chemistry, optimization, and other applications [1-12], and Quantum Neural Networks (QNNs) for classification applications [13-16]. In both cases, one utilizes a quantum computer to efficiently evaluate a cost (or loss) function C(θ) or its gradient ∇C(θ), while employing a classical optimizer to train the parameters θ of a parameterized quantum circuit V(θ). While many novel VQAs and QNNs have been developed, more rigorous scaling analysis is needed for these architectures.

One of the few known results is the so-called barren plateau phenomenon [17-20], where the cost function gradient vanishes exponentially with the system size. This can arise due to deep unstructured ansatzes [17, 19], global cost functions [18, 19], or noise [20]. Regardless of the origin, when a cost landscape exhibits a barren plateau, one requires exponential precision to determine a minimizing direction in order to navigate the landscape. Since the standard goal of quantum algorithms is polynomial scaling with the system size (in contrast to the exponential scaling of classical algorithms), the exponential scaling due to barren plateaus can destroy quantum speedup. Hence, the study and analysis of barren plateaus should be viewed as a fundamental step in the development of VQAs and QNNs to guarantee that they can, in fact, provide a speedup over classical algorithms.

Recently there have been multiple strategies proposed for avoiding barren plateaus, such as employing local cost functions [18], pre-training [21], parameter correlation [22], layer-by-layer training [23], and initializing layers to the identity [24]. These strategies are aimed at either avoiding or preventing the existence of a barren plateau, and they appear to be promising, with more research needed on their efficacy on general classes of problems. In a recent manuscript [25], an alternative idea was proposed involving a method for actually training inside and escaping a barren plateau. Specifically, the proposal was to compute the Hessian H of the cost function, and the claim was that taking a learning rate proportional to the inverse of the largest eigenvalue of the Hessian leads to an optimization method that could escape the barren plateau.

The question of whether higher-order derivative information (beyond the first-order gradient) is useful for escaping a barren plateau is interesting and is the subject of our work here. Our main results are presented in the form of two propositions and corollaries. First, we show that the matrix elements H_ij of the Hessian are exponentially vanishing when the cost exhibits a barren plateau. This implies that the calculation of H_ij requires exponential precision. In our second result we show that the magnitude of any higher-order partial derivative of the cost will also be exponentially small in a barren plateau. Our results suggest that optimization methods that use higher-order derivative information, such as the Hessian, will also face exponential scaling, and hence do not circumvent the scaling issues arising from barren plateaus.

II. Preliminaries

In what follows, we consider the case when the cost can be expressed as a sum of expectation values:

    C(θ) = Σ_{x=1}^{N} C_x ,  with  C_x = Tr[O_x V(θ) ρ_x V†(θ)] ,    (1)

where {ρ_x} is a set (of size N) of input states to the parameterized circuit V(θ). In order for this cost to be efficiently computable, the number of states in the input set should grow at most polynomially with the number of qubits n, that is, N ∈ O(poly(n)). In the context of QNNs, the states {ρ_x} can be viewed as training data points, and hence (1) is a natural cost function for QNNs. In the context of VQAs, one typically chooses N = 1, corresponding to a single input state. In this sense, the cost function in (1) is general enough to be relevant to both QNNs and VQAs.
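To make the structure of (1) concrete, the following sketch (ours, not from the paper) evaluates a cost of this form for a hypothetical two-qubit ansatz in plain NumPy; the circuit, observable, and input state are arbitrary illustrative choices, with N = 1 as in the typical VQA setting.

    import numpy as np

    # Pauli matrices and a CNOT; the generators sigma have eigenvalues +-1,
    # as assumed later for the parameter shift rule.
    I2 = np.eye(2)
    X = np.array([[0., 1.], [1., 0.]])
    Z = np.diag([1., -1.])
    CNOT = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]], dtype=complex)

    def rot(sigma, angle):
        """Single-qubit rotation exp(-i*angle*sigma/2)."""
        return np.cos(angle / 2) * I2 - 1j * np.sin(angle / 2) * sigma

    def V(theta):
        """Toy ansatz: RX on each qubit, a CNOT, then RZ on each qubit."""
        layer1 = np.kron(rot(X, theta[0]), rot(X, theta[1]))
        layer2 = np.kron(rot(Z, theta[2]), rot(Z, theta[3]))
        return layer2 @ CNOT @ layer1

    rho = np.zeros((4, 4), dtype=complex)
    rho[0, 0] = 1.0                      # input state |00><00|  (N = 1)
    O = np.kron(Z, Z)                    # measured observable

    def cost(theta):
        """C(theta) = Tr[O V(theta) rho V(theta)^dagger], i.e. Eq. (1) with N = 1."""
        Vt = V(theta)
        return float(np.real(np.trace(O @ Vt @ rho @ Vt.conj().T)))

    theta = np.array([0.3, -1.2, 0.7, 0.1])
    print(cost(theta))

The same cost function and parameter vector theta are reused in the later sketches.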
Let θ_i be an angle which parameterizes a unitary in V(θ) as e^{−iθ_i σ_i/2}, with σ_i a Hermitian operator with eigenvalues ±1. Then, the partial derivative ∂C(θ)/∂θ_i = ∂_i C(θ) can be computed via the parameter shift rule [26, 27] as

    ∂_i C(θ) = (1/2) [ C(θ_ī, θ_i^{(1/2)}) − C(θ_ī, θ_i^{(−1/2)}) ] ,    (2)

where θ_ī denotes the vector of all parameters other than θ_i, and where we define

    θ_i^{(δ)} = θ_i + δπ .    (3)

Note that the parameter shift rule in (2) allows one to exactly write the first-order partial derivative as a difference of cost function values evaluated at two different points.
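As a quick numerical sanity check of (2)-(3) (a sketch of ours, reusing the toy cost and theta defined above), the shift rule agrees with a central finite difference up to floating-point and truncation error:

    def partial_shift(cost, theta, i):
        """Exact partial derivative via the parameter shift rule, Eq. (2):
        evaluate the cost with theta_i shifted by +pi/2 and -pi/2 (delta = +-1/2 in Eq. (3))."""
        tp, tm = theta.copy(), theta.copy()
        tp[i] += np.pi / 2
        tm[i] -= np.pi / 2
        return 0.5 * (cost(tp) - cost(tm))

    def partial_fd(cost, theta, i, eps=1e-6):
        """Central finite difference, shown only for comparison."""
        tp, tm = theta.copy(), theta.copy()
        tp[i] += eps
        tm[i] -= eps
        return (cost(tp) - cost(tm)) / (2 * eps)

    for i in range(len(theta)):
        print(i, partial_shift(cost, theta, i), partial_fd(cost, theta, i))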
III. Hessian matrix elements

Let us now state our results for the Hessian. The Hessian H of the cost function is a square matrix whose matrix elements are the second derivatives of C(θ), i.e.,

    H_ij = ∂²C(θ)/∂θ_i∂θ_j = ∂_i ∂_j C(θ) .    (4)

Reference [25] noted that the matrix elements of the Hessian can be written according to the parameter shift rule. Namely, one can first write

    H_ij = (1/2) [ ∂_i C(θ_j̄, θ_j^{(1/2)}) − ∂_i C(θ_j̄, θ_j^{(−1/2)}) ]    (5)

and then apply the parameter shift rule a second time:

    H_ij = (1/4) [ C(θ_ij̄, θ_i^{(1/2)}, θ_j^{(1/2)}) + C(θ_ij̄, θ_i^{(−1/2)}, θ_j^{(−1/2)})
                 − C(θ_ij̄, θ_i^{(1/2)}, θ_j^{(−1/2)}) − C(θ_ij̄, θ_i^{(−1/2)}, θ_j^{(1/2)}) ] ,    (6)

where θ_ij̄ collects all parameters other than θ_i and θ_j. Now, the second derivatives of the cost can be expressed as a sum of cost functions being evaluated at (up to) four points.
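A direct implementation of the four-point formula (6) for the toy cost above (our sketch; note that the i = j case correctly reduces to a second-derivative rule, since the two shifts then compound on the same angle):

    def hessian_element(cost, theta, i, j):
        """H_ij via the four-point parameter shift formula, Eq. (6)."""
        val = 0.0
        for s_i, s_j, sign in [(+1, +1, +1), (-1, -1, +1), (+1, -1, -1), (-1, +1, -1)]:
            t = theta.copy()
            t[i] += s_i * np.pi / 2
            t[j] += s_j * np.pi / 2      # if i == j the two shifts add on the same angle
            val += sign * cost(t)
        return val / 4.0

    H = np.array([[hessian_element(cost, theta, i, j)
                   for j in range(len(theta))] for i in range(len(theta))])
    print(np.round(H, 6))                # symmetric, consistent with Clairaut's theorem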
From the parameter shift rule we can then derive the following bound on the probability that the magnitude of the matrix elements |H_ij| is larger than a given c > 0.

Proposition 1. Consider a cost function of the form (1), for which the parameter shift rule of (2) holds. Let H_ij be the matrix elements of the Hessian as defined in (5). Then, assuming that ⟨∂_i C⟩_θ = 0, the following inequality holds for any c > 0:

    Pr(|H_ij| > c) ≤ 2 Var_θ[∂_i C] / c² .    (7)

Here Var_θ[∂_i C] = ⟨(∂_i C)²⟩_θ − ⟨∂_i C⟩_θ², where the expectation values are taken over θ.

Proof. Equation (5) implies that the magnitudes of the Hessian matrix elements are bounded as

    |H_ij| ≤ (1/2) ( |∂_i C(θ_j̄, θ_j^{(1/2)})| + |∂_i C(θ_j̄, θ_j^{(−1/2)})| ) .    (8)

From Chebyshev's inequality we can bound the probability that the cost derivative deviates from its mean value (of zero) as

    Pr(|∂_i C| > c) ≤ Var_θ[∂_i C] / c² ,    (9)

for all c > 0 and for all i. Then, let E_± be defined as the event that |∂_i C(θ_±)| > c, where θ_± = (θ_j̄, θ_j^{(±1/2)}). Note that the set of events where |H_ij| > c is a subset of the set E_+ ∪ E_−. Then, from the union bound and Eq. (9) we can recover (7) as follows:

    Pr(|H_ij| > c) ≤ Pr(E_+ ∪ E_−)                                   (10)
                   ≤ Pr(E_+) + Pr(E_−)                               (11)
                   ≤ Var_{θ_+}[∂_i C]/c² + Var_{θ_−}[∂_i C]/c²       (12)
                   = 2 Var_θ[∂_i C]/c² ,                             (13)

where we used the fact that ⟨·⟩_θ = ⟨·⟩_{θ_±}.

When the cost exhibits a barren plateau as in [17-19], the variance of the cost function partial derivative Var_θ[∂_i C] is exponentially vanishing as

    Var_θ[∂_i C] ≤ F(n) ,  with  F(n) ∈ O(1/b^n) ,    (14)

for some b > 1. Then, the following corollary holds.

Corollary 1. Consider the bound in Eq. (7) of Proposition 1. If the cost exhibits a barren plateau, such that (14) holds, then the matrix elements of the Hessian are exponentially vanishing, since

    Pr(|H_ij| > c) ≤ 2 F(n) / c² ,    (15)

where F(n) ∈ O(1/b^n) for some b > 1.

The proof follows by combining (7) and (14). Corollary 1 shows that when the cost landscape exhibits a barren plateau, the matrix elements of the Hessian are exponentially vanishing with high probability. This implies that any algorithm that requires the estimation of the Hessian will require a precision that grows exponentially with the system size.
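To give a rough sense of this precision requirement (a back-of-the-envelope sketch of ours, not a calculation from the paper), assume F(n) = 1/b^n with b = 2 and the standard shot-noise behavior in which the statistical error of an expectation-value estimate scales as one over the square root of the number of measurement shots. Resolving a Hessian element whose typical magnitude is of order sqrt(F(n)) then requires a number of shots growing roughly like b^n:

    import math

    b = 2.0                                     # assumed base of the exponential suppression
    for n in (10, 20, 40, 80):                  # system sizes (number of qubits)
        typical_Hij = math.sqrt(1.0 / b**n)     # typical scale ~ sqrt(Var) of a Hessian element
        shots_needed = 1.0 / typical_Hij**2     # shot-noise error ~ 1/sqrt(shots) must be below it
        print(f"n = {n:3d}:  |H_ij| ~ {typical_Hij:.1e},  shots needed ~ {shots_needed:.1e}")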
IV. Higher order partial derivatives

Let us now analyze the magnitude of higher order partial derivatives in a barren plateau. We use the following notation for the |α|th-order derivative:

    D^α C(θ) = ∂_{α_1} ∂_{α_2} ··· ∂_{α_|α|} C(θ) ,    (16)

where α is an |α|-tuple. Since one can take the derivative with respect to the same angle multiple times, we define the set Θ (of size M = |Θ|) as the set of distinct angles with respect to which we take the partial derivative. Similarly, let Θ̄ be the complement of Θ, so that Θ ∪ Θ̄ = θ. Then, for any Θ_k ∈ Θ we define N_k as the multiplicity of Θ_k in α, such that Σ_{k=1}^{M} N_k = |α|. Since the cost function and any of its higher order partial derivatives are continuous functions of the parameters (as can be seen below via multiple applications of the parameter shift rule), one can extend Clairaut's Theorem [28] to rewrite

    D^α C(θ) = ∂_{Θ_1}^{N_1} ··· ∂_{Θ_M}^{N_M} C(θ) .    (17)

Then, applying the parameter shift rule |α| times, we find that the |α|th-order partial derivative can be expressed as a summation of cost functions evaluated at (up to) 2^{|α|} points:

    D^α C(θ) = (1/2^{|α|}) Σ_ω W_ω C(Θ̄, Θ^{(ω)}) .    (18)

Here we defined Θ^{(ω)} = (Θ_1^{(ω_1)}, ..., Θ_M^{(ω_M)}), with Θ_k^{(ω_k)} = Θ_k + ω_k π defined analogously to (3), and where

    W_ω = Π_{l=1}^{M} d_{(ω_l, N_l)} ,  such that  Σ_ω |W_ω| = 2^{|α|} .    (19)

Also, ω = (ω_1, ..., ω_M), where ω_l ∈ {0, ±1} if N_l is even, and ω_l ∈ {±1/2, ±3/2} if N_l is odd. Additionally, the coefficients d_{(ω_l, N_l)} can be obtained from the Pascal tree which we introduce in Fig. 1. In the Appendix we provide additional details regarding the coefficients d_{(ω_l, N_l)} and the Pascal tree.

FIG. 1. The Pascal tree. a) The Pascal tree can be obtained by modifying how a Pascal triangle is constructed. In a Pascal triangle each entry of a row is obtained by adding together the numbers directly above to the left and above to the right, with blank entries considered to be equal to zero. The entries of a Pascal tree are obtained following the aforementioned rule, with the additional constraint that the width of the triangle is restricted to always being smaller than a given even number. Moreover, once an entry in a row is outside the maximum width, its value is added to the central entry in that row (see arrows). Here the maximum width is four. b) The coefficients d_{(ω_l, N_l)} in (19) can be obtained from the Pascal tree of (a) by adding signs to the entries of the tree. As schematically depicted, all entries in a diagonal going from top left to bottom right have the same sign, with the first entry in the first row having a positive sign. Here, each row corresponds to a given N_l, while entries in a row correspond to different values of ω_l, with ω_l ∈ {0, ±1} if N_l is even, and ω_l ∈ {±1/2, ±3/2} if N_l is odd. For instance, d_{(−1/2, 5)} = −12.

From (18) we obtain that the (|α|+1)th-order derivative, which we denote as ∂_i D^α C(θ) = D^{i,α} C(θ), is obtained as the sum of (up to) 2^{|α|} partial derivatives:

    D^{i,α} C(θ) = (1/2^{|α|}) Σ_ω W_ω ∂_i C(Θ̄, Θ^{(ω)}) .    (20)

Since one has to individually evaluate each term in (20), and since there are up to 2^{|α|} terms, we henceforth assume that |α| ∈ O(log(n)). This guarantees that the computation of D^{i,α} C(θ) leads to an overhead which is (at most) O(poly(n)).
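The signed Pascal-tree coefficients can be generated directly by iterating the parameter shift rule, which is what the following sketch (ours) does: each row N is obtained from row N−1 via d_{(ω, N)} = d_{(ω−1/2, N−1)} − d_{(ω+1/2, N−1)}, with entries that would fall at ω = ±2 folded onto ω = 0 because θ^{(2)} ≡ θ. The sum of absolute values in row N is 2^N, consistent with (19), and the entry d_{(−1/2,5)} = −12 matches the Fig. 1 caption.

    from fractions import Fraction as Frac

    def pascal_tree(n_rows):
        """Rows of signed coefficients d_(omega, N), N = 0 .. n_rows, obtained by
        iterating the parameter shift rule; omega = +-2 is folded onto omega = 0."""
        half = Frac(1, 2)
        rows = [{Frac(0): 1}]                     # N = 0: the cost function itself
        for _ in range(n_rows):
            prev, new = rows[-1], {}
            for w, d in prev.items():
                for shift, sign in ((+half, +1), (-half, -1)):
                    wp = w + shift
                    if abs(wp) > Frac(3, 2):      # theta + 2*pi is the same point as theta
                        wp = Frac(0)
                    new[wp] = new.get(wp, 0) + sign * d
            rows.append(new)
        return rows

    rows = pascal_tree(6)
    for N, row in enumerate(rows):
        assert sum(abs(d) for d in row.values()) == 2**N    # consistent with Eq. (19)
        print(N, sorted(row.items()))
    print(rows[5][Frac(-1, 2)])                              # -12, as in the Fig. 1 caption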
The following proposition, which generalizes Proposition 1, allows us to bound the probability that the magnitude of D^{i,α} C(θ) is larger than a given c > 0.

Proposition 2. Consider a cost function of the form (1), for which the parameter shift rule of (2) holds. Let D^{i,α} C(θ) be a higher order partial derivative of the cost as defined in (16). Then, assuming that ⟨∂_i C⟩_θ = 0, the following inequality holds for any c > 0:

    Pr(|D^{i,α} C(θ)| > c) ≤ 2^{|α|} Var_θ[∂_i C] / c² .    (21)

Proof. From Eq. (20) we can obtain the following bound:

    |D^{i,α} C(θ)| ≤ (1/2^{|α|}) Σ_ω |W_ω| |∂_i C(Θ̄, Θ^{(ω)})| .    (22)

Let us define E_ω as the event that |∂_i C(Θ̄, Θ^{(ω)})| > c. Since (22) holds, the following chain of inequalities holds:

    Pr(|D^{i,α} C(θ)| > c) ≤ Pr(∪_ω E_ω)                                (23)
                           ≤ Σ_ω Pr(E_ω)                                (24)
                           ≤ Σ_ω Var_{(Θ̄,Θ^{(ω)})}[∂_i C] / c²          (25)
                           ≤ 2^{|α|} Var_θ[∂_i C] / c² ,                (26)

where we invoked the union bound, and where we recall that ⟨·⟩_θ = ⟨·⟩_{(Θ̄,Θ^{(ω)})} for all ω. In addition, for (26) we used the fact that the summation in (25) has at most 2^{|α|} terms.
If the cost function exhibits a barren plateau, the following corollary then follows.

Corollary 2. Consider the bound in Eq. (21) of Proposition 2. If the cost exhibits a barren plateau, such that (14) holds, then higher order partial derivatives of the cost function are exponentially vanishing, since

    Pr(|D^{i,α} C(θ)| > c) ≤ G(n) / c² ,    (27)

where G(n) ∈ O(1/q^n) for some q > 1.

Proof. Combining (14) and (21) leads to

    Pr(|D^{i,α} C(θ)| > c) ≤ 2^{|α|} F(n) / c² .    (28)

Then, let us define G(n) = 2^{|α|} F(n). Since |α| ∈ O(log(n)) and F(n) ∈ O(1/b^n), we know that there exist κ, κ′, and n₀ such that for all n > n₀ we respectively have 2^{|α|} ≤ n^κ and F(n) ≤ κ′/b^n. Combining these two results we find

    G(n) ≤ κ′ n^κ / b^n = κ′ / b^{L(n)} ,  for all n > n₀ ,    (29)

where L(n) = n − κ log_b(n). Equation (29) shows that G(n) ∈ O(1/b^{L(n)}). Then, since

    lim_{n→∞} L(n)/n = 1 ,    (30)

we have L(n) ∈ Ω(n), meaning that there exist κ̂ > 0 and n̂₀ such that for all n > n̂₀ we have L(n) ≥ κ̂ n. The latter implies G(n) ≤ κ′/b^{κ̂ n} for all n > max{n₀, n̂₀}, which means that G(n) ∈ O(1/q^n), where q = b^{κ̂}. Also, q > 1 follows from b > 1 and κ̂ > 0.

Corollary 2 shows that, in a barren plateau, the magnitude of any efficiently computable higher order partial derivative (i.e., any partial derivative where |α| ∈ O(log(n))) is exponentially vanishing in n with high probability.

V. Discussion

In this work, we investigated the impact of barren plateaus on higher order derivatives. This issue was important in light of a recent proposal to use higher order derivative information to escape a barren plateau. We considered a cost function C that is relevant to both VQAs and QNNs, as barren plateaus are relevant to both of these applications.

Our main result was that, when a barren plateau exists, the Hessian and other higher order partial derivatives of C are exponentially vanishing in n with high probability. Our proof relied on the parameter shift rule, which we showed can be applied iteratively to relate higher order partial derivatives to the first order partial derivative (analogous to what Ref. [25] did for the Hessian). Hence, the parameter shift rule allowed us to state the vanishing of higher order derivatives as essentially a corollary of the vanishing of the first order derivative. We remark that iterative applications of the parameter shift rule led us to a mathematically interesting construct that we called the Pascal tree, depicted in Fig. 1.

Our results imply that estimating higher order partial derivatives in a barren plateau is exponentially hard. Hence, any optimization strategy that requires information about partial derivatives beyond first order (such as the Hessian) will require a precision that grows exponentially with n. We therefore surmise that, by themselves, optimizers that go beyond first-order gradient descent do not appear to be a feasible solution to the barren plateau problem. More generally, our results suggest that it is better to develop strategies that avoid the appearance of the barren plateau altogether, rather than to try to escape an existing barren plateau.

Acknowledgements.—We thank Kunal Sharma for helpful discussions. Research presented in this article was supported by the Laboratory Directed Research and Development program of Los Alamos National Laboratory under project number 20180628ECR. MC was also supported by the Center for Nonlinear Studies at LANL. PJC also acknowledges support from the LANL ASC Beyond Moore's Law project. This work was also supported by the U.S. Department of Energy (DOE), Office of Science, Office of Advanced Scientific Computing Research, under the Accelerated Research in Quantum Computing (ARQC) program.

[1] A. Peruzzo, J. McClean, P. Shadbolt, M.-H. Yung, X.-Q. Zhou, P. J. Love, A. Aspuru-Guzik, and J. L. O'Brien, "A variational eigenvalue solver on a photonic quantum processor," Nature Communications 5, 4213 (2014).
[2] Jarrod R. McClean, Jonathan Romero, Ryan Babbush, and Alán Aspuru-Guzik, "The theory of variational hybrid quantum-classical algorithms," New Journal of Physics 18, 023023 (2016).
[3] E. Farhi, J. Goldstone, and S. Gutmann, "A quantum approximate optimization algorithm," arXiv:1411.4028 [quant-ph].
[4] J. Romero, J. P. Olson, and A. Aspuru-Guzik, "Quantum autoencoders for efficient compression of quantum data," Quantum Science and Technology 2, 045001 (2017).
[5] S. Khatri, R. LaRose, A. Poremba, L. Cincio, A. T. Sornborger, and P. J. Coles, "Quantum-assisted quantum compiling," Quantum 3, 140 (2019).
[6] R. LaRose, A. Tikku, É. O'Neel-Judy, L. Cincio, and P. J. Coles, "Variational quantum state diagonalization," npj Quantum Information 5, 1-10 (2018).
[7] A. Arrasmith, L. Cincio, A. T. Sornborger, W. H. Zurek, and P. J. Coles, "Variational consistent histories as a hybrid algorithm for quantum foundations," Nature Communications 10, 3438 (2019).
[8] Marco Cerezo, Alexander Poremba, Lukasz Cincio, and Patrick J. Coles, "Variational quantum fidelity estimation," Quantum 4, 248 (2020).
[9] Cristina Cirstoiu, Zoe Holmes, Joseph Iosue, Lukasz Cincio, Patrick J. Coles, and Andrew Sornborger, "Variational fast forwarding for quantum simulation beyond the coherence time," arXiv:1910.04292 (2019).
[10] Carlos Bravo-Prieto, Ryan LaRose, M. Cerezo, Yigit Subasi, Lukasz Cincio, and Patrick J. Coles, "Variational quantum linear solver: A hybrid algorithm for linear systems," arXiv:1909.05820 (2019).
[11] Xiaosi Xu, Jinzhao Sun, Suguru Endo, Ying Li, Simon C. Benjamin, and Xiao Yuan, "Variational algorithms for linear algebra," arXiv:1909.03898 (2019).
[12] M. Cerezo, Kunal Sharma, Andrew Arrasmith, and Patrick J. Coles, "Variational quantum state eigensolver," arXiv:2004.01372 (2020).
[13] Maria Schuld, Ilya Sinayskiy, and Francesco Petruccione, "The quest for a quantum neural network," Quantum Information Processing 13, 2567-2586 (2014).
[14] Iris Cong, Soonwon Choi, and Mikhail D. Lukin, "Quantum convolutional neural networks," Nature Physics 15, 1273-1278 (2019).
[15] Kerstin Beer, Dmytro Bondarenko, Terry Farrelly, Tobias J. Osborne, Robert Salzmann, Daniel Scheiermann, and Ramona Wolf, "Training deep quantum neural networks," Nature Communications 11, 1-6 (2020).
[16] Guillaume Verdon, Jason Pye, and Michael Broughton, "A universal training algorithm for quantum deep learning," arXiv:1806.09729 (2018).
[17] Jarrod R. McClean, Sergio Boixo, Vadim N. Smelyanskiy, Ryan Babbush, and Hartmut Neven, "Barren plateaus in quantum neural network training landscapes," Nature Communications 9, 4812 (2018).
[18] M. Cerezo, Akira Sone, Tyler Volkoff, Lukasz Cincio, and Patrick J. Coles, "Cost-function-dependent barren plateaus in shallow quantum neural networks," arXiv:2001.00550 (2020).
[19] Kunal Sharma, M. Cerezo, Lukasz Cincio, and Patrick J. Coles, "Trainability of dissipative perceptron-based quantum neural networks," arXiv:2005.12458 (2020).
[20] Samson Wang, Enrico Fontana, M. Cerezo, Kunal Sharma, Akira Sone, Lukasz Cincio, and Patrick J. Coles, "Noise-induced barren plateaus in variational quantum algorithms," arXiv:2007.14384 (2020).
[21] Guillaume Verdon, Michael Broughton, Jarrod R. McClean, Kevin J. Sung, Ryan Babbush, Zhang Jiang, Hartmut Neven, and Masoud Mohseni, "Learning to learn with quantum neural networks via classical neural networks," arXiv:1907.05415 (2019).
[22] Tyler Volkoff and Patrick J. Coles, "Large gradients via correlation in random parameterized quantum circuits," arXiv:2005.12200 (2020).
[23] Andrea Skolik, Jarrod R. McClean, Masoud Mohseni, Patrick van der Smagt, and Martin Leib, "Layerwise learning for quantum neural networks," arXiv:2006.14904 (2020).
[24] Edward Grant, Leonard Wossnig, Mateusz Ostaszewski, and Marcello Benedetti, "An initialization strategy for addressing barren plateaus in parametrized quantum circuits," Quantum 3, 214 (2019).
[25] Patrick Huembeli and Alexandre Dauphin, "Characterizing the loss landscape of variational quantum circuits," arXiv:2008.02785 (2020).
[26] K. Mitarai, M. Negoro, M. Kitagawa, and K. Fujii, "Quantum circuit learning," Physical Review A 98, 032309 (2018).
[27] Maria Schuld, Ville Bergholm, Christian Gogolin, Josh Izaac, and Nathan Killoran, "Evaluating analytic gradients on quantum hardware," Physical Review A 99, 032331 (2019).
[28] James Stewart, Multivariable Calculus (Nelson Education, 2015).

Appendix A: Explicit description of d_{(ω_l, N_k)}

In this appendix we first discuss how the parameter shift rule leads to the Pascal tree. Then, we provide analytical formulas for d_{(ω_l, N_k)}.

Let us consider the first and second order partial derivatives of the cost function with respect to the same angle. From the parameter shift rule of Eq. (2) we find

    ∂_i C(θ) = (1/2) [ C(θ_ī, θ_i^{(1/2)}) − C(θ_ī, θ_i^{(−1/2)}) ] ,

where the two terms carry the coefficients d_{(1/2,1)} = 1 and d_{(−1/2,1)} = −1, respectively, and

    ∂_i² C(θ) = (1/4) [ C(θ_ī, θ_i^{(1)}) + C(θ_ī, θ_i^{(−1)}) − 2 C(θ) ] ,

where the coefficients are d_{(1,2)} = 1, d_{(−1,2)} = 1, and d_{(0,2)} = −2. Here we can see that |d_{(0,2)}| = |d_{(−1/2,1)}| + |d_{(1/2,1)}| = 2. Similarly, if we were to take the third partial derivative with respect to i, we would find |d_{(±1/2,3)}| = |d_{(0,2)}| + |d_{(±1,2)}| and |d_{(±3/2,3)}| = |d_{(±1,2)}|. Note that this procedure forms the first four rows of the Pascal tree, which actually coincide with the first four rows of the Pascal triangle. When taking the fourth partial derivative we have to take into account the fact that C(θ_ī, θ_i^{(−2)}) = C(θ_ī, θ_i^{(2)}) = C(θ), since e^{−iθσ/2} is equal to e^{−i(θ+2π)σ/2} up to an unobservable global phase. Hence, the fact that θ ≡ θ^{(2)} (mod 2π) imposes a restriction on the width of the Pascal tree. Following this procedure, one can recover the entries in Fig. 1.

For arbitrary ω_l and N_k, the coefficients d_{(ω_l, N_k)} can be analytically obtained as follows. If N_k < 2 we have d_{(±1,0)} = 0, d_{(0,0)} = 1, d_{(±1/2,1)} = ±1, and d_{(±3/2,1)} = 0. Then, for N_k ≥ 2,

    d_{(ω_l, N_k)} =  (−1)^{N_k/2} 2^{N_k−1}             if ω_l = 0 ,
                      ±(−1)^{(N_k−1)/2} 3·2^{N_k−3}      if ω_l = ±1/2 ,
                      (−1)^{(N_k−2)/2} 2^{N_k−2}         if ω_l = ±1 ,
                      ∓(−1)^{(N_k−1)/2} 2^{N_k−3}        if ω_l = ±3/2 .

Note that for all N_k we have Σ_{ω_l} |d_{(ω_l, N_k)}| = 2^{N_k}.
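As a cross-check of these closed-form expressions (our sketch, reusing the pascal_tree generator and the Frac alias defined in the Sec. IV sketch above), the analytic values can be compared against the recursively constructed tree for the first several rows:

    def d_closed_form(w, N):
        """Closed-form d_(omega, N) from Appendix A, for N >= 2 (w given as a Fraction)."""
        if w == 0:
            return (-1)**(N // 2) * 2**(N - 1)
        if abs(w) == Frac(1, 2):
            return (1 if w > 0 else -1) * (-1)**((N - 1) // 2) * 3 * 2**(N - 3)
        if abs(w) == 1:
            return (-1)**((N - 2) // 2) * 2**(N - 2)
        if abs(w) == Frac(3, 2):
            return (-1 if w > 0 else 1) * (-1)**((N - 1) // 2) * 2**(N - 3)
        raise ValueError("omega outside the Pascal-tree window")

    rows = pascal_tree(8)
    for N in range(2, 9):
        for w, d in rows[N].items():
            assert d == d_closed_form(w, N), (N, w, d)
    print("closed form matches the recursive Pascal tree up to N = 8")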